How to Optimize Compression on Snowflake, Redshift, and BigQuery

Columnar Compression Explained and Deconstructed

Angel D'az
Oct 20, 2020

I recently wrote a letter about careful Data Manipulation Language (DML) and thoughtful Data Definition Language (DDL) usage, so that Data Engineers can build systems with minimal computational and storage bloat. This letter expands on the DDL of columns for columnar databases.

Columnar databases are often used in Data Warehouses because they work well with analytical queries. Columnar storage means storing values from the same column physically close to each other. When values are stored close together, a database can generate metrics from that column’s data faster.

Columnar Storage

Much like penguins generate and preserve heat when they huddle close together, computers have an easier time collecting bits and bytes when they’re close to each other in storage.

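As a toy illustration (the table and layout here are my own example, not any vendor’s storage engine), the same data can be laid out row-wise or column-wise; summing one column in the columnar layout only touches a single contiguous list:

    # Row layout: each record keeps all of its fields together.
    rows = [
        {"id": 1, "amount": 9.99},
        {"id": 2, "amount": 4.50},
        {"id": 3, "amount": 12.00},
    ]

    # Columnar layout: one list per column, values stored side by side.
    columns = {
        "id": [1, 2, 3],
        "amount": [9.99, 4.50, 12.00],
    }

    # A metric over one column touches every record in the row layout,
    # but only one contiguous list in the columnar layout.
    assert sum(row["amount"] for row in rows) == sum(columns["amount"])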

Furthermore, databases like Snowflake, Redshift, BigQuery, and MonetDB all provide their own syntax for tuning your columnar compression.

You can set bit and character widths for your columns and implement a lot of the concepts below. Not every concept maps to a syntax keyword, but it is still important to understand the internals to build high-quality systems.


Column Compression Encoding

  1. Run Length Encoding (RLE)

    RLE compresses runs of the same value in a column into a single representation. Instead of storing the same value n times, you store it once along with the number of times it repeats before the next value. This is useful if you have a small number of distinct values in your column! (A Python sketch of RLE follows this list.)

    raw:
    1 1 1 2 2 2 2 3
    compressed:
    1 : 3 times
    2 : 4 times
    3 : 1 time
  2. Bit-Vector Encoding

    Bit-vector encoding is most useful when the domain of possible values is very small. In this method, you store one bit-string per distinct value, each as long as your column, using ones and zeros to mark where that value appears. (See the bit-vector sketch after this list.)

    So if you have a domain, or set, of two values in a column with many elements like so: 1 1 3 1 3 1, you can represent your two values in two lists. One list represents when the first value, 1, shows up: 1 1 0 1 0 1. The second list represents when the second value, 3, shows up: 0 0 1 0 1 0.

    raw: 1 1 3 1 3 1
    compressed: 
    1 : 1 1 0 1 0 1
    3 : 0 0 1 0 1 0
  3. Dictionary

    Dictionary encoding is a very common method because it works with string data types. With dictionary encoding, your column’s distinct values are stored in a dictionary (or key-value store) sorted by frequency, and each value is replaced with an integer code that compresses well. (See the dictionary sketch after this list.)

    Dictionary encoding sacrifices write speed in favor of read speed. Sorting and integer translation are computationally expensive on the first write, but they speed up future reads. This is particularly useful in analytical queries that contain strings like addresses or names of things!

  4. Frame of Reference (FOR)

    FOR represents your column values as a constant base number plus a small offset for each value. Instead of storing 1001 1002 1004, you can store the base 1000 plus the offsets 1 2 4, which is fewer bytes for a database to read through. (See the FOR sketch after this list.)

    raw: 1001 1002 1004
    compressed:
    1000 : 1 2 4
  5. The Patching Technique

    If your domain of values becomes too large, or has outliers, an option is to compress only the most frequent values. This technique is an optional extension of Dictionary and FOR encoding. Patching is a two-step process: compress everything that fits, then keep the outliers as exceptions that are patched back in at decompression time, so that future queries stay fast. (See the patching sketch after this list.)
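
Here is a minimal Python sketch of Run Length Encoding (the rle_encode and rle_decode helpers are illustrative names of my own, not any vendor’s API):

    from itertools import groupby

    def rle_encode(values):
        # Collapse each run of repeats into a (value, run_length) pair.
        return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

    def rle_decode(pairs):
        # Expand each (value, run_length) pair back into the raw column.
        return [value for value, count in pairs for _ in range(count)]

    raw = [1, 1, 1, 2, 2, 2, 2, 3]
    assert rle_encode(raw) == [(1, 3), (2, 4), (3, 1)]
    assert rle_decode(rle_encode(raw)) == raw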
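
Bit-vector encoding can be sketched the same way, assuming one bit-string per distinct value in the domain:

    def bit_vector_encode(values):
        # One bit-string per distinct value; bit i is 1 where that value appears.
        return {v: [1 if x == v else 0 for x in values] for v in sorted(set(values))}

    raw = [1, 1, 3, 1, 3, 1]
    assert bit_vector_encode(raw) == {1: [1, 1, 0, 1, 0, 1],
                                      3: [0, 0, 1, 0, 1, 0]}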
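
A minimal sketch of dictionary encoding, assuming the dictionary is sorted by frequency so the most common values get the smallest integer codes:

    from collections import Counter

    def dictionary_encode(values):
        # Most frequent values first, so they receive the smallest codes.
        ordered = [value for value, _ in Counter(values).most_common()]
        codes = {value: i for i, value in enumerate(ordered)}
        return codes, [codes[value] for value in values]

    raw = ["london", "london", "paris", "london", "lima"]
    codes, encoded = dictionary_encode(raw)
    assert codes == {"london": 0, "paris": 1, "lima": 2}
    assert encoded == [0, 0, 1, 0, 2]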
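
A sketch of Frame of Reference encoding, reusing the example above with a base of 1000 (a real engine would also bit-pack the offsets; this only shows the base-plus-offset idea):

    def for_encode(values, base=None):
        # Store one constant base plus a small offset per value.
        if base is None:
            base = min(values)
        return base, [value - base for value in values]

    def for_decode(base, offsets):
        return [base + offset for offset in offsets]

    raw = [1001, 1002, 1004]
    base, offsets = for_encode(raw, base=1000)
    assert offsets == [1, 2, 4]
    assert for_decode(base, offsets) == raw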
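
And finally a patched-FOR sketch: offsets that fit the chosen width are stored compressed, while outliers are kept aside as (position, value) exceptions and patched back in at decode time. The 255 offset cap is an arbitrary assumption for the example:

    def pfor_encode(values, max_offset=255):
        base = min(values)
        offsets, exceptions = [], []
        for i, value in enumerate(values):
            offset = value - base
            if offset <= max_offset:
                offsets.append(offset)
            else:
                offsets.append(0)               # placeholder, patched on decode
                exceptions.append((i, value))   # outlier stored uncompressed
        return base, offsets, exceptions

    def pfor_decode(base, offsets, exceptions):
        values = [base + offset for offset in offsets]
        for i, value in exceptions:             # decompress, then patch
            values[i] = value
        return values

    raw = [1001, 1002, 1004, 9999]              # 9999 is the outlier
    assert pfor_decode(*pfor_encode(raw)) == raw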


Conclusion

The internal ideas are largely similar across all providers, so it is important to master the common fundamentals. However, syntax and tools are always in flux. To read about each provider’s compression, check out these links: Snowflake, Redshift, BigQuery, and MonetDB.

Thanks for reading!



Curated Content

  • Snowflake’s guide to semi-structured data: JSON, ORC, Parquet, XML

  • Data Discovery made simple with Rust and Markdown with Whale

  • What does Segment do? And how does it relate to Data Engineering as a profession?

  • How Spotify platforms their Data Engineering work (demo video)



About the Author and Newsletter

I automate data processes. I work mostly in Python, SQL, and bash. To schedule a consultancy appointment, contact me through ocelotdata.com.

At Modern Data Infrastructure, I democratize the knowledge it takes to understand an open-source, transparent, and reproducible Data Infrastructure. In my spare time, I collect Lenca pottery, walk my dog, and listen to music.

More at: What is Modern Data Infrastructure.
