How to Optimize Compression on Snowflake, Redshift, and BigQuery
Columnar Compression Explained and Deconstructed
I recently wrote a letter about careful Data Manipulation Language (DML) and thoughtful Data Definition Language (DDL) usage, so that Data Engineers can build systems with minimal compute and storage bloat. This letter expands on column DDL for columnar databases.
Columnar databases are often used in Data Warehouses because they suit analytical queries well. Columnar storage keeps data from the same column physically close together, and when data is stored closely, a database can scan that column and compute metrics from it faster.
Much like penguins huddle together to generate and preserve heat, computers have an easier time collecting bits and bytes when they sit close to each other in storage.
Furthermore, databases like Snowflake, Redshift, BigQuery, and MonetDB all provide their own syntax for tuning your columnar compression.
You can set bit and character widths for your columns and implement many of the concepts below. Not every concept maps to a syntax keyword, but understanding the internals is still important for building high-quality systems.
Column Compression Encoding
Run Length Encoding (RLE)
RLE compresses a run of repeated values in a column into a single representation. Instead of storing the same value n times, you store the value once along with the number of consecutive times it appears before the next value. This is useful when your column has a small number of distinct values!
raw: 1 1 1 2 2 2 2 3
compressed: 1 : 3 times, 2 : 4 times, 3 : 1 time
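A minimal Python sketch of RLE (the function names are my own, not any provider's API):

```python
def rle_encode(values):
    """Compress a sequence into (value, run_length) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            # Same value as the current run: extend the run length.
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            # New value: start a new run of length 1.
            encoded.append((v, 1))
    return encoded

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the raw sequence."""
    return [v for v, n in pairs for _ in range(n)]

# The example column above: 1 1 1 2 2 2 2 3
print(rle_encode([1, 1, 1, 2, 2, 2, 2, 3]))  # [(1, 3), (2, 4), (3, 1)]
```

Note that eight values compress to three pairs; the fewer distinct runs, the better the ratio.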
Bit-Vector Encoding
Bit-Vector encoding is most useful when your column's domain of possible values is very small. In this method, you store one bit-string per distinct value, each as long as your column, using ones and zeros to mark where that value appears.
So if you have a domain, or set, of two values in a column with many elements like so: 1 1 3 1 3 1, you can represent your two values in two lists. One list represents when the first value, 1, shows up: 1 1 0 1 0 1. The second list represents when the second value, 3, shows up: 0 0 1 0 1 0.
raw: 1 1 3 1 3 1
compressed: 1 : 1 1 0 1 0 1
compressed: 3 : 0 0 1 0 1 0
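Here is a small Python sketch of the idea (names are illustrative, not a real database's interface):

```python
def bitvector_encode(values):
    """Build one 0/1 vector per distinct value, each as long as the column."""
    vectors = {}
    for v in dict.fromkeys(values):  # distinct values, in first-seen order
        vectors[v] = [1 if x == v else 0 for x in values]
    return vectors

# The example column above: 1 1 3 1 3 1
print(bitvector_encode([1, 1, 3, 1, 3, 1]))
# {1: [1, 1, 0, 1, 0, 1], 3: [0, 0, 1, 0, 1, 0]}
```

Each bit-vector can be packed into actual machine bits, so two distinct values cost roughly two bits per row instead of a full integer per row.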
Dictionary Encoding
Dictionary encoding is a very common method because it works well with string data types. With dictionary encoding, your column's distinct values are stored in a dictionary (or key-value store), often sorted by frequency, and each value in the column is replaced by its integer code, which can then be compressed further with an integer compression scheme.
Dictionary encoding sacrifices write speed in favor of read speed. Ordering and integer translation are computationally expensive on the first write, but they speed up future reads. This is particularly useful in analytical queries over string columns like addresses or the names of things!
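A Python sketch of frequency-ordered dictionary encoding (assigning the smallest codes to the most frequent values is one common choice, not the only one):

```python
from collections import Counter

def dict_encode(values):
    """Map each distinct value to an integer code, most frequent first,
    so common values get the smallest integers."""
    by_freq = [v for v, _ in Counter(values).most_common()]
    dictionary = {v: code for code, v in enumerate(by_freq)}
    codes = [dictionary[v] for v in values]
    return dictionary, codes

dictionary, codes = dict_encode(["NY", "NY", "LA", "NY"])
print(dictionary)  # {'NY': 0, 'LA': 1}
print(codes)       # [0, 0, 1, 0]
```

The expensive part at write time is counting and ordering the distinct values; at read time, comparing small integers is far cheaper than comparing strings.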
Frame of Reference (FOR)
FOR represents each column value as a constant base plus a small offset. Instead of storing 1001 1002 1004, you store the base 1000 and the offsets 1 2 4, which take fewer bytes for a database to read through.
raw: 1001 1002 1004
compressed: 1000 : 1 2 4
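A Python sketch of FOR (real engines pack the offsets into narrow bit widths; this only shows the base-plus-offset idea):

```python
def for_encode(values, base=None):
    """Frame of Reference: store one base plus small per-value offsets.
    The base defaults to the column minimum; a round number like 1000
    works just as well."""
    if base is None:
        base = min(values)
    return base, [v - base for v in values]

def for_decode(base, offsets):
    """Reconstruct the original values from the base and offsets."""
    return [base + d for d in offsets]

# The example column above
print(for_encode([1001, 1002, 1004], base=1000))  # (1000, [1, 2, 4])
```

Because the offsets 1, 2, 4 fit in a few bits each, the column can be stored far more densely than three full-width integers.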
The Patching Technique
If your domain of values grows too large, or contains outliers, one option is to compress only the most frequent values. This technique is an optional extension of Dictionary and FOR encoding. Patching is a two-step process: compress the bulk of the column, then store the exceptions separately and patch them back in at read time, so that your future queries stay fast.
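A sketch of patching applied to dictionary encoding, assuming a capped dictionary size (names and the None placeholder are my own illustration, not a provider's format):

```python
from collections import Counter

def patched_dict_encode(values, max_dict_size):
    """Dictionary-encode only the most frequent values; rarer 'exception'
    values are kept uncompressed with their positions (the patch list)."""
    common = [v for v, _ in Counter(values).most_common(max_dict_size)]
    dictionary = {v: code for code, v in enumerate(common)}
    codes, exceptions = [], []
    for pos, v in enumerate(values):
        if v in dictionary:
            codes.append(dictionary[v])
        else:
            codes.append(None)            # placeholder, patched at read time
            exceptions.append((pos, v))   # position + raw outlier value
    return dictionary, codes, exceptions

dictionary, codes, exceptions = patched_dict_encode(
    ["a", "a", "a", "b", "b", "z"], max_dict_size=2
)
print(exceptions)  # [(5, 'z')]
```

The outlier "z" never enters the dictionary, so it cannot bloat the code width for the common values; the reader only decompresses the exception list when it actually touches those rows.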
The internal ideas are largely similar across all providers so it is important to master the common fundamentals. However, syntax and tools are always in flux. To read about each provider’s compression, check out these links: Snowflake, Redshift, BigQuery, and MonetDB.
Thanks for reading!
Snowflake’s guide to semi-structured data: JSON, ORC, Parquet, XML
Data Discovery made simple with Rust and Markdown with Whale
What does Segment do? And how does it relate to Data Engineering as a profession?
How Spotify platforms their Data Engineering work demo video
About the Author and Newsletter
I automate data processes. I work mostly in Python, SQL, and bash. To schedule a consultancy appointment, contact me through ocelotdata.com.
At Modern Data Infrastructure, I democratize the knowledge it takes to understand an open-source, transparent, and reproducible Data Infrastructure. In my spare time, I collect Lenca pottery, walk my dog, and listen to music.
More at: What is Modern Data Infrastructure.