Data Compression: How Fewer Bits Say More
You have a 10MB file. You zip it. Now it's 2MB. What happened to the other 8MB?
They weren't "removed." They were never necessary in the first place. Your original file was wasteful, using more bits than the information actually required. Compression finds that waste and eliminates it.
This post explores variable-length codes, the core technique behind most lossless compression. The idea is simple: give common symbols short codes and rare symbols long ones. The execution is where it gets interesting.
The telegraph operator's wrist
In 1836, Samuel Morse faced a problem. His telegraph could only send one thing: pulses of electricity. Short pulse or long pulse. Dot or dash. He needed to encode the entire English alphabet with just these two signals.
The naive approach: assign each letter a fixed-length code. A gets 5 pulses, B gets 5 pulses, Z gets 5 pulses. Fair and equal.
But telegraph operators had to tap these codes by hand, hundreds of messages per day. Their wrists were wearing out. Morse noticed something: the letter E appears in English text about 13% of the time. The letter Z appears 0.07% of the time. If operators are going to tap E a hundred times more often than Z, shouldn't E be faster to tap?
So E became a single dot. T became a single dash. Meanwhile, Q got four symbols (dash dash dot dash) because nobody sends many Qs.
This was one of the first variable-length codes in history. And the principle still powers every compression algorithm you use today.
Frequency determines code length
The intuition is straightforward. If symbol A appears 90% of the time and symbol B appears 10% of the time, your data stream is mostly As. Give A a 1-bit code and B a 2-bit code. On average, you spend about 1.1 bits per symbol instead of the 1 bit you'd need for fixed-length encoding.
Wait, that's worse, not better.
But look at what went wrong. With only two symbols, fixed-length coding spends exactly 1 bit per symbol, and no per-symbol prefix code can beat that; the assignment above just adds waste. The payoff comes with larger alphabets, where frequent symbols can get codes shorter than the fixed length while only the rare ones pay extra. Used carelessly, variable-length codes can also do worse.
The trick is matching your code lengths to your actual symbol frequencies.
Morse's code follows exactly this pattern: E, T, A, O, I, N get short codes; X, Q, Z get long ones. Morse figured this out by counting letter frequencies in a Philadelphia newspaper's type cases. More Es meant more E-type blocks in the printer's tray.
Entropy: the theoretical limit
Before we build our own codes, we need to know: how good can compression get?
Claude Shannon answered this in 1948. He defined entropy as the average amount of information per symbol in a data set. The formula looks intimidating, but the intuition is simple.
If your data is completely predictable (all As, for example), entropy is zero. You need zero bits to communicate something the receiver already knows.
If your data is completely random (equal probability of every symbol), entropy is at its maximum. Each symbol is a surprise, and surprises cost bits.
This entropy value is your floor. No lossless compression can do better than entropy bits per symbol, on average. Huffman coding gets close. Arithmetic coding gets even closer.
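For reference, the formula itself is short. If symbol s appears with probability p(s), the entropy is H = −Σ p(s) · log₂ p(s) bits per symbol. All As means p = 1 and H = 0; a uniform spread over all 256 byte values means H = 8 bits, exactly what a fixed-length byte spends.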
But wait, I compressed a file to smaller than its entropy would suggest?
That means the file had structure beyond simple symbol frequencies. Maybe repeated phrases, patterns, or predictable sequences. Advanced compressors exploit these too, but that's a topic for dictionary-based compression, not this post.
Building a histogram
Compression starts with counting. You walk through your data and tally how many times each symbol appears. This is your histogram.
From the histogram, you calculate probabilities. Symbol O appears 3 times out of 9 total? Its probability is 3/9, or about 33%. This probability determines how many bits O should get.
The connection between probability and code length is: high probability means short code. The formula is roughly: code length for symbol S ≈ -log₂(P(S)).
If a symbol appears 50% of the time, -log₂(0.5) = 1 bit. If it appears 25% of the time, -log₂(0.25) = 2 bits. The math works out elegantly.
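A minimal sketch of that counting step in Python (the sample string here is just an illustration):

```python
from collections import Counter
from math import log2

text = "TO BE OR NOT TO BE"
counts = Counter(text)                 # histogram: symbol -> occurrences
total = sum(counts.values())

for symbol, count in counts.most_common():
    p = count / total                  # probability of this symbol
    ideal_bits = -log2(p)              # ideal (possibly fractional) code length
    print(f"{symbol!r}: p={p:.3f}, ideal length ≈ {ideal_bits:.2f} bits")

# Entropy = average ideal code length, the floor from the previous section
entropy = sum(-(c / total) * log2(c / total) for c in counts.values())
print(f"entropy ≈ {entropy:.2f} bits per symbol")
```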
The prefix property
There's a catch. You can't just assign any bit sequences as codes. Consider this assignment:
- A = 0
- B = 01
- C = 10
Now decode the bit stream 0101. Is it:
- AB (0, 01, ?)
- AAC (0, 0, 10, ?)
- Something else?
The problem: A's code (0) is a prefix of B's code (01). When the decoder reads a 0, it can't tell if this is a complete A or the start of a B.
Variable-length codes must obey the prefix property: no code can be a prefix of any other code. This ensures unambiguous decoding. You read bits until you match a code, output the symbol, and start fresh.
The tradeoff: prefix-free codes tend to be slightly longer than the theoretical minimum. You give up a little compression for the ability to actually decode your data.
Huffman coding
In 1951, David Huffman was a graduate student at MIT. His professor, Robert Fano, gave the class a choice: take a final exam, or write a term paper finding the optimal variable-length code.
Huffman struggled for months. He tried dozens of approaches but couldn't prove any of them optimal. The night before he planned to give up and study for the exam, he threw his notes in the trash.
Then it clicked.
The insight: build the code bottom-up. Start with your symbols sorted by frequency. Take the two least frequent symbols, combine them into a new node whose frequency is their sum. Repeat until you have one node. The path from root to each leaf (left = 0, right = 1) gives you the code.
This produces the optimal prefix-free code for a given set of symbol frequencies. Optimal means: no other prefix-free code has a shorter average length.
The key properties:
- More frequent symbols end up closer to the root (shorter paths, shorter codes)
- Less frequent symbols end up as leaves lower in the tree (longer paths, longer codes)
- The resulting codes automatically satisfy the prefix property
Encoding and decoding
With your Huffman codes built, encoding is mechanical. Walk through the input, look up each symbol's code in your table, emit the bits.
Decoding reverses the process. Read bits one at a time. Walk down the tree (0 = left, 1 = right) until you hit a leaf. Output that symbol. Return to the root and repeat.
One crucial detail: the decoder needs the code table. You have to transmit it alongside your compressed data. For short messages, this overhead might exceed your savings. Huffman coding works best on longer data where the table cost is amortized.
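Here's a toy end-to-end version in Python — build the tree bottom-up with a priority queue, read codes off the root-to-leaf paths, then encode and decode. It's a sketch of the idea, not the bit-level layout any real format uses:

```python
import heapq
from collections import Counter

def build_codes(data):
    # Bottom-up Huffman: repeatedly merge the two least frequent nodes.
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(freq, i, sym) for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (a, b)))
        next_id += 1
    tree = heap[0][2]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):            # internal node: recurse both ways
            walk(node[0], prefix + "0")        # left edge emits a 0
            walk(node[1], prefix + "1")        # right edge emits a 1
        else:
            codes[node] = prefix or "0"        # leaf: the path so far is the code
    walk(tree, "")
    return tree, codes

def encode(data, codes):
    return "".join(codes[s] for s in data)     # mechanical table lookup

def decode(bits, tree):
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]   # walk down the tree
        if not isinstance(node, tuple):             # reached a leaf
            out.append(node)
            node = tree                             # start over at the root
    return "".join(out)

message = "TO BE OR NOT TO BE"
tree, codes = build_codes(message)
bits = encode(message, codes)
assert decode(bits, tree) == message
print(codes)
print(f"{len(bits)} bits vs {len(message) * 8} bits as ASCII")
```

On this message the space and the letter O, the most frequent symbols, come out with the shortest codes, exactly as the frequency argument predicts.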
Huffman's limitations
Huffman coding assigns integer bit lengths to symbols. A symbol gets 1 bit, or 2 bits, or 5 bits. Never 1.3 bits.
But entropy might tell you a symbol deserves 1.3 bits. When probabilities aren't neat powers of two (1/2, 1/4, 1/8...), Huffman wastes some space rounding up.
For a symbol with probability 0.4, optimal coding would use -log₂(0.4) ≈ 1.32 bits. Huffman gives it either 1 or 2 bits. That gap adds up.
This is where arithmetic coding enters.
Arithmetic coding
Arithmetic coding takes a different approach. Instead of assigning a codeword to each symbol, it encodes the entire message as a single number between 0 and 1.
The encoding process repeatedly subdivides the number line. Each symbol narrows the interval based on its probability. After processing all symbols, any number within the final interval decodes to the original message.
The clever part: the final interval's size equals the product of all symbol probabilities. And -log₂(interval size) gives you the number of bits needed to specify a number in that interval.
This means arithmetic coding approaches entropy almost exactly, without the rounding waste of Huffman.
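A float-based sketch of just the interval narrowing (real coders use integer ranges and renormalization to avoid precision problems; the probability table here is made up for the example):

```python
from math import log2

# Toy model: fixed symbol probabilities, assumed for illustration.
probs = {"A": 0.6, "B": 0.3, "C": 0.1}

def narrow(message):
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        start = 0.0
        for s, p in probs.items():          # find the symbol's slice of the current interval
            if s == sym:
                low, high = low + start * width, low + (start + p) * width
                break
            start += p
    return low, high

low, high = narrow("AABA")
width = high - low                           # equals 0.6 * 0.6 * 0.3 * 0.6
print(f"interval width = {width:.6f}, bits needed ≈ {-log2(width):.2f}")
```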
The tradeoff: arithmetic coding is computationally heavier, and key variants were patented until the early 2000s. Modern formats rely on its variants anyway: LZMA uses range coding, H.264 uses CABAC, and JPEG defines a rarely used arithmetic mode alongside its usual Huffman coding.
Comparing approaches
Different encodings suit different data. Fixed-length is simplest but ignores frequency. Unary works when one symbol dominates. Huffman balances optimality with simplicity. Arithmetic coding squeezes out the last fraction of a bit.
Real compressors often combine techniques. DEFLATE (used in zip, gzip, PNG) uses LZ77 for finding repeated patterns, then Huffman for encoding the results. The two stages complement each other.
VarInt: practical variable-length integers
One more technique worth knowing: VarInt, popularized by Google's Protocol Buffers.
VarInt isn't about symbol frequencies. It's about integer size. Small numbers use fewer bytes than large numbers. The most significant bit of each byte signals whether more bytes follow.
This is useful when most of your integers are small. Database IDs, lengths, counts, offsets. Instead of always using 4 or 8 bytes, you use 1 byte for small values and expand as needed.
VarInt appears in Protocol Buffers, SQLite, WebAssembly, and many internal formats at Google. It's a simple idea with wide applicability.
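A minimal sketch of that byte layout — low 7 bits per byte, high bit as the "more follows" flag. This follows the general base-128 scheme described above; Protocol Buffers adds details like ZigZag encoding for signed numbers:

```python
def varint_encode(n: int) -> bytes:
    # Emit 7 bits per byte, least significant group first.
    # The high bit of each byte says "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def varint_decode(data: bytes) -> int:
    value, shift = 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:          # high bit clear: last byte
            return value
    raise ValueError("truncated varint")

assert varint_encode(1) == b"\x01"           # small values: 1 byte
assert varint_encode(300) == b"\xac\x02"     # 300 needs two bytes
assert varint_decode(varint_encode(2**40)) == 2**40
```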
Beyond frequency: finding patterns
Variable-length codes like Huffman work by exploiting symbol frequencies. Common letters get short codes. But what about repeated phrases?
Consider: "TO BE OR NOT TO BE"
Huffman sees 13 individual letters and compresses each based on how often it appears. But a human sees "TO BE" repeated twice. If we could encode "TO BE" as a single token, we'd compress better than counting letters ever could.
This is where dictionary transforms enter. Instead of encoding individual symbols, they find repeated substrings and replace them with shorter references.
LZ77: The sliding window
In 1977, Abraham Lempel and Jacob Ziv published an algorithm that changed compression forever. LZ77 doesn't care about symbol probabilities. It looks for one thing: have I seen this exact sequence of bytes before?
The algorithm splits your data into two parts:
- Search buffer: bytes you've already processed (last 32KB or so)
- Look-ahead buffer: bytes you're about to encode (next few dozen bytes)
When encoding, you search backward in the search buffer for the longest match to what's in your look-ahead buffer. When you find a match, you emit a token: [offset, length] instead of the raw bytes.
Example: encoding "TOBEORNOTTOBEORTOBEORNOT"
When you hit the second "TOBE", you look back and find the first "TOBE" 9 bytes earlier. Instead of encoding T-O-B-E again (4 bytes), you output [9, 4], meaning "copy 4 bytes from 9 positions back."
This is how zip, gzip, and PNG compress data. The LZ step finds repeated patterns. Then Huffman or arithmetic coding compresses the resulting tokens.
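A toy version of the matching step. It emits either literal characters or (offset, length) back-references, using a brute-force search instead of the hash chains real implementations rely on:

```python
def lz77_tokens(data: str, window: int = 32):
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)               # search buffer
        for j in range(start, i):                # brute-force longest match
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if best_len >= 3:                        # short matches aren't worth a token
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append(data[i])               # literal
            i += 1
    return tokens

print(lz77_tokens("TOBEORNOTTOBEORTOBEORNOT"))
# 9 literals for "TOBEORNOT", then the back-references (9, 6) and (15, 9)
```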
Why the sliding window?
You can't keep the entire file in memory. A 1GB file would require 1GB of search space. The sliding window (typically 32KB) is a compromise. You can find recent repeated patterns without unbounded memory.
The assumption: correlated data clusters together. If you typed "the" once, you'll probably type "the" again soon. But "the" from page 1 and "the" from page 1000 are too far apart to match within a 32KB window.
This makes LZ77 fast. You only search a fixed-size window, so lookup time is bounded. And for most real data, nearby repetitions are the ones worth finding.
When LZ77 dominates
LZ77 works best on data with repeated patterns:
- Text files (repeated words and phrases)
- HTML/XML (repeated tags)
- Source code (repeated function names, keywords)
- Uncompressed images (pixels in flat color regions)
It works poorly on:
- Already-compressed data (zip of a zip does nothing)
- Encrypted data (looks random, no patterns)
- Truly random data (photos from a camera sensor's noise)
LZ77 is finding context, not just frequency. The sequence "TOBE" isn't special because those letters are common. It's special because that exact sequence repeats.
Run-length encoding
Sometimes the pattern is trivial: the same symbol repeats many times in a row.
AAAAABBBBBCCC could be represented as [A,5][B,5][C,3]. This is run-length encoding (RLE), one of the oldest compression techniques.
RLE appears in fax machines (pages are mostly white), early image formats (BMP used it), and as a preprocessing step in other compressors.
The problem: short runs. If your data is "ABCABC", RLE makes it worse. Each symbol becomes a [symbol, length] pair. You've doubled the size.
Solution: add a bit flag for each run. 1 means "this is a run, read the length." 0 means "this is a literal symbol, no length follows." This way, non-repeated symbols only cost 1 extra bit, not a full length field.
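The basic encoder fits in a few lines (no flag bit here, so it shows exactly the short-run weakness described above):

```python
from itertools import groupby

def rle_encode(data: str):
    # Collapse each run of identical symbols into a (symbol, count) pair.
    return [(sym, len(list(run))) for sym, run in groupby(data)]

def rle_decode(pairs):
    return "".join(sym * count for sym, count in pairs)

pairs = rle_encode("AAAAABBBBBCCC")
print(pairs)                                  # [('A', 5), ('B', 5), ('C', 3)]
assert rle_decode(pairs) == "AAAAABBBBBCCC"
print(rle_encode("ABCABC"))                   # every run has length 1: worse, not better
```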
When RLE shines
RLE works when you have long runs of identical values:
- Black and white images (think scanned documents)
- Simple graphics with solid color regions
- Video frames with large unchanged areas (delta compression between frames)
It fails on text, photos, or anything where values change frequently. Modern compressors rarely use pure RLE, but it appears as a subroutine inside more complex algorithms.
Delta coding: numbers that don't jump around
Numeric data is painful to compress. Consider GPS coordinates: [51.5074, 51.5073, 51.5072, 51.5071]. Each value is essentially unique, so a frequency-based coder like Huffman sees something close to maximum entropy.
But look at the differences: [51.5074, -0.0001, -0.0001, -0.0001]. Now there's a pattern.
Delta coding stores the difference between adjacent values instead of the values themselves. If your data changes slowly (temperatures, sensor readings, sequential IDs), the deltas are small, and small numbers compress better.
Example: [100, 102, 101, 103, 105]
Fixed encoding: each value needs ⌈log₂(105)⌉ = 7 bits, so 5 × 7 = 35 bits total.
Delta encoding: [100, +2, -1, +2, +2]. The first value still takes 7 bits. The four deltas span only the values -1 to +2, so 2 bits each (with a small bias) is enough. Total: 7 + 4 × 2 = 15 bits.
The deltas have smaller range, so they need fewer bits. Then you can apply VarInt or Huffman to the delta stream.
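Delta encoding and its inverse, using the readings from the example above:

```python
def delta_encode(values):
    # Keep the first value, then store differences between neighbours.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)      # running sum restores the original values
    return out

readings = [100, 102, 101, 103, 105]
deltas = delta_encode(readings)
print(deltas)                        # [100, 2, -1, 2, 2]
assert delta_decode(deltas) == readings
```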
Delta variants
Basic delta coding (subtract previous value) works, but variants handle edge cases:
XOR delta: Use bitwise XOR instead of subtraction. Never produces negative values, which is handy for unsigned integers.
Frame of reference (FOR): Subtract the minimum value from all values. If your data is [1000, 1001, 1002], subtract 1000 to get [0, 1, 2], which fits in 2 bits instead of 10.
Patched FOR (PFOR): Like FOR, but handle outliers separately. If your data is [1, 2, 3, 999], don't let the 999 force everything into 10 bits. Store most values in 2 bits and the outlier in a separate exception list.
These appear in databases (compressed column stores), time-series databases (InfluxDB, Prometheus), and search engines (inverted index compression).
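Frame of reference is nearly as short — a sketch that just reports the base, the per-value bit width, and the offsets (real column stores then bit-pack the offsets):

```python
def for_encode(values):
    # Frame of reference: store the minimum once, then small offsets.
    base = min(values)
    offsets = [v - base for v in values]
    bits_needed = max(offsets).bit_length() or 1   # bits per packed offset
    return base, bits_needed, offsets

print(for_encode([1000, 1001, 1002]))   # (1000, 2, [0, 1, 2])
```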
Move-to-front: adaptive reordering
Here's a weird one. Move-to-front (MTF) doesn't compress directly. It transforms your data to make it more compressible.
MTF keeps a list of all possible symbols, initially in some order (say, alphabetical). As you read each symbol from the input:
- Find its position in the list
- Output that position (0, 1, 2...)
- Move that symbol to the front of the list
Example: encoding "banana" with initial list [a,b,c,d,e,f,g,...,z]
- Read 'b': position 1, output 1, list becomes [b,a,c,d,e,f,g,...,z]
- Read 'a': position 1, output 1, list becomes [a,b,c,d,e,f,g,...,z]
- Read 'n': position 13, output 13, list becomes [n,a,b,c,d,e,f,g,...,z]
- Read 'a': position 1, output 1, list becomes [a,n,b,c,d,e,f,g,...,z]
- Read 'n': position 1, output 1, list becomes [n,a,b,c,d,e,f,g,...,z]
- Read 'a': position 1, output 1, list becomes [a,n,b,c,d,e,f,g,...,z]
Output: [1, 1, 13, 1, 1, 1]
Notice the pattern: the output has lots of small numbers. Recently-used symbols cluster at the front, so they get small position values. Small values compress well.
MTF appears in bzip2 as a preprocessing step before Huffman. It doesn't compress on its own, but it rearranges data so other compressors work better.
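The transform in code, using the lowercase alphabet as the initial list to match the example:

```python
import string

def mtf_encode(data: str, alphabet=string.ascii_lowercase):
    table = list(alphabet)
    output = []
    for sym in data:
        idx = table.index(sym)                # position in the current list
        output.append(idx)
        table.insert(0, table.pop(idx))       # move the symbol to the front
    return output

def mtf_decode(indices, alphabet=string.ascii_lowercase):
    table = list(alphabet)
    out = []
    for idx in indices:
        sym = table[idx]
        out.append(sym)
        table.insert(0, table.pop(idx))       # mirror the encoder's moves
    return "".join(out)

print(mtf_encode("banana"))                   # [1, 1, 13, 1, 1, 1]
assert mtf_decode(mtf_encode("banana")) == "banana"
```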
Burrows-Wheeler Transform: the magic shuffle
BWT is the strangest algorithm in this post. It doesn't exploit frequency or find patterns. It just... rearranges your data. And somehow the rearranged version compresses better.
Here's how it works:
- Take your input string
- Generate all possible rotations of it
- Sort those rotations alphabetically
- Take the last column
That last column is your BWT output. It looks scrambled, but it has a special property: identical characters cluster together.
Example: BWT of "banana"
Rotations (sorted):
abanan
anaban
ananab
banana
nabana
nanaba
Last column: "nnbaaa"
Notice: the input "banana" had its letters scattered around. The output "nnbaaa" has the 'n's clustered together and the 'a's clustered together. This makes RLE and MTF work better.
BWT is the secret weapon inside bzip2. The algorithm is:
- BWT to cluster identical symbols
- MTF to convert clusters into small numbers
- Huffman to compress the small numbers
The decompression process can reverse BWT perfectly, which is non-obvious but provably works.
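The forward transform is only a few lines if you build every rotation explicitly. Real implementations use suffix arrays and usually append an end-of-string sentinel so the inverse knows where to start; this naive version skips both:

```python
def bwt(s: str) -> str:
    # All rotations of s, sorted alphabetically, last column taken.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))   # "nnbaaa" - the n's and a's end up clustered
```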
Putting it all together
Modern compressors stack these techniques:
gzip (DEFLATE):
- LZ77 to find repeated strings
- Huffman to encode the tokens
bzip2:
- BWT to cluster similar symbols
- MTF to convert to small numbers
- RLE for long runs of identical values
- Huffman for final compression
7-Zip (LZMA):
- LZ77 with a huge dictionary (megabytes, not kilobytes)
- Range coding (like arithmetic coding) for tokens
Each layer exploits a different property of the data. LZ77 finds repeated phrases. BWT clusters similar symbols. MTF converts clusters to small numbers. Huffman assigns short codes to common numbers.
No single technique compresses everything. But by layering them, you can handle text (repeated words), images (similar pixels nearby), and structured data (repeated XML tags).
What we learned
Compression is about exploiting structure:
- Variable-length codes (Huffman, arithmetic) exploit symbol frequency
- Dictionary transforms (LZ77, LZ78) exploit repeated substrings
- Contextual transforms (RLE, delta, MTF, BWT) exploit local correlation
- Real compressors stack multiple techniques because different data has different structure
Variable-length codes carried compression from the telegraph era through Huffman in the 1950s and into the 1980s. Dictionary and contextual transforms got us from the 1980s to today's zip files and PNG images.
The next time you compress a file, you'll know: the computer is searching for patterns (LZ77), clustering similar bytes (BWT), converting clusters to small numbers (MTF), and assigning short codes to common values (Huffman). Each step finds structure the previous step missed.
It's not magic. It's just really careful pattern matching.
Context makes everything better
All the techniques so far treat each symbol independently. Huffman asks: "How often does E appear?" But it doesn't ask: "How often does E appear after T?"
That's a different question with a different answer. E appears 13% of the time overall. But after T, E appears much more often. "THE" is common. "THX" is not.
This is where context-based encoding enters. Instead of one probability table for all symbols, you have multiple tables. One for "what follows T", one for "what follows H", and so on.
Markov chains: predicting the next symbol
In 1913, Andrey Markov analyzed a novel by Alexander Pushkin, counting how often each letter followed each other letter. He discovered you could predict the next letter better if you knew the previous letter.
This is a Markov chain: the probability of the current state depends only on the previous state, not the entire history.
The power: after Q, U has 99% probability. You can encode U with nearly zero bits because you already know it's coming.
The cost: you need a separate probability table for every possible previous symbol. One table for "after A", one for "after B", and so on. That's 26 tables for English letters, 256 tables for arbitrary bytes.
But the compression wins can be huge. Common pairs like TH, HE, AN get super short codes.
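Building those per-symbol tables is just nested counting. A sketch (the sample sentence is arbitrary):

```python
from collections import Counter, defaultdict

def order1_model(text):
    # counts[prev][cur] = how many times `cur` followed `prev`
    counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1
    return counts

model = order1_model("the theory of the thing though")
after_h = model["h"]
total = sum(after_h.values())
for sym, n in after_h.most_common():
    print(f"P({sym!r} | 'h') = {n}/{total}")   # heavily skewed toward 'e'
```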
Multi-context encoding
You can go further. Instead of just the previous symbol, look at the previous two symbols. Or three. Or N.
This is called an N-order context:
- Order-1: probability based on previous 1 symbol
- Order-2: probability based on previous 2 symbols
- Order-3: probability based on previous 3 symbols
Example: encoding "THE" with an order-2 context model:
- T uses overall probability (no previous context)
- H uses probability given previous symbol T
- E uses probability given previous two symbols TH
"TH" is common, so H after T gets a short code. "THE" is even more common, so E after TH gets an even shorter code.
The pattern: the more context you use, the better your predictions. But also the more memory and computation you need.
Prediction by Partial Matching (PPM)
The practical implementation of Markov chains for compression is called PPM, invented in 1984 by John Cleary and Ian Witten.
PPM solves a key problem: what if you have 1000 bytes of context data, but you've never seen those exact 1000 bytes before? You'd have zero probability, which breaks compression.
PPM's solution: try shorter contexts until you find a match.
Algorithm:
- Read next symbol S
- Look at previous N symbols (your N-order context)
- Check: have you seen S after this context before?
- If yes, encode S with that probability
- If no, emit an "escape code" and try N-1 context
- Repeat until you find a match or run out of context
This is called backing off to shorter contexts.
PPM stores contexts in a trie (pronounced "try"), a tree where each path represents a context. The trie tracks how many times each symbol appeared after each context.
Benefits of the trie:
- Fast lookup (just walk down the tree)
- Automatic deduplication (shared prefixes)
- Easy to update (increment counts as you encode)
Downsides:
- Memory grows with unique contexts
- Deep contexts (N=8+) can explode memory
Most PPM implementations use N=5 or N=6 as a sweet spot between compression and memory.
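This isn't a real PPM coder — there are no escape probabilities and no arithmetic coder attached — but it sketches the back-off idea: try the longest context first, and fall back to shorter contexts when the symbol hasn't been seen there:

```python
from collections import Counter, defaultdict

class BackoffModel:
    def __init__(self, max_order=3):
        self.max_order = max_order
        # tables[k] maps a length-k context string to a Counter of next symbols
        self.tables = [defaultdict(Counter) for _ in range(max_order + 1)]

    def update(self, history, symbol):
        for k in range(self.max_order + 1):
            if len(history) >= k:
                self.tables[k][history[len(history) - k:]][symbol] += 1

    def predict(self, history, symbol):
        # Longest context first; "escape" to a shorter one if the symbol is unseen.
        for k in range(min(self.max_order, len(history)), -1, -1):
            ctx = history[len(history) - k:] if k else ""
            counts = self.tables[k].get(ctx)
            if counts and counts[symbol]:
                return k, counts[symbol] / sum(counts.values())
        return None, None        # truly novel symbol: a real PPM emits it literally

model = BackoffModel()
text = "abracadabra"
for i, ch in enumerate(text):
    order, p = model.predict(text[:i], ch)
    model.update(text[:i], ch)
    print(ch, "matched at order", order, "with probability", p)
```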
Context mixing: combining models
Here's where it gets wild. What if you use multiple different models at once?
Context mixing combines predictions from different models to get a better overall prediction. Each model looks at the data differently:
- Model 1: previous 3 characters (Markov chain)
- Model 2: word boundaries (spaces, punctuation)
- Model 3: position in file (start, middle, end)
- Model 4: special patterns (HTML tags, code syntax)
Each model predicts: "I think symbol X has probability P." Then you combine all predictions using weighted averaging or neural networks.
The PAQ family of compressors uses this approach with dozens of models. Different models for text, images, executables, multimedia. They achieve incredible compression ratios at the cost of massive memory (10GB+ for large files) and slow speed (hours to compress 1GB).
Why it works: different data has different structure. English text benefits from word-level models. Binary executables benefit from instruction-pattern models. Images benefit from 2D spatial models. Combining them all means you exploit all the structure, not just one type.
The frontier: adaptive context mixing
Modern context mixers don't use fixed weights. They use machine learning to adjust weights in real-time based on which models are predicting correctly.
Process:
- Each model predicts probability for current symbol
- Weighted average produces final prediction
- Encode symbol using final prediction
- Check which models were right
- Increase weights for accurate models
- Decrease weights for inaccurate models
- Repeat for next symbol
This is basically training a neural network while compressing. The weights adapt to the specific file you're compressing.
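A toy version of that loop for two bit-level models. Real mixers like PAQ use logistic mixing over hundreds of models; this just takes a weighted average and nudges the weights toward whichever model assigned more probability to the bit that actually occurred:

```python
import math

def mix_and_adapt(bits, models, lr=0.1):
    weights = [1.0 / len(models)] * len(models)      # one weight per model
    total_cost = 0.0
    for i, bit in enumerate(bits):
        preds = [m(bits[:i]) for m in models]            # each model: P(next bit = 1)
        p = sum(w * q for w, q in zip(weights, preds))   # weighted-average prediction
        p = min(max(p, 1e-6), 1 - 1e-6)
        total_cost += -math.log2(p if bit else 1 - p)    # bits an ideal coder would spend
        # Nudge weights toward models that gave high probability to what happened.
        for j, q in enumerate(preds):
            assigned = q if bit else 1 - q
            weights[j] = max(weights[j] + lr * (assigned - 0.5), 1e-3)
        total_w = sum(weights)
        weights = [w / total_w for w in weights]
    return total_cost, weights

# Two toy models: one always says 50/50, one bets the previous bit repeats.
always_half = lambda history: 0.5
repeat_last = lambda history: 0.9 if history and history[-1] == 1 else 0.1

bits = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
cost, weights = mix_and_adapt(bits, [always_half, repeat_last])
print(f"≈ {cost:.1f} bits to code {len(bits)} bits, final weights {weights}")
```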
Result: ZPAQ compressed 1GB of text to ~200MB (5:1 ratio) using 14GB of RAM and 6 hours of CPU time. That's better than gzip (3:1 ratio, 50MB RAM, 10 seconds), but at extreme cost.
Why you don't use PPM for everything
PPM and context mixing achieve the best compression ratios possible for statistical methods. So why don't we use them everywhere?
Speed: PPM is slow. Context mixing is slower. gzip compresses 1GB in 10 seconds. PPM takes minutes. Context mixing takes hours.
Memory: gzip uses 1-8MB. PPM uses 50-500MB. Context mixing uses gigabytes.
Decoding: you must rebuild all the same context tables during decompression. If you used 14GB to compress, you need 14GB to decompress.
Diminishing returns: For small files (<10MB), the overhead of building complex models exceeds the compression win. For random data, no amount of context helps.
Use cases: PPM shines for large, highly structured text (books, source code, medical records). Context mixing shines for archiving massive datasets where storage cost matters more than time.
Most of the time, you want the practical middle ground: LZ77 + Huffman (gzip) or LZ77 + arithmetic coding (7-Zip). Fast, low memory, decent compression.
Data modeling: knowing your data
The ultimate lesson: compression requires understanding your data's structure.
- Text: use dictionary compression (LZ77) + Huffman
- Structured text (XML, JSON): use PPM or specialized text compressors
- Source code: use context mixing with code-aware models
- Images: use specialized image codecs (JPEG, WebP, AVIF)
- Video: use specialized video codecs (H.264, H.265, AV1)
- Audio: use specialized audio codecs (MP3, AAC, Opus)
Generic compressors (gzip, 7-Zip, zstd) work on anything, but specialized compressors win when you exploit domain knowledge.
The best compressor is the one that models your specific data best.
The bigger picture
We started with Morse code in the 1830s. One insight: common letters get short codes.
Then Huffman in 1952. Build optimal codes from symbol frequencies.
Then LZ77 in 1977. Find repeated substrings, not just symbols.
Then PPM in 1984. Use previous symbols to predict current symbol.
Then context mixing in 2000s. Combine dozens of models with machine learning.
Each generation found new structure to exploit:
- Morse: symbol frequency
- Huffman: optimal prefix codes
- LZ77: repeated phrases
- MTF/BWT: clustering similar symbols
- PPM: context-based prediction
- Context mixing: multi-model ensemble
Where does it go from here? Probably deeper integration with machine learning. Pre-trained models that understand English, code, images. Models that compress a million files and learn patterns across all of them, not just one file at a time.
But the fundamentals won't change. Compression is pattern recognition. The better you understand the patterns in your data, the better you can compress it.
And that's the journey from "zip a file" to "understand how the bits disappear." It's statistics, tree structures, sliding windows, context tables, and a bit of machine learning. Each piece finds structure the previous piece missed.
Not magic. Just very, very careful pattern matching.
Two worlds of compression
Everything we've covered so far applies to lossless compression: you get the exact original data back. Not a bit lost. This is essential for text, code, databases, and anything where a single flipped bit could mean disaster.
But there's another world out there. A world where "close enough" is good enough.
Media-specific compression dominates the bits flowing through your devices. Images, audio, video. These files don't need to be perfect. They need to look good, sound good, and fit through a network pipe.
General-purpose compression handles everything else. Text files, source code, serialized data, binary blobs. Things that can't tolerate any loss.
Let's explore both worlds and see how they differ.
The picture problem
Here's a sobering comparison. A single 1024 × 1024 RGB image, uncompressed, is 3 megabytes. Each pixel needs 24 bits: 8 for red, 8 for green, 8 for blue.
How much text fits in 3 megabytes?
The famous book The Hobbit contains about 95,000 words. At roughly 5 characters per word, that's 475,000 characters. In ASCII, that's 475 kilobytes.
You could fit The Hobbit about six times into a single 1024 × 1024 image.
This is why media compression matters so much. Images, audio, and video dominate network traffic. A webpage might have 50KB of HTML and 5MB of images. Your phone's storage is 90% photos and videos.
The old saying "a picture is worth a thousand words" is literally true in data terms. Actually, it's worth much more.
Lossy compression: trading quality for size
Media compressors cheat. They don't preserve every bit. They reduce quality in ways humans barely notice, making the data dramatically more compressible.
Consider that 1024 × 1024 image. Each pixel uses 24 bits (8 bits per color channel), giving 16.7 million possible colors.
What if we used only 4 bits per channel instead? Now we have 12 bits per pixel, and the file shrinks to 1.5 MB. Half the size.
The cost: we've reduced from 16.7 million colors to 4,096 colors. That sounds drastic, but the human eye can only distinguish about a million colors. For many images, you won't notice the difference.
This is quantization, reducing the precision of values. It's one of dozens of lossy transforms:
- Quantization: Reduce bit depth (fewer colors, less precise audio samples)
- Downsampling: Reduce resolution (smaller image dimensions)
- Frequency domain: Keep important frequencies, discard details (JPEG, MP3)
- Perceptual modeling: Remove what humans can't perceive (psychoacoustic models in audio)
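Of the transforms above, quantization is the easiest to show directly — here is the 8-bit to 4-bit-per-channel reduction from the earlier example, applied to one made-up RGB pixel:

```python
def quantize_channel(value, bits=4):
    # Keep only the top `bits` bits of an 8-bit channel:
    # the value snaps to one of 16 levels when bits=4.
    step = 256 // (1 << bits)
    return (value // step) * step

pixel = (203, 87, 12)                              # an arbitrary 24-bit RGB pixel
print(tuple(quantize_channel(c) for c in pixel))   # (192, 80, 0)
```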
Each media type has specialized transforms. What works for images fails on audio. What works on speech fails on music. That's why we have JPEG for photos, WebP for web images, MP3 for music, Opus for voice, H.264 for video, and a hundred other formats.
Isn't this just throwing away data?
Yes! But strategically. The trick is discarding information that doesn't affect human perception. A JPEG removes high-frequency color details that your eyes can't resolve anyway. An MP3 removes frequencies masked by louder sounds. Done well, lossy compression is invisible.
Done poorly, you get JPEG artifacts, blocky videos, and tinny audio. The compression community spends enormous effort on "transparent" compression: losses you can't perceive.
General-purpose compression: the lossless world
General-purpose compressors can't cheat. They handle data where every bit matters:
- Text and code: A missing semicolon breaks your program
- Databases: A corrupted record loses customer data
- Executables: A flipped bit crashes your application
- Encrypted data: Any change makes decryption fail
These compressors use everything we've discussed: LZ77, Huffman, arithmetic coding, BWT, context mixing. No lossy transforms allowed.
The major general-purpose compressors:
DEFLATE (1996): LZ77 + Huffman. Used in ZIP, gzip, PNG. Fast, universal, good enough.
BZIP2 (1996): BWT + MTF + Huffman. Better compression, slower, uses more memory.
LZMA (1998): LZ77 with huge dictionaries + range coding. Used in 7-Zip, XZ. Excellent compression, slow.
Zstandard (2016): Modern LZ77 variant + finite state entropy. Fast like gzip, compresses like LZMA.
Brotli (2015): LZ77 + context modeling + Huffman/ANS. Designed for web content. Better than gzip at similar speeds.
Most internet traffic you download has been compressed with one of these. HTTP allows servers to send gzip, brotli, or zstd-encoded responses. Your browser decompresses transparently.
The race to diminishing returns
Here's an uncomfortable truth. General-purpose compression is hitting a wall.
Look at the recent successful encoders: Brotli, LZHAM, LZFSE, Zstandard. They all show the same pattern: minor variations on established themes. A new trick here, a modified transform there. Small savings.
The Large Text Compression Benchmark tracks dozens of compressors on a 1GB text file. The improvements are incremental. We're not seeing breakthroughs of 30-50% better compression. We're seeing 2-10% improvements, at the cost of more memory and CPU time.
Google's compression team has released Snappy, Zopfli, Gipfeli, and Brotli. Each one optimizes for a different point on the tradeoff curve:
- Snappy: Very fast, modest compression. For databases and caches.
- Zopfli: Very slow, excellent compression. For static assets you compress once.
- Brotli: Balanced. The new HTTP compression standard.
The theoretical limit is entropy. Real data has about 1-2 bits of information per character of English text. The best compressors get close to this limit. There's not much room left.
Why this matters for you
If you're building applications, compression is everywhere:
Network traffic: Every API call, every webpage, every asset download. HTTP compression (gzip, brotli) saves bandwidth and speeds up load times.
Storage: Databases, log files, backups. Compression reduces costs and increases effective capacity.
User experience: Faster downloads mean happier users. A 5MB page load feels sluggish. A 500KB page feels snappy.
The practical takeaways:
- Enable HTTP compression. It's free performance. Most servers support it by default.
- Choose the right image format. WebP and AVIF beat JPEG and PNG for web images. Use them.
- Don't compress compressed data. Zipping a JPEG does nothing. The bytes are already compressed.
- Match compressor to use case. Zstd for real-time. Brotli for static assets. Don't use PAQ for runtime compression.
- Profile your data. Is it text-heavy? Image-heavy? Structured? The optimal compression strategy depends on what you're compressing.
The journey continues
We've traveled from telegraph operators counting letter frequencies to neural networks mixing context models. From 1836 to today.
The fundamentals remain constant: compression is pattern recognition. Find structure, exploit it, use fewer bits.
What's changed is our tools. Morse used intuition. Huffman used trees. Modern compressors use machine learning. Each generation finds patterns the previous generation missed.
The next frontier? Pre-trained models that understand content. Compressors that know English grammar, code syntax, image semantics. Models trained on billions of files, bringing that knowledge to compress yours.
But that's a topic for another post. For now, you know how your files shrink when you click "Compress." It's not magic. It's a century and a half of clever pattern matching, one technique stacked on another.
And every time you download a webpage, send a photo, or stream a video, these algorithms are working in the background. Finding patterns. Eliminating waste. Making the internet possible.
Not bad for a telegraph operator's sore wrist.
Deep dive: Image compression in practice
If you're building applications, chances are most of your content is images. Social media feeds, product catalogs, maps, avatars. Images dominate. And while we've covered the theory, there's a whole world of practical decisions that determine whether your images are 50KB or 500KB.
Let's get specific.
The quality slider dilemma
Every image compressor gives you a quality slider. Move it left, files get smaller. Move it right, quality improves. Simple, right?
Not quite. The relationship between quality setting and file size isn't linear. And the relationship between quality setting and perceived quality is even weirder.
Here's the uncomfortable truth: most of the quality slider is useless. Between quality 100 and 75, file size drops dramatically while perceived quality barely changes. Below 75, you start noticing artifacts. Below 50, things get ugly fast.
The imgmin project studied this systematically and found that for most JPEGs, there's minimal perceptible difference between quality 100 and 75, but the file size can be half.
What the industry actually uses
So what quality settings do the big companies use? They've done the math, tested on millions of images, and measured user complaints. Where they landed: everyone clusters around quality 70-85. Not 100. Not even 90. The lesson: if Google, Facebook, and YouTube are comfortable at quality 75, you probably are too.
But here's the catch: these are single quality values applied to all images. A photo of a sunset and a photo of a forest have different compression characteristics. One might look great at quality 65, the other might need 80. The best systems adapt per-image.
What makes images look bad?
When compression goes wrong, you notice. But what exactly are you noticing?
Blocking artifacts: JPG divides images into 8×8 pixel blocks. At low quality, adjacent blocks can have visibly different colors, creating a grid pattern.
Banding: Smooth gradients become stair-stepped when there aren't enough color levels to represent the transition.
Ringing: Halos appear around sharp edges, especially in high-contrast areas.
The Python logo is actually a great test case. It has smooth gradients (the blue and yellow snakes), sharp edges (the outline), and flat color regions. Each of these responds differently to compression.
Flat regions compress beautifully. Edges create ringing at low quality. Gradients show banding. A single quality setting can't optimize for all three.
Measuring quality: beyond "looks good"
Artists can eyeball quality. Algorithms can't. So we need mathematical metrics.
PSNR (Peak Signal-to-Noise Ratio): Measures the difference between original and compressed pixel values. Higher is better. Simple to compute, but doesn't match human perception well. A blurry image can score high because it's "close" to the original in pixel values, even though it looks awful.
SSIM (Structural Similarity Index): Compares edges and structural patterns, not just pixel values. Ranges from 0 (completely different) to 1 (identical). Much better at predicting what humans will notice.
PSNR measures mathematical accuracy. SSIM measures perceptual accuracy. Two compressed versions of an image can have the same PSNR but very different SSIM scores, and SSIM better predicts which one looks better to humans.
This matters for automation. If you're building a system that compresses millions of images, you can use SSIM to find the lowest quality setting that stays above a perceptual threshold. Per-image optimization without human review.
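PSNR is simple enough to compute yourself; this sketch assumes two same-shaped 8-bit images as numpy arrays (SSIM is more involved and usually comes from an image-processing library):

```python
import numpy as np

def psnr(original, compressed, peak=255.0):
    # Mean squared error between the images, log-scaled against the peak value.
    mse = np.mean((original.astype(np.float64) - compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return 10 * np.log10(peak ** 2 / mse)

a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noisy = np.clip(a + np.random.normal(0, 5, a.shape), 0, 255).astype(np.uint8)
print(f"PSNR ≈ {psnr(a, noisy):.1f} dB")
```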
Image dimensions: the forgotten variable
Here's a scenario: a user uploads an 8-megapixel photo from their phone. Another user wants to view it as a 100×100 thumbnail on their feed.
Do you:
- Send the full 8MP image and let the device resize it?
- Resize on the server and send a 100×100 image?
Option A is easy to implement. Option B saves massive bandwidth.
The math is brutal. An 8MP image (3264×2448) at 24 bits per pixel is 24MB uncompressed. A 100×100 thumbnail is 30KB uncompressed. You're sending 800 times more data than necessary.
Even compressed, the full-size image might be 2MB. The thumbnail might be 5KB. That's 400× overhead for data the user will never see.
Modern systems generate multiple resolutions:
- Thumbnail (100px): for feed previews
- Small (400px): for mobile views
- Medium (800px): for tablet/desktop
- Large (1600px): for fullscreen
- Original: for downloads
Each resolution gets its own quality optimization. Thumbnails can tolerate lower quality because artifacts are invisible at small sizes.
Choosing the right format
Four formats dominate the web: PNG, JPG, GIF, and WebP. Each has tradeoffs.
PNG: Lossless. Supports transparency. Great for logos, icons, screenshots. Poor for photos (files are huge).
JPG: Lossy. No transparency. Excellent for photos. Hardware decode support on most devices. The web's workhorse.
GIF: Limited to 256 colors. Supports animation. Largely obsolete except for simple animations. (LZW patent drama made it controversial for a decade.)
WebP: Google's format. Supports both lossy and lossless modes. Supports transparency. 25-35% smaller than equivalent JPG. Not universally supported (though adoption is now widespread).
The decision tree is straightforward:
- Need transparency? → WebP if supported, otherwise PNG
- Is it a photo? → WebP if supported, otherwise JPG
- Simple graphics with few colors? → PNG (lossless beats lossy here)
- Need animation? → WebP (animated) or video formats
The format wars are real. When Google introduced WebP in 2010, Mozilla refused to support it, calling it an insufficient improvement over optimized JPG. They even created MozJPEG to prove their point. Eventually, data won: Firefox added WebP support in 2019 after sites demonstrated consistent 25%+ savings.
Vector vs raster: a different approach
Everything we've discussed assumes raster images: grids of pixels. But there's another way.
Vector images store drawing instructions, not pixels. "Draw a circle at (50,50) with radius 30" instead of "here are 2,827 pixel colors."
The Python logo is actually available as an SVG (vector format). The SVG file is about 10KB. A high-quality PNG of the same logo might be 50KB+. And the SVG scales to any size without quality loss.
Vector advantages:
- Infinite scaling: No pixelation at any size
- Tiny file size: For simple graphics
- Editability: Colors and shapes can be modified
Vector limitations:
- Only works for graphics: Can't represent photos
- Rendering cost: Must be converted to pixels at display time
- Complexity ceiling: Highly detailed images become larger than rasters
Use vectors for: logos, icons, illustrations, UI elements, maps. Use rasters for: photos, complex artwork, screenshots.
Compressing structured data
Images get all the attention, but there's another category of data flowing through your applications: serialized structured data. API responses. Database dumps. Configuration files. The JSON and XML that glue modern systems together.
This data compresses differently than images or raw text. It has patterns that generic compressors miss. And there are techniques, some simple, some clever, that dramatically reduce its footprint.
What is serialization?
Serialization converts in-memory data structures into byte sequences that can be stored or transmitted. Deserialization reverses the process.
Your application has objects, arrays, dictionaries. The network only understands bytes. Serialization bridges the gap.
The data falls into four categories based on who creates it and how it changes:
Dynamic server-built: API responses generated per-request. Product listings, search results, user profiles. Changes constantly, can't be pre-compressed.
Static server-owned: Configuration files, translation tables, asset manifests. Changes infrequently, can be heavily optimized.
Dynamic client-built: User input, form submissions, analytics events. Unpredictable content from untrusted sources.
Static client-owned: Local caches, saved preferences. Client controls the format completely.
Each category has different optimization strategies. Let's start with the biggest offender.
The JSON problem
JSON won. It's everywhere. Every API, every config file, every log line. Human-readable, self-describing, universally supported.
Also: incredibly wasteful.
Consider a simple user object:
{"id":12345,"name":"Alice","email":"alice@example.com","active":true}That's 64 bytes. The actual data is:
id: 4 bytes (32-bit integer)name: 5 bytes (string)email: 17 bytes (string)active: 1 byte (boolean)
Total payload: 27 bytes. JSON overhead: 37 bytes. We're sending 137% more bytes than necessary.
And that's a simple case. Arrays of objects repeat the same field names over and over. A list of 1000 users repeats "id":, "name":, "email":, "active": a thousand times.
Beyond size: decode costs
There's another problem. JSON must be parsed character by character. The decoder reads ", then characters until the next ", then :, then figures out the value type, parses it, and so on.
For a 1MB JSON response, that's millions of string comparisons. On mobile devices, parsing can take 50-100ms, visible latency for something that should be instant.
Binary formats solve both problems: smaller on the wire, faster to decode.
Binary serialization formats
Protocol Buffers (Protobuf): Google's format. Requires a schema definition (.proto file). Fields are numbered, not named. Compact wire format. Fast decode because the schema tells the decoder what types to expect.
FlatBuffers: Also from Google. Zero-copy access, you can read fields directly from the buffer without parsing into objects. Even faster than Protobuf for read-heavy workloads.
MessagePack: "JSON, but binary." No schema required. Similar structure to JSON but with compact type markers instead of text delimiters.
BSON: MongoDB's format. Optimized for fast traversal, not minimal size. Larger than MessagePack but faster to navigate.
Cap'n Proto: Like FlatBuffers, supports zero-copy. Also supports RPC directly.
The trade-off: binary formats sacrifice human readability for efficiency. You can't cat a Protobuf file and understand it. But for high-volume APIs, the savings compound.
Restructuring for compression
Even without changing formats, you can restructure data to compress better.
Consider this list of GPS coordinates:
[
{"lat": 51.5074, "lng": -0.1278, "time": 1609459200},
{"lat": 51.5075, "lng": -0.1277, "time": 1609459201},
{"lat": 51.5076, "lng": -0.1276, "time": 1609459202}
]

This is an array of structs. Each object has the same shape, but the field names repeat. After gzip, the repetition compresses, but you're still paying for JSON's overhead.
Now restructure as a struct of arrays:
{
"lat": [51.5074, 51.5075, 51.5076],
"lng": [-0.1278, -0.1277, -0.1276],
"time": [1609459200, 1609459201, 1609459202]
}

Field names appear once. Values of the same type cluster together. This layout compresses dramatically better because:
- Similar values are adjacent (delta coding works)
- Type homogeneity improves pattern matching
- No repeated keys
The transformation is lossless, same data, different arrangement. But gzip sees more patterns.
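You can check the effect with nothing but the standard library. The records below are synthetic, and the exact byte counts will vary, but the struct-of-arrays layout is smaller before compression and typically smaller after gzip too:

```python
import gzip, json

n = 1000
aos = [{"time": 1609459200 + i, "value": 20 + (i % 7)} for i in range(n)]
soa = {"time": [r["time"] for r in aos], "value": [r["value"] for r in aos]}

for name, payload in [("array of structs", aos), ("struct of arrays", soa)]:
    raw = json.dumps(payload).encode()
    print(f"{name}: {len(raw)} bytes raw, {len(gzip.compress(raw))} bytes gzipped")
```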
Normalization: don't repeat yourself
Repetition is compression's friend and enemy. Repeated patterns compress well. But eliminating repetition before compression is even better.
Consider users with roles:
[
{"id": 1, "name": "Alice", "role": {"name": "Admin", "permissions": ["read", "write", "delete"]}},
{"id": 2, "name": "Bob", "role": {"name": "Admin", "permissions": ["read", "write", "delete"]}},
{"id": 3, "name": "Carol", "role": {"name": "Viewer", "permissions": ["read"]}}
]The "Admin" role object repeats twice. Normalize it:
{
"users": [
{"id": 1, "name": "Alice", "roleId": "admin"},
{"id": 2, "name": "Bob", "roleId": "admin"},
{"id": 3, "name": "Carol", "roleId": "viewer"}
],
"roles": {
"admin": {"name": "Admin", "permissions": ["read", "write", "delete"]},
"viewer": {"name": "Viewer", "permissions": ["read"]}
}
}

Each role is defined once. Users reference roles by ID. Total bytes: smaller. Parse time: same objects, less duplication.
This is database normalization applied to API responses. The client joins the data in memory.
When to denormalize
Normalization has costs. The client must resolve references. If a user needs their role immediately, they must look it up.
Denormalize when:
- Client-side joins are expensive (mobile, embedded)
- Latency matters more than bandwidth
- Data is viewed once and discarded
Normalize when:
- Same data appears many times
- Client already caches reference data
- Bandwidth is constrained (mobile networks)
Many APIs offer both: compact normalized responses with expansion hints, or expanded denormalized responses for convenience.
Segmenting data
Not all fields are used equally. A product listing might have:
- id, name, price: shown in the list view
- description, specs, reviews: shown in the detail view
Sending everything to the list view wastes bytes. Segment the data:
// List endpoint: /products
[{"id": 1, "name": "Widget", "price": 9.99}, ...]
// Detail endpoint: /products/1
{"id": 1, "name": "Widget", "price": 9.99, "description": "...", "specs": {...}}Or use GraphQL to let clients specify exactly which fields they need.
The goal: send only what's necessary. Different views, different payloads.
GPU texture formats: beyond the wire
Here's something most developers miss: compression on the wire is only half the story.
When an image reaches the device, it gets decompressed and loaded into GPU memory as a texture. A 100KB JPG might decompress to a 4MB texture (1024×1024 × 4 bytes per pixel RGBA).
GPU texture compression formats (DXT, ETC, ASTC) keep images compressed in GPU memory. The GPU can render directly from the compressed format without full decompression.
This matters for:
- Games: Hundreds of textures loaded simultaneously
- Maps: Tiles that need fast switching
- AR/VR: Tight memory and latency budgets
A 4MB uncompressed texture might be 1MB in ETC2 format, and the GPU renders it directly. You get 4× more textures in the same memory, and no decompression latency.
Putting it all together
Image compression in practice means:
- Choose format wisely: WebP for web, JPG for photos, PNG for transparency
- Optimize quality per-image: Use SSIM to find the sweet spot
- Serve appropriate dimensions: Don't send 4K to a thumbnail slot
- Consider the full pipeline: Wire compression, GPU texture format, decode performance
- Measure everything: File size, decode time, memory usage, user-perceived quality
The companies that do this well save millions in bandwidth costs. The ones that don't make users wait for images that are 10× larger than necessary.
And now you know why that "quality" slider exists, why it matters where you set it, and why the answer isn't always "maximum."
