How does diff work?
I wanted a diff component for this blog. Something that could show code changes inline, with red lines for deletions and green lines for additions. The kind of thing you see in pull request reviews.
I figured I'd just find a library and drop it in. But the more I looked at how these tools actually work, the more I wanted to understand what was happening underneath. How does the algorithm figure out which lines were added and which were removed? Why does it sometimes get it "wrong" when the minimal diff seems obvious to a human?
So I built one from scratch. Here's what I learned.
The problem
Say you have two versions of a file and you want to find the differences. The naive approach might be to compare them line by line:
Old: A B C
New: A C
Position 0: A = A (match!)
Position 1: B ≠ C (changed?)
Position 2: C vs nothing (deleted?)
Step through to see why this happens:
The naive approach compares by position. When B disappears, it thinks C "changed" to fill its spot. This produces a diff with 4 lines when only 3 are needed.
So how do we find the smallest set of changes? We need to find what didn't change. If we know what's common to both versions, the remaining items must be either insertions or deletions.
The longest common subsequence
To find the minimal diff, we solve a different problem first: finding the Longest Common Subsequence (LCS).
Given two sequences, the LCS is the longest sequence of items that appears in both, in the same order. The items don't need to be adjacent, they just need to maintain their relative order.
Once we have the LCS, producing the diff is straightforward. Items in the original but not in the LCS were deleted. Items in the new file but not in the LCS were inserted. Items in the LCS are unchanged.
Here's the same example using the LCS approach:
Instead of comparing by position, LCS asks "what items appear in both sequences?" It finds that A and C are common, so only B needs to be marked as deleted. The diff is now 3 lines instead of 4. The diff algorithm is really just an LCS algorithm: find what's common, and the differences reveal themselves.
Visualizing as a graph
You can think about diff as a grid. Say we want to transform "ABCAB" into "CBAB":
- The x-axis represents the original string (ABCAB)
- The y-axis represents the new string (CBAB)
- Each cell
(x, y)means "we've processed x characters of the original and y characters of the new"
We start at the top-left corner and need to reach the bottom-right. At each step, moving right means deleting a character from the original. Moving down means inserting a character from the new string. Moving diagonally means the characters match, so no edit is needed.
The shortest path through this grid gives us the minimum edit distance, the smallest number of changes needed. Diagonal moves are "free" because matches don't count as edits.
This is the idea behind the Myers diff algorithm, which Git uses: find the shortest path through an edit graph.
How the algorithm works
The Myers algorithm explores paths in waves, prioritizing diagonals (matches):
- Start at position (0, 0)
- For each "depth" d (number of non-diagonal moves):
- Explore all possible paths that use exactly d moves
- Follow diagonal moves for free whenever characters match
- If we reach the end, we found the shortest path!
What makes this fast is that we can process these waves efficiently. Each depth d can only reach certain diagonal lines (numbered by x - y). We track the furthest point reached on each diagonal.
Step through the algorithm to see how it works. Try changing the input strings to see how the algorithm adapts.
The algorithm is O(ND) where N is the file size and D is the number of differences. For similar files (small D), it runs quickly.
Building the edit script
Once we have the path through the edit graph, we can generate the actual diff output. Each move in the path corresponds to an edit operation:
The diff format you see in git and other tools encodes this path:
- Lines starting with
-are deletions (moved right in the graph) - Lines starting with
+are insertions (moved down) - Lines with no prefix are context (diagonal moves)
The unified diff format
The output format used by git diff and patch is called unified diff. It's designed to be both human-readable and machine-applicable.
The --- and +++ lines show which files are being compared. The @@ lines tell you where in the file the changes are. Lines starting with a space are context, which helps you find the right location even if line numbers have changed. The - and + lines show what was removed and added.
Try it yourself
Edit the text below and watch the algorithm find the minimal changes in real time.
What I ended up with
The diff component I built for this blog uses a simplified version of the Myers algorithm. It's fast enough for the small code snippets I'm comparing, and now I actually understand what's happening when I see those red and green lines.
Instead of asking "what changed?" you ask "what stayed the same?" You find the LCS, treat those lines as anchors, and everything else is either an insertion or a deletion. The edit graph visualization helped me see why naive line-by-line comparison produces worse diffs than necessary. It doesn't account for lines shifting position.
If you're building something similar, the unified diff format is worth understanding. Those context lines aren't just for human readability; they let patches apply correctly even when line numbers have drifted.