Back to Blog

How does diff work?

I wanted a diff component for this blog. Something that could show code changes inline, with red lines for deletions and green lines for additions. The kind of thing you see in pull request reviews.

What diff produces Given two versions of code, diff finds the changes
Original
function greet(name) {
  console.log("Hello");
  return name;
}
Modified
function greet(name) {
  console.log("Hello, " + name);
  return name.toUpperCase();
}
Diff output
function greet(name) {
- console.log("Hello");
- return name;
+ console.log("Hello, " + name);
+ return name.toUpperCase();
}
Deleted
Added
Unchanged

I figured I'd just find a library and drop it in. But the more I looked at how these tools actually work, the more I wanted to understand what was happening underneath. How does the algorithm figure out which lines were added and which were removed? Why does it sometimes get it "wrong" when the minimal diff seems obvious to a human?

So I built one from scratch. Here's what I learned.

The problem

Say you have two versions of a file and you want to find the differences. The naive approach might be to compare them line by line:

Old: A B C
New: A C

Position 0: A = A (match!)
Position 1: B ≠ C (changed?)
Position 2: C vs nothing (deleted?)

Step through to see why this happens:

Naive approach Comparing position by position
Original
A
B
C
New
A
C
Compare position 0
A = A (match)

The naive approach compares by position. When B disappears, it thinks C "changed" to fill its spot. This produces a diff with 4 lines when only 3 are needed.

So how do we find the smallest set of changes? We need to find what didn't change. If we know what's common to both versions, the remaining items must be either insertions or deletions.

The longest common subsequence

To find the minimal diff, we solve a different problem first: finding the Longest Common Subsequence (LCS).

Given two sequences, the LCS is the longest sequence of items that appears in both, in the same order. The items don't need to be adjacent, they just need to maintain their relative order.

The Longest Common Subsequence Finding the longest sequence present in both
Sequence A (original)
A
B
C
D
F
G
H
J
Q
Z
Sequence B (modified)
A
B
C
D
E
F
G
I
J
K
R
X
Y
Z
Longest Common Subsequence
A
B
C
D
F
G
J
Z
The LCS keeps the order from both sequences. Items not in the LCS are either deleted (in A but not LCS) or inserted (in B but not LCS).

Once we have the LCS, producing the diff is straightforward. Items in the original but not in the LCS were deleted. Items in the new file but not in the LCS were inserted. Items in the LCS are unchanged.

Here's the same example using the LCS approach:

LCS approach Find what's common first
Original:
A
B
C
New:
A
C
Comparing
A
B
C
A
C
Find the Longest Common Subsequence
Walk through both strings to find elements that appear in both.

Instead of comparing by position, LCS asks "what items appear in both sequences?" It finds that A and C are common, so only B needs to be marked as deleted. The diff is now 3 lines instead of 4. The diff algorithm is really just an LCS algorithm: find what's common, and the differences reveal themselves.

Visualizing as a graph

You can think about diff as a grid. Say we want to transform "ABCAB" into "CBAB":

  • The x-axis represents the original string (ABCAB)
  • The y-axis represents the new string (CBAB)
  • Each cell (x, y) means "we've processed x characters of the original and y characters of the new"

We start at the top-left corner and need to reach the bottom-right. At each step, moving right means deleting a character from the original. Moving down means inserting a character from the new string. Moving diagonally means the characters match, so no edit is needed.

The Edit Graph Diff as finding the shortest path through a grid
Old:
A
B
C
A
B
New:
C
B
A
B
ABCABCBAB
Right = delete
Down = insert
Diagonal = match (free!)
Start at (0, 0)

The shortest path through this grid gives us the minimum edit distance, the smallest number of changes needed. Diagonal moves are "free" because matches don't count as edits.

This is the idea behind the Myers diff algorithm, which Git uses: find the shortest path through an edit graph.

How the algorithm works

The Myers algorithm explores paths in waves, prioritizing diagonals (matches):

  1. Start at position (0, 0)
  2. For each "depth" d (number of non-diagonal moves):
    • Explore all possible paths that use exactly d moves
    • Follow diagonal moves for free whenever characters match
    • If we reach the end, we found the shortest path!

What makes this fast is that we can process these waves efficiently. Each depth d can only reach certain diagonal lines (numbered by x - y). We track the furthest point reached on each diagonal.

Step through the algorithm to see how it works. Try changing the input strings to see how the algorithm adapts.

Myers algorithm step-by-step Watch the algorithm find the shortest edit path
Old:
A
B
C
New:
A
C
Initialize: Start at (0,0). Track the furthest x position reached on each diagonal k (where k = x - y).
editDepth=0
diagonal=0
x=0
y=0
furthestX={0:0}
furthestX = { 0: 0 }
for editDepth in 0..max:
for diagonal in range(-editDepth, editDepth+1, step=2):
# Starting position - no edits yet
if editDepth == 0:
x = 0
# Insert: come from diagonal k+1, move down (y++)
elif shouldInsert(diagonal, editDepth):
x = furthestX[diagonal + 1]
# Delete: come from diagonal k-1, move right (x++)
else:
x = furthestX[diagonal - 1] + 1
# Follow diagonal (free matches)
while old[x] == new[y]: x++; y++
furthestX[diagonal] = x
if x >= len(old) and y >= len(new):
return editDepth # Found shortest path!
ABCAC

The algorithm is O(ND) where N is the file size and D is the number of differences. For similar files (small D), it runs quickly.

Building the edit script

Once we have the path through the edit graph, we can generate the actual diff output. Each move in the path corresponds to an edit operation:

Building the edit script From LCS path to diff output
Original:
A
B
C
New:
A
C
Step through
A
B
C
A
C
We want to transform Original into New
Edit script (builds up)
(empty)

The diff format you see in git and other tools encodes this path:

  • Lines starting with - are deletions (moved right in the graph)
  • Lines starting with + are insertions (moved down)
  • Lines with no prefix are context (diagonal moves)

The unified diff format

The output format used by git diff and patch is called unified diff. It's designed to be both human-readable and machine-applicable.

Unified diff format Step through each line to see what it means
--- a/greet.js
+++ b/greet.js
@@ -1,4 +1,4 @@
function greet(name) {
- console.log("Hello");
+ console.log("Hello, " + name);
return name;
}
Original file path — shows which file we're diffing from

The --- and +++ lines show which files are being compared. The @@ lines tell you where in the file the changes are. Lines starting with a space are context, which helps you find the right location even if line numbers have changed. The - and + lines show what was removed and added.

Try it yourself

Edit the text below and watch the algorithm find the minimal changes in real time.

Try it yourself Edit the text and see the diff update in real time
Original
Modified
-2deleted
+2added
1unchanged
-The quick brown fox
+The quick red fox
jumps over
-the lazy dog
+the happy dog

What I ended up with

The diff component I built for this blog uses a simplified version of the Myers algorithm. It's fast enough for the small code snippets I'm comparing, and now I actually understand what's happening when I see those red and green lines.

Instead of asking "what changed?" you ask "what stayed the same?" You find the LCS, treat those lines as anchors, and everything else is either an insertion or a deletion. The edit graph visualization helped me see why naive line-by-line comparison produces worse diffs than necessary. It doesn't account for lines shifting position.

If you're building something similar, the unified diff format is worth understanding. Those context lines aren't just for human readability; they let patches apply correctly even when line numbers have drifted.