
Why Is My Python So Slow?

You wrote an image processing script to blur photos, detect edges, or preprocess data for machine learning. You ran it and the performance is terrible. One image takes seconds. A batch of thousands takes hours.

Python itself isn't slow. Your Python is slow because of how you're using it, not because of the language.

The Problem: Image Blur

This post uses a concrete example: blurring an image. Everyone knows what a blurred image looks like, and the operation demonstrates matrix computation clearly.

[Interactive demo: Image Blur. Replace each pixel with the average of its neighbors, with an adjustable kernel size.]

The algorithm replaces each pixel with the average of its neighbors. This is called a box blur, the simplest form of convolution:

blurred = [[0] * width for _ in range(height)]

# Skip the one-pixel border so i + ki and j + kj stay in bounds
for i in range(1, height - 1):
    for j in range(1, width - 1):
        total = 0
        for ki in range(-1, 2):
            for kj in range(-1, 2):
                total += image[i + ki][j + kj]
        blurred[i][j] = total / 9

Four nested loops. Just addition and division. The simplest operations a computer can do.

So why does this take forever on a large image?

Python Lists Are Not What You Think

When you write image = [[0] * width for _ in range(height)], you're creating a list of pointers (memory addresses that reference data stored elsewhere) to other lists of pointers to Python objects scattered in memory. You're not creating a contiguous block of numbers.
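You can see the overhead directly. Here's a rough probe (the exact byte counts are CPython implementation details and vary by version):

import sys
import numpy as np

n = 1_000_000
py_list = [float(i) for i in range(n)]     # one million separate float objects
np_array = np.arange(n, dtype=np.float64)  # one contiguous block, 8 bytes per element

pointers = sys.getsizeof(py_list)                 # the list itself: just a pointer array
objects = sum(sys.getsizeof(x) for x in py_list)  # 24 bytes per float object in CPython
print(f"list: {(pointers + objects) / 1e6:.0f} MB, numpy: {np_array.nbytes / 1e6:.0f} MB")

On a typical 64-bit CPython, the list version needs roughly four times the memory of the NumPy array, before you even touch an element.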

[Interactive demo: Memory Layout. Python lists store pointers to values scattered across the heap; NumPy arrays store the values themselves, back to back.]


[Interactive demo: Pointer Chase. Step through accessing list[2][1]; it takes four memory accesses.]

Every time you access image[i][j], the CPU has to:

  1. Look up the pointer at position i in the outer list
  2. Follow that pointer to find the inner list
  3. Look up the pointer at position j in that list
  4. Follow that pointer to find the actual number

Four steps to access one pixel.
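Spelled out in code, that double subscript hides two pointer hops (an illustration only; this is what image[i][j] does in CPython):

row = image[i]   # steps 1-2: read the pointer in slot i, follow it to the inner list
pixel = row[j]   # steps 3-4: read the pointer in slot j, follow it to the number object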

The computer can access any memory address, but not all memory accesses are the same speed. To understand why Python's pointer chasing is so expensive, we need to look at how the CPU actually fetches data from memory.

The Memory Hierarchy

Modern CPUs can do billions of calculations per second, but RAM can't keep up.

Fetching data from RAM takes about 100 CPU cycles (the basic unit of CPU time - one tick of the processor's clock). The CPU sits idle for 100 cycles waiting for data. This is the von Neumann bottleneck: the limited bandwidth between the CPU and memory that prevents data from being delivered as fast as the CPU can process it.

CPUs solve this with caches: small, fast memory banks built directly into the processor chip, close to the processing cores.

Memory Level    Size       Speed
L1 Cache        ~32 KB     4 cycles
L2 Cache        ~256 KB    12 cycles
L3 Cache        ~8 MB      40 cycles
RAM             ~16 GB     100+ cycles

When the CPU needs data, it first checks L1. If it's not there (cache miss), it checks L2, then L3, then RAM. Each level is slower.

Caches load data in chunks called cache lines, typically 64 bytes. When you access memory address 100, the CPU loads the entire line containing it. If your next access is address 101, it's already there (cache hit). If your next access is address 500, you need another RAM trip (cache miss).

[Interactive demo: Cache Block Loading. Accessing address 100 pulls its whole cache line into the CPU, so the neighboring addresses come along for free.]

[Interactive demo: Cache Hits vs Misses. Sequential access keeps data in cache and the hit rate high; random access keeps evicting it.]

Sequential access patterns are fast. Random access is slow. Python's fragmented list structure forces random access.
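You can measure the gap yourself. Here's a rough sketch (absolute timings depend on your machine; the ratio is the point):

import timeit
import numpy as np

n = 10_000_000
data = np.random.rand(n)         # ~80 MB: far larger than any cache level
seq = np.arange(n)               # indices in order: the prefetcher's best case
rnd = np.random.permutation(n)   # the same indices, shuffled

t_seq = timeit.timeit(lambda: data[seq].sum(), number=5)
t_rnd = timeit.timeit(lambda: data[rnd].sum(), number=5)
print(f"sequential: {t_seq:.2f}s  random: {t_rnd:.2f}s")

Both runs touch exactly the same ten million elements; only the order differs.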

Memory Fragmentation

Python doesn't natively support vectorization because:

  1. Python lists store pointers to data, not the data itself
  2. Python bytecode isn't optimized for vectorization

When data is fragmented (scattered across memory), you can't move it efficiently. You chase pointers across the heap instead of copying one big block. The CPU's prefetcher (hardware that predicts which memory you'll need next and loads it early) can't predict the pattern.

Sequential access: prefetcher predicts correctly.

[Interactive demo: Sequential Access. The prefetcher spots the pattern and stays ahead of the CPU.]

Random access: prefetcher can't predict.

[Interactive demo: Random Access. The prefetcher has no pattern to learn, so every access waits on memory.]

This prevents SIMD (Single Instruction, Multiple Data) operations - a CPU hardware feature that processes multiple numbers simultaneously. SIMD can multiply 4, 8, or 16 numbers in a single instruction, but only if those numbers sit next to each other in memory.

Measuring the Problem: perf

Before we fix the problem, let's measure it precisely. Understanding exactly what's happening at the hardware level will guide our optimizations.

Linux has a tool called perf that reads hardware performance counters directly from the CPU.

Run your Python script with:

perf stat python blur_python.py

Here's what it reports for the pure-Python blur:

$ perf stat python blur_python.py

Performance counter stats for 'python blur_python.py':

    526,056,413,046      cycles              #     3.190 GHz
  1,108,102,913,558      instructions        #     2.11  insn per cycle
      1,631,523,018      cache-references    #     9.893 M/sec
         64,843,956      cache-misses        #     3.974 % of all cache refs
    213,479,087,583      branches            #  1294.434 M/sec
      2,555,147,651      branch-misses       #     1.20% of all branches
         164,920.79 msec task-clock          #     1.000 CPUs utilized
             12,593      faults              #     0.076 K/sec
             12,593      minor-faults        #     0.076 K/sec

      164.928479100 seconds time elapsed

The problems:

  • 1 trillion instructions for a simple blur
  • Only using 1 CPU due to Python's GIL (Global Interpreter Lock - a mechanism that prevents Python from using multiple CPU cores for pure Python code)
  • 65 million cache misses from pointer chasing
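To zero in on the memory behavior, perf lets you pick individual counters with -e (these event names are standard perf events, though availability varies by CPU):

perf stat -e cycles,instructions,cache-references,cache-misses python blur_python.py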

Now that we understand the problem - scattered memory causing cache misses - let's see how NumPy fixes it.

Enter NumPy: Contiguous Memory

NumPy stores data as a contiguous block of raw numbers. No pointers, no Python object overhead, just bytes in a row.

import numpy as np
image = np.zeros((height, width))  # One contiguous block

When you access image[0, 0], the CPU loads the surrounding cache line: with 8-byte floats, that's 8 pixels per 64-byte line, and the prefetcher streams in the lines after it. Accessing image[0, 1] through image[0, 15] costs almost nothing because the data is already cached.
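You can inspect this layout directly. A small probe (the numbers assume the default float64 dtype):

import numpy as np

image = np.zeros((1080, 1920))
print(image.flags['C_CONTIGUOUS'])  # True: one solid row-major block
print(image.strides)                # (15360, 8): 8 bytes to the next column, 1920*8 to the next row
print(image.nbytes)                 # 16588800 bytes total, no per-pixel object overhead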

The same code with NumPy:

$ perf stat python blur_numpy.py

Performance counter stats for 'python blur_numpy.py':

     11,285,119,496      cycles              #     3.186 GHz
     12,434,411,770      instructions        #     1.10  insn per cycle
      1,539,129,790      cache-references    #   434.563 M/sec
        142,870,817      cache-misses        #     9.283 % of all cache refs
      1,799,214,646      branches            #   507.997 M/sec
         36,642,230      branch-misses       #     2.04% of all branches
           3,541.78 msec task-clock          #     1.613 CPUs utilized
            508,036      faults              #     0.143 M/sec
            508,036      minor-faults        #     0.143 M/sec

        2.195482613 seconds time elapsed

The differences:

  • 89× fewer instructions (12 billion vs 1.1 trillion)
  • 47× fewer cycles (11 billion vs 526 billion)
  • Uses 1.6 CPUs instead of 1 (automatic parallelism)

Contiguous memory gets us 75× faster, but there's another optimization NumPy uses: vectorization. Instead of processing one number at a time, modern CPUs can process multiple numbers simultaneously.

Vectorization: Do Four Things at Once

Modern CPUs have SIMD (Single Instruction, Multiple Data) instructions. SIMD processes 4, 8, or 16 pixels in a single instruction instead of one at a time.

[Interactive demo: SIMD Registers. Load 4 values into a register, operate once, store 4 results.]

Python for loop: one element at a time.

[Interactive demo: Scalar Operations. Eight multiplications, one instruction per element: 8 operations.]

NumPy: 4 elements per instruction.

[Interactive demo: SIMD Operations. The same eight multiplications, four elements per instruction: 2 operations.]

A Python for loop forces the CPU to do one operation at a time. NumPy's vectorized operations use SIMD instructions behind the scenes.

Python Loop (h × w operations, no SIMD):

# Process pixels one at a time
for i in range(height):
    for j in range(width):
        result[i, j] = image[i, j] * 2

NumPy Vectorized (h×w / 8 operations, uses SIMD):

# Process 4-8 pixels per instruction
result = image * 2

Same result, dramatically different performance. For our blur operation, SciPy provides scipy.ndimage.convolve, which uses these vectorized instructions.
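For the blur, that looks something like this (a minimal sketch; the 3 × 3 box kernel matches the loop version above):

import numpy as np
from scipy import ndimage

def blur(image):
    kernel = np.full((3, 3), 1 / 9)         # box blur: average of the 3x3 neighborhood
    return ndimage.convolve(image, kernel)  # compiled C loops under the hood

blurred = blur(np.random.rand(1080, 1920))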

The Hidden Cost: Memory Allocation

Contiguous memory and vectorization give us massive speedups, but there's one more optimization that often gets overlooked: avoiding unnecessary memory allocations.

Every time we process a frame or apply a filter, we might create new arrays:

def blur(image):
    result = np.zeros_like(image)  # Allocation happens here!
    # ... convolution ...
    return result

Memory allocation is expensive:

  1. Ask the operating system for memory (system call overhead)
  2. Find a suitable free block (memory management overhead)
  3. Initialize the memory to zeros (writing to every byte)
  4. Later, garbage collect the old array (cleanup overhead)

Reuse memory instead. Allocate your arrays once, then write to them in-place:

[Interactive demo: Memory Allocation. Allocating a fresh array each iteration vs reusing one, with running counts of allocations and iterations.]
# Allocate once, outside the loop
result = np.zeros((height, width))
 
for frame in video_frames:
    # Write into existing array, no allocation
    scipy.ndimage.convolve(frame, kernel, output=result)
    process(result)

NumPy and SciPy support in-place operations with the output= or out= parameter:

# This allocates a new array
c = a + b
 
# This writes into existing array
np.add(a, b, out=c)
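The difference shows up in a quick benchmark (a rough sketch; the sizes and repeat counts are arbitrary):

import timeit
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = np.empty_like(a)

t_alloc = timeit.timeit(lambda: a + b, number=100)                  # fresh 32 MB array per call
t_inplace = timeit.timeit(lambda: np.add(a, b, out=c), number=100)  # reuses c every call
print(f"allocating: {t_alloc:.2f}s  in-place: {t_inplace:.2f}s")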

Going Further: GPUs and Other Tools

For even larger workloads, GPUs can process thousands of operations in parallel using libraries like PyTorch or CuPy. However, for most image processing tasks like our blur operation, the overhead of transferring data between CPU and GPU outweighs the benefits unless you're processing very large batches (thousands of images) or working with 4K+ resolution.
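For a sense of what that looks like, here's a minimal CuPy sketch (assumes an NVIDIA GPU with cupy installed; the API deliberately mirrors NumPy):

import numpy as np
import cupy as cp

image = np.random.rand(4096, 4096)

gpu_image = cp.asarray(image)    # CPU -> GPU transfer: the overhead you must amortize
gpu_result = gpu_image * 2       # runs on the GPU, massively parallel
result = cp.asnumpy(gpu_result)  # GPU -> CPU transfer: overhead again

Unless the middle step does enough work, the two transfers dominate the runtime.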

Putting It All Together

Here's how our blur code evolves as we apply each optimization:

The Optimization Journey

Version 1: pure Python with nested lists.

def blur(image):
    h, w = len(image), len(image[0])
    result = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            total = 0
            for ki in range(-1, 2):
                for kj in range(-1, 2):
                    total += image[i + ki][j + kj]
            result[i][j] = total / 9
    return result

Four nested loops, pointer chasing on every access, no vectorization possible. Time: 631 s. Instructions: 1.1 trillion.
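The interactive journey steps through the later versions too. As a sketch, the NumPy in-place variant from the results table below might look like this (the exact "Optimized" implementation isn't reproduced here):

import numpy as np
from scipy import ndimage

KERNEL = np.full((3, 3), 1 / 9)

def blur_into(image, result):
    # Same box blur, written into a preallocated array: no allocation per frame
    ndimage.convolve(image, KERNEL, output=result)
    return result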

The Results

From 631 seconds to under 2 seconds. A 350× speedup without changing the algorithm.

Implementation    Time    Speedup    Description
Pure Python       631s    1×         Nested loops with Python lists
NumPy basic       8.5s    74×        NumPy arrays with scipy.ndimage.convolve
NumPy in-place    3.2s    197×       Reusing arrays with the output= parameter
Optimized         1.8s    350×       Optimized convolution with SIMD

Optimization             Speedup    Why
Reuse memory (Python)    1.16×      Fewer allocations
NumPy arrays             75×        Contiguous memory + vectorization
In-place operations      2.6×       No allocation per iteration
Optimized libraries      1.8×       Better cache utilization

The Takeaways

  1. Python lists aren't arrays. They're collections of pointers. For numerical work, use NumPy.

  2. Memory layout matters. Contiguous data gets cache hits. Scattered data gets cache misses. Cache misses are 25× slower.

  3. Vectorize everything. Let NumPy use SIMD instructions instead of writing Python for loops over numerical data.

  4. Allocations are expensive. Create arrays once and reuse them with out= parameters for in-place operations.

  5. Measure before you optimize. Use perf to see what's happening. Cache misses mean memory fragmentation. Too many instructions mean too much Python overhead.

  6. GPUs aren't always faster. Transfer overhead and orchestration can dominate runtime. Profile before assuming GPUs will help.

The fastest code is code that doesn't run. NumPy and SciPy let you express operations at a high level while C code handles low-level optimizations.

What's Next?

This is just the beginning. Other optimization approaches include:

  • Cython: Compile Python to C for even more speed
  • Numba: JIT compilation for NumPy code (see the sketch below)
  • OpenCV: Highly optimized computer vision library
  • JAX: NumPy with auto-differentiation and XLA compilation
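As a taste, Numba compiles the original nested loops almost unchanged (a minimal sketch; assumes numba is installed):

import numpy as np
from numba import njit

@njit
def blur(image):
    # The same four nested loops, JIT-compiled to machine code
    h, w = image.shape
    result = np.zeros_like(image)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            total = 0.0
            for ki in range(-1, 2):
                for kj in range(-1, 2):
                    total += image[i + ki, j + kj]
            result[i, j] = total / 9
    return result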

The principles remain the same: understand your hardware, minimize memory movement, and let specialized libraries do the heavy lifting.

Your CPU is powerful. You just need to feed it data fast enough.