Why Is My Python So Slow?
You wrote an image processing script to blur photos, detect edges, or preprocess data for machine learning. You ran it and the performance is terrible. One image takes seconds. A batch of thousands takes hours.
Python isn't inherently slow. Your Python is slow because of how you're using it, not because of the language itself.
The Problem: Image Blur
This post uses a concrete example: blurring an image. Everyone knows what a blurred image looks like, and the operation demonstrates matrix computation clearly.
The algorithm replaces each pixel with the average of its neighbors. This is called a box blur, the simplest form of convolution:
```python
# Replace each pixel with the average of its 3x3 neighborhood.
# Loop over the interior only, so i + ki and j + kj stay in bounds.
for i in range(1, height - 1):
    for j in range(1, width - 1):
        total = 0
        for ki in range(-1, 2):
            for kj in range(-1, 2):
                total += image[i + ki][j + kj]
        blurred[i][j] = total / 9
```

Four nested loops. Just addition and division. The simplest operations a computer can do.
So why does this take forever on a large image?
Python Lists Are Not What You Think
When you write image = [[0] * width for _ in range(height)], you're creating a list of pointers (memory addresses that reference data stored elsewhere) to other lists of pointers to Python objects scattered in memory. You're not creating a contiguous block of numbers.
Every time you access image[i][j], the CPU has to:
- Look up the pointer at position `i` in the outer list
- Follow that pointer to find the inner list
- Look up the pointer at position `j` in that list
- Follow that pointer to find the actual number
Four steps to access one pixel.
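You can see the per-element overhead directly from Python. A minimal sketch (sizes and addresses are CPython-specific and will vary between runs):

```python
import sys

row = [float(x) for x in range(5)]

# Each element is a full Python float object, not a bare 8-byte value.
print(sys.getsizeof(row[0]))   # typically 24 bytes per float object in CPython
print(sys.getsizeof(row))      # the list itself stores only pointers to those objects

# id() exposes each object's address: five separate objects on the heap,
# with no guarantee they sit next to one another.
print([id(x) for x in row])
```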
The computer can access any memory address, but not all memory accesses are the same speed. To understand why Python's pointer chasing is so expensive, we need to look at how the CPU actually fetches data from memory.
The Memory Hierarchy
Modern CPUs can do billions of calculations per second, but RAM can't keep up.
Fetching data from RAM takes about 100 CPU cycles (the basic unit of CPU time - one tick of the processor's clock). The CPU sits idle for 100 cycles waiting for data. This is the von Neumann bottleneck: the limited bandwidth between the CPU and memory that prevents data from being delivered as fast as the CPU can process it.
CPUs solve this with caches: small, fast memory banks built directly into the processor chip, close to the processing cores.
| Memory Level | Size | Speed |
|---|---|---|
| L1 Cache | ~32 KB | 4 cycles |
| L2 Cache | ~256 KB | 12 cycles |
| L3 Cache | ~8 MB | 40 cycles |
| RAM | ~16 GB | 100+ cycles |
When the CPU needs data, it first checks L1. If it's not there (cache miss), it checks L2, then L3, then RAM. Each level is slower.
Caches load data in 64-byte chunks called cache lines. When you access memory address 100, the CPU loads the entire line containing it (addresses 64-127) into cache. If your next access is address 101, it's already there (cache hit). If your next access is address 500, that's a different line, so you need another RAM trip (cache miss).
Sequential access patterns are fast. Random access is slow. Python's fragmented list structure forces random access.
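You can feel the difference from Python. A quick sketch (timings depend on your machine) that reads the same elements of an array in sequential order and then in a shuffled order:

```python
import numpy as np
import timeit

a = np.arange(10_000_000, dtype=np.float64)      # ~80 MB, far larger than any cache
seq = np.arange(a.size)                          # indices in order: walks memory sequentially
rnd = np.random.permutation(a.size)              # same indices, shuffled: jumps all over RAM

print(timeit.timeit(lambda: a[seq], number=10))  # mostly cache hits
print(timeit.timeit(lambda: a[rnd], number=10))  # mostly cache misses, noticeably slower
```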
Memory Fragmentation
Python doesn't natively support vectorization because:
- Python lists store pointers to data, not the data itself
- The interpreter executes bytecode one operation at a time, so it never issues SIMD instructions
When data is fragmented (scattered across memory), you can't move it efficiently. You chase pointers across the heap instead of copying one big block. The CPU's prefetcher (hardware that predicts which memory you'll need next and loads it early) can't predict the pattern.
Sequential access: prefetcher predicts correctly.
Random access: prefetcher can't predict.
This prevents SIMD (Single Instruction, Multiple Data) operations - a CPU hardware feature that processes multiple numbers simultaneously. SIMD can multiply 4, 8, or 16 numbers in a single instruction, but only if those numbers sit next to each other in memory.
Measuring the Problem: perf
Before we fix the problem, let's measure it precisely. Understanding exactly what's happening at the hardware level will guide our optimizations.
Linux has a tool called perf that reads hardware performance counters directly from the CPU.
Run your Python script with:
```bash
perf stat python blur_python.py
```
The problems:
- 1 trillion instructions for a simple blur
- Only using 1 CPU due to Python's GIL (Global Interpreter Lock, which lets only one thread execute Python bytecode at a time, so pure Python code can't use multiple cores)
- 65 million cache misses from pointer chasing
Now that we understand the problem - scattered memory causing cache misses - let's see how NumPy fixes it.
Enter NumPy: Contiguous Memory
NumPy stores data as a contiguous block of raw numbers. No pointers, no Python object overhead, just bytes in a row.
```python
import numpy as np

image = np.zeros((height, width))  # One contiguous block
```

When you access `image[0, 0]`, the CPU loads a chunk of the image into cache. Accessing `image[0, 1]` through `image[0, 7]` costs almost nothing, because those float64 values came along in the same 64-byte cache line.
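You can inspect that layout directly. A small sketch (the shape here is just an example):

```python
import numpy as np

image = np.zeros((1080, 1920))          # float64 by default: 8 bytes per pixel

print(image.flags['C_CONTIGUOUS'])      # True: one solid block, row after row
print(image.strides)                    # (15360, 8): 8 bytes to the next column,
                                        # 1920 * 8 bytes to the next row
print(image.nbytes)                     # 16588800 bytes of raw data, no per-pixel objects
```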
Now run the same blur through NumPy and SciPy, and measure it with `perf` again.
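The full script isn't reproduced here, but the core of the NumPy version is a single call to `scipy.ndimage.convolve` with a 3×3 averaging kernel, roughly:

```python
import numpy as np
from scipy import ndimage

kernel = np.ones((3, 3)) / 9            # 3x3 box blur: average each neighborhood

def blur_numpy(image):
    # One call replaces all four Python loops; the iteration happens in compiled C.
    return ndimage.convolve(image, kernel)
```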
The differences:
- 89× fewer instructions (12 billion vs 1.1 trillion)
- 47× fewer cycles (11 billion vs 526 billion)
- Uses 1.6 CPUs instead of 1 (automatic parallelism)
Contiguous memory gets us 75× faster, but there's another optimization NumPy uses: vectorization. Instead of processing one number at a time, modern CPUs can process multiple numbers simultaneously.
Vectorization: Do Four Things at Once
Modern CPUs have SIMD (Single Instruction, Multiple Data) instructions. SIMD processes 4, 8, or 16 pixels in a single instruction instead of one at a time.
Python for loop: one element at a time.
NumPy: 4 elements per instruction.
A Python for loop forces the CPU to do one operation at a time. NumPy's vectorized operations use SIMD instructions behind the scenes.
Python Loop (h × w operations, no SIMD):

```python
# Process pixels one at a time
for i in range(height):
    for j in range(width):
        result[i, j] = image[i, j] * 2
```

NumPy Vectorized (h × w / 8 operations, uses SIMD):

```python
# Process 4-8 pixels per instruction
result = image * 2
```

Same result, dramatically different performance. For our blur operation, SciPy provides `scipy.ndimage.convolve`, which uses these vectorized instructions.
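To see the gap on your own machine, time both versions (absolute numbers will vary with CPU and array size):

```python
import numpy as np
import timeit

image = np.random.rand(1000, 1000)

def double_with_loop(img):
    result = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            result[i, j] = img[i, j] * 2
    return result

print(timeit.timeit(lambda: double_with_loop(image), number=3))  # on the order of seconds
print(timeit.timeit(lambda: image * 2, number=3))                # on the order of milliseconds
```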
The Hidden Cost: Memory Allocation
Contiguous memory and vectorization give us massive speedups, but there's one more optimization that often gets overlooked: avoiding unnecessary memory allocations.
Every time we process a frame or apply a filter, we might create new arrays:
```python
def blur(image):
    result = np.zeros_like(image)  # Allocation happens here!
    # ... convolution ...
    return result
```

Memory allocation is expensive:
- Ask the operating system for memory (system call overhead)
- Find a suitable free block (memory management overhead)
- Initialize the memory to zeros (writing to every byte)
- Later, garbage collect the old array (cleanup overhead)
Reuse memory instead. Allocate your arrays once, then write to them in-place:
```python
import numpy as np
import scipy.ndimage

# Allocate once, outside the loop
result = np.zeros((height, width))

for frame in video_frames:
    # Write into the existing array, no allocation
    scipy.ndimage.convolve(frame, kernel, output=result)
    process(result)
```

NumPy and SciPy support in-place operations with the `output=` or `out=` parameter:
```python
# This allocates a new array
c = a + b

# This writes into an existing array
np.add(a, b, out=c)
```

Going Further: GPUs and Other Tools
For even larger workloads, GPUs can process thousands of operations in parallel using libraries like PyTorch or CuPy. However, for most image processing tasks like our blur operation, the overhead of transferring data between CPU and GPU outweighs the benefits unless you're processing very large batches (thousands of images) or working with 4K+ resolution.
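For scale, here is a minimal sketch of the GPU route using PyTorch (it assumes a CUDA-capable GPU and that `torch` is installed); the two copies, `.to("cuda")` and `.cpu()`, are exactly the transfer overhead described above:

```python
import numpy as np
import torch
import torch.nn.functional as F

image = np.random.rand(2160, 3840).astype(np.float32)    # a single 4K frame

img = torch.from_numpy(image).to("cuda")                  # host -> device copy: the overhead
kernel = torch.ones(1, 1, 3, 3, device="cuda") / 9.0      # 3x3 box-blur kernel

# conv2d expects (batch, channels, height, width), so add two leading dimensions
blurred = F.conv2d(img[None, None], kernel, padding=1)[0, 0]

result = blurred.cpu().numpy()                             # device -> host copy
```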
Putting It All Together
Here's how our blur code evolves as we apply each optimization:
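A condensed sketch of that progression (the exact scripts behind the numbers below may differ in detail):

```python
import numpy as np
from scipy import ndimage

kernel = np.ones((3, 3)) / 9

# Stage 1: pure Python -- nested loops over lists of lists (the slow baseline)

# Stage 2: NumPy arrays + scipy.ndimage.convolve -- contiguous memory, SIMD
def blur_v2(image):
    return ndimage.convolve(image, kernel)

# Stage 3: reuse one output buffer across images -- no allocation per call
def blur_v3(image, out):
    ndimage.convolve(image, kernel, output=out)
    return out
```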
The Results
From 631 seconds to under 2 seconds. A 350× speedup without changing the algorithm.
| Implementation | Time | Speedup | Description |
|---|---|---|---|
| Pure Python | 631s | 1× | Nested loops with Python lists |
| NumPy basic | 8.5s | 74× | NumPy arrays with scipy.ndimage.convolve |
| NumPy in-place | 3.2s | 197× | Reusing arrays with output= parameter |
| Optimized | 1.8s | 350× | Optimized convolution with SIMD |
Broken down by individual optimization:

| Optimization | Speedup | Why |
|---|---|---|
| Reuse memory (Python) | 1.16× | Fewer allocations |
| NumPy arrays | 75× | Contiguous memory + vectorization |
| In-place operations | 2.6× | No allocation per iteration |
| Optimized libraries | 1.8× | Better cache utilization |
The Takeaways
- Python lists aren't arrays. They're collections of pointers. For numerical work, use NumPy.
- Memory layout matters. Contiguous data gets cache hits. Scattered data gets cache misses. Cache misses are 25× slower.
- Vectorize everything. Let NumPy use SIMD instructions instead of writing Python for loops over numerical data.
- Allocations are expensive. Create arrays once and reuse them with `out=` parameters for in-place operations.
- Measure before you optimize. Use `perf` to see what's happening. Cache misses mean memory fragmentation. Too many instructions mean too much Python overhead.
- GPUs aren't always faster. Transfer overhead and orchestration can dominate runtime. Profile before assuming GPUs will help.
The fastest code is code that doesn't run. NumPy and SciPy let you express operations at a high level while C code handles low-level optimizations.
What's Next?
This is just the beginning. Other optimization approaches include:
- Cython: Compile Python to C for even more speed
- Numba: JIT compilation for NumPy code (a small sketch follows this list)
- OpenCV: Highly optimized computer vision library
- JAX: NumPy with auto-differentiation and XLA compilation
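As a taste of the Numba route, here is a minimal sketch (it assumes `numba` is installed, and the function name is just illustrative): the original nested loops come back, but the decorator compiles them to machine code on first call.

```python
import numpy as np
from numba import njit

@njit
def blur_numba(image):
    h, w = image.shape
    out = np.zeros_like(image)           # border pixels are left at zero in this sketch
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            total = 0.0
            for ki in range(-1, 2):
                for kj in range(-1, 2):
                    total += image[i + ki, j + kj]
            out[i, j] = total / 9.0
    return out
```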
The principles remain the same: understand your hardware, minimize memory movement, and let specialized libraries do the heavy lifting.
Your CPU is powerful. You just need to feed it data fast enough.