Why Is My Python So Slow?
You wrote an image processing script to blur photos, detect edges, or preprocess data for machine learning. You ran it and the performance is terrible. One image takes seconds. A batch of thousands takes hours.
Python isn't inherently slow. Your Python is slow because of how you're using it, not because of the language itself.
The Problem: Image Blur
This post uses a concrete example: blurring an image. Everyone knows what a blurred image looks like, and the operation demonstrates matrix computation clearly.
The algorithm replaces each pixel with the average of its neighbors. This is called a box blur, the simplest form of convolution:
```python
# Replace each pixel with the average of its 3x3 neighborhood.
# Loop over the interior only, so i + ki and j + kj stay in bounds.
for i in range(1, height - 1):
    for j in range(1, width - 1):
        total = 0
        for ki in range(-1, 2):
            for kj in range(-1, 2):
                total += image[i + ki][j + kj]
        blurred[i][j] = total / 9
```

Four nested loops. Just addition and division. The simplest operations a computer can do.
So why does this take forever on a large image?
Python Lists Are Not What You Think
When you write image = [[0] * width for _ in range(height)], you're creating a list of pointers (memory addresses that reference data stored elsewhere) to other lists of pointers to Python objects scattered in memory. You're not creating a contiguous block of numbers.
Every time you access image[i][j], the CPU has to:
- Look up the pointer at position `i` in the outer list
- Follow that pointer to find the inner list
- Look up the pointer at position `j` in that list
- Follow that pointer to find the actual number
Four steps to access one pixel.
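You can see the per-element overhead directly from Python. A minimal sketch (sizes and addresses are CPython-specific and will vary between runs):

```python
import sys

row = [float(x) for x in range(5)]

# Each element is a full Python float object, not a bare 8-byte value.
print(sys.getsizeof(row[0]))   # typically 24 bytes per float object in CPython
print(sys.getsizeof(row))      # the list itself stores only pointers to those objects

# id() exposes each object's address: five separate objects on the heap,
# with no guarantee they sit next to one another.
print([id(x) for x in row])
```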
The computer can access any memory address, but not all memory accesses are the same speed. To understand why Python's pointer chasing is so expensive, we need to look at how the CPU actually fetches data from memory.
The Memory Hierarchy
Modern CPUs can do billions of calculations per second, but RAM can't keep up.
Fetching data from RAM takes about 100 CPU cycles (the basic unit of CPU time - one tick of the processor's clock). The CPU sits idle for 100 cycles waiting for data. This is the von Neumann bottleneck: the limited bandwidth between the CPU and memory that prevents data from being delivered as fast as the CPU can process it.
CPUs solve this with caches: small, fast memory banks built directly into the processor chip, close to the processing cores.
| Memory Level | Size | Speed |
|---|---|---|
| L1 Cache | ~32 KB | 4 cycles |
| L2 Cache | ~256 KB | 12 cycles |
| L3 Cache | ~8 MB | 40 cycles |
| RAM | ~16 GB | 100+ cycles |
When the CPU needs data, it first checks L1. If it's not there (cache miss), it checks L2, then L3, then RAM. Each level is slower.
Caches load data in 64-byte chunks called cache lines. When you access memory address 100, the CPU loads the entire line containing it (addresses 64-127) into cache. If your next access is address 101, it's already there (cache hit). If your next access is address 500, that's a different line, so you need another RAM trip (cache miss).
Sequential access patterns are fast. Random access is slow. Python's fragmented list structure forces random access.
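You can feel the difference from Python. A quick sketch (timings depend on your machine) that reads the same elements of an array in sequential order and then in a shuffled order:

```python
import numpy as np
import timeit

a = np.arange(10_000_000, dtype=np.float64)      # ~80 MB, far larger than any cache
seq = np.arange(a.size)                          # indices in order: walks memory sequentially
rnd = np.random.permutation(a.size)              # same indices, shuffled: jumps all over RAM

print(timeit.timeit(lambda: a[seq], number=10))  # mostly cache hits
print(timeit.timeit(lambda: a[rnd], number=10))  # mostly cache misses, noticeably slower
```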
Memory Fragmentation
Python doesn't natively support vectorization because:
- Python lists store pointers to data, not the data itself
- The interpreter executes bytecode one operation at a time, so it never issues SIMD instructions
When data is fragmented (scattered across memory), you can't move it efficiently. You chase pointers across the heap instead of copying one big block. The CPU's prefetcher (hardware that predicts which memory you'll need next and loads it early) can't predict the pattern.
Sequential access: prefetcher predicts correctly.
Random access: prefetcher can't predict.
This prevents SIMD (Single Instruction, Multiple Data) operations - a CPU hardware feature that processes multiple numbers simultaneously. SIMD can multiply 4, 8, or 16 numbers in a single instruction, but only if those numbers sit next to each other in memory.
Measuring the Problem: perf
Before we fix the problem, let's measure it precisely. Understanding exactly what's happening at the hardware level will guide our optimizations.
Linux has a tool called perf that reads hardware performance counters directly from the CPU.
Run your Python script with:
```bash
perf stat python blur_python.py
```
The problems:
- 1 trillion instructions for a simple blur
- Only using 1 CPU due to Python's GIL (Global Interpreter Lock, which lets only one thread execute Python bytecode at a time, so pure Python code can't use multiple cores)
- 65 million cache misses from pointer chasing
Now that we understand the problem - scattered memory causing cache misses - let's see how NumPy fixes it.
Enter NumPy: Contiguous Memory
NumPy stores data as a contiguous block of raw numbers. No pointers, no Python object overhead, just bytes in a row.
```python
import numpy as np

image = np.zeros((height, width))  # One contiguous block
```

When you access `image[0, 0]`, the CPU loads a chunk of the image into cache. Accessing `image[0, 1]` through `image[0, 7]` costs almost nothing, because those float64 values came along in the same 64-byte cache line.
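You can inspect that layout directly. A small sketch (the shape here is just an example):

```python
import numpy as np

image = np.zeros((1080, 1920))          # float64 by default: 8 bytes per pixel

print(image.flags['C_CONTIGUOUS'])      # True: one solid block, row after row
print(image.strides)                    # (15360, 8): 8 bytes to the next column,
                                        # 1920 * 8 bytes to the next row
print(image.nbytes)                     # 16588800 bytes of raw data, no per-pixel objects
```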
Now run the same blur through NumPy and SciPy, and measure it with `perf` again.
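The full script isn't reproduced here, but the core of the NumPy version is a single call to `scipy.ndimage.convolve` with a 3×3 averaging kernel, roughly:

```python
import numpy as np
from scipy import ndimage

kernel = np.ones((3, 3)) / 9            # 3x3 box blur: average each neighborhood

def blur_numpy(image):
    # One call replaces all four Python loops; the iteration happens in compiled C.
    return ndimage.convolve(image, kernel)
```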
The differences:
- 89× fewer instructions (12 billion vs 1.1 trillion)
- 47× fewer cycles (11 billion vs 526 billion)
- Uses 1.6 CPUs instead of 1 (automatic parallelism)
Contiguous memory gets us 75× faster, but there's another optimization NumPy uses: vectorization. Instead of processing one number at a time, modern CPUs can process multiple numbers simultaneously.
Vectorization: Do Four Things at Once
Modern CPUs have SIMD (Single Instruction, Multiple Data) instructions. SIMD processes 4, 8, or 16 pixels in a single instruction instead of one at a time.
Python for loop: one element at a time.
NumPy: 4 elements per instruction.
A Python for loop forces the CPU to do one operation at a time. NumPy's vectorized operations use SIMD instructions behind the scenes.
Python Loop (h × w operations, no SIMD):

```python
# Process pixels one at a time
for i in range(height):
    for j in range(width):
        result[i, j] = image[i, j] * 2
```

NumPy Vectorized (h × w / 8 operations, uses SIMD):

```python
# Process 4-8 pixels per instruction
result = image * 2
```

Same result, dramatically different performance. For our blur operation, SciPy provides `scipy.ndimage.convolve`, which uses these vectorized instructions.
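To see the gap on your own machine, time both versions (absolute numbers will vary with CPU and array size):

```python
import numpy as np
import timeit

image = np.random.rand(1000, 1000)

def double_with_loop(img):
    result = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            result[i, j] = img[i, j] * 2
    return result

print(timeit.timeit(lambda: double_with_loop(image), number=3))  # on the order of seconds
print(timeit.timeit(lambda: image * 2, number=3))                # on the order of milliseconds
```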
The Hidden Cost: Memory Allocation
Contiguous memory and vectorization give us massive speedups, but there's one more optimization that often gets overlooked: avoiding unnecessary memory allocations.
Every time we process a frame or apply a filter, we might create new arrays:
```python
def blur(image):
    result = np.zeros_like(image)  # Allocation happens here!
    # ... convolution ...
    return result
```

Memory allocation is expensive:
- Ask the operating system for memory (system call overhead)
- Find a suitable free block (memory management overhead)
- Initialize the memory to zeros (writing to every byte)
- Later, garbage collect the old array (cleanup overhead)
Reuse memory instead. Allocate your arrays once, then write to them in-place:
```python
import numpy as np
import scipy.ndimage

# Allocate once, outside the loop
result = np.zeros((height, width))

for frame in video_frames:
    # Write into the existing array, no allocation
    scipy.ndimage.convolve(frame, kernel, output=result)
    process(result)
```

NumPy and SciPy support in-place operations with the `output=` or `out=` parameter:
```python
# This allocates a new array
c = a + b

# This writes into an existing array
np.add(a, b, out=c)
```

Going Further: GPUs and Other Tools
For even larger workloads, GPUs can process thousands of operations in parallel using libraries like PyTorch or CuPy. However, for most image processing tasks like our blur operation, the overhead of transferring data between CPU and GPU outweighs the benefits unless you're processing very large batches (thousands of images) or working with 4K+ resolution.
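For scale, here is a minimal sketch of the GPU route using PyTorch (it assumes a CUDA-capable GPU and that `torch` is installed); the two copies, `.to("cuda")` and `.cpu()`, are exactly the transfer overhead described above:

```python
import numpy as np
import torch
import torch.nn.functional as F

image = np.random.rand(2160, 3840).astype(np.float32)    # a single 4K frame

img = torch.from_numpy(image).to("cuda")                  # host -> device copy: the overhead
kernel = torch.ones(1, 1, 3, 3, device="cuda") / 9.0      # 3x3 box-blur kernel

# conv2d expects (batch, channels, height, width), so add two leading dimensions
blurred = F.conv2d(img[None, None], kernel, padding=1)[0, 0]

result = blurred.cpu().numpy()                             # device -> host copy
```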
Putting It All Together
Here's how our blur code evolves as we apply each optimization:
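A condensed sketch of that progression (the exact scripts behind the numbers below may differ in detail):

```python
import numpy as np
from scipy import ndimage

kernel = np.ones((3, 3)) / 9

# Stage 1: pure Python -- nested loops over lists of lists (the slow baseline)

# Stage 2: NumPy arrays + scipy.ndimage.convolve -- contiguous memory, SIMD
def blur_v2(image):
    return ndimage.convolve(image, kernel)

# Stage 3: reuse one output buffer across images -- no allocation per call
def blur_v3(image, out):
    ndimage.convolve(image, kernel, output=out)
    return out
```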
The Results
From 631 seconds to under 2 seconds. A 350× speedup without changing the algorithm.
| Implementation | Time | Speedup | Description |
|---|---|---|---|
| Pure Python | 631s | 1× | Nested loops with Python lists |
| NumPy basic | 8.5s | 74× | NumPy arrays with scipy.ndimage.convolve |
| NumPy in-place | 3.2s | 197× | Reusing arrays with output= parameter |
| Optimized | 1.8s | 350× | Optimized convolution with SIMD |
Broken down by individual optimization:

| Optimization | Speedup | Why |
|---|---|---|
| Reuse memory (Python) | 1.16× | Fewer allocations |
| NumPy arrays | 75× | Contiguous memory + vectorization |
| In-place operations | 2.6× | No allocation per iteration |
| Optimized libraries | 1.8× | Better cache utilization |
The Takeaways
- Python lists aren't arrays. They're collections of pointers. For numerical work, use NumPy.
- Memory layout matters. Contiguous data gets cache hits. Scattered data gets cache misses. Cache misses are 25× slower.
- Vectorize everything. Let NumPy use SIMD instructions instead of writing Python for loops over numerical data.
- Allocations are expensive. Create arrays once and reuse them with `out=` parameters for in-place operations.
- Measure before you optimize. Use `perf` to see what's happening. Cache misses mean memory fragmentation. Too many instructions mean too much Python overhead.
- GPUs aren't always faster. Transfer overhead and orchestration can dominate runtime. Profile before assuming GPUs will help.
The fastest code is code that doesn't run. NumPy and SciPy let you express operations at a high level while C code handles low-level optimizations.
What's Next?
This is just the beginning. Other optimization approaches include:
- Cython: Compile Python to C for even more speed
- Numba: JIT compilation for NumPy code (a small sketch follows this list)
- OpenCV: Highly optimized computer vision library
- JAX: NumPy with auto-differentiation and XLA compilation
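As a taste of the Numba route, here is a minimal sketch (it assumes `numba` is installed, and the function name is just illustrative): the original nested loops come back, but the decorator compiles them to machine code on first call.

```python
import numpy as np
from numba import njit

@njit
def blur_numba(image):
    h, w = image.shape
    out = np.zeros_like(image)           # border pixels are left at zero in this sketch
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            total = 0.0
            for ki in range(-1, 2):
                for kj in range(-1, 2):
                    total += image[i + ki, j + kj]
            out[i, j] = total / 9.0
    return out
```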
The principles remain the same: understand your hardware, minimize memory movement, and let specialized libraries do the heavy lifting.
Your CPU is powerful. You just need to feed it data fast enough.