Your program reads a file. Makes an HTTP request. Queries a database. Each time, something strange happens: the CPU sits idle, doing nothing, while your program waits for data to arrive.
In the time it takes to make a single network request (about 1 millisecond), a 2.4 GHz CPU could have completed 2,400,000 instructions. That's a lot of wasted potential. This wasted time is called I/O wait, and it's often the real bottleneck in modern applications.
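To make that concrete, here is the back-of-the-envelope arithmetic (the 2.4 GHz clock and 1 ms latency are just the assumptions above, and one instruction per cycle is a simplification):

CLOCK_HZ = 2.4e9        # 2.4 GHz CPU, roughly one instruction per cycle
IO_WAIT_SECONDS = 1e-3  # ~1 ms spent waiting on the network

wasted_instructions = CLOCK_HZ * IO_WAIT_SECONDS
print(f"{wasted_instructions:,.0f} instructions sit idle per request")  # 2,400,000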
In this article, we'll explore two powerful techniques for making Python programs faster: asynchronous I/O for I/O-bound tasks, and multiprocessing for CPU-bound tasks. Along the way, you'll understand why Python's threading has limitations, and when to use each approach.
Every time your code reads from a file or writes to a network socket, it must pause and ask the operating system kernel to perform the actual operation. Your program doesn't read from disk directly; the kernel does. And while the kernel is busy fetching data from those slow devices, your program is stuck waiting.
This is called a context switch. Your program's state gets saved, the CPU moves on to other work, and later your program resumes once the data is ready. But here's the thing: during all that waiting time, your program could have been doing something useful.
How do we keep the CPU busy during I/O wait instead of just sitting idle?
Before we dive into solutions, we need to understand what kind of bottleneck we're dealing with. Programs fall into two categories:
CPU-bound tasks spend most of their time doing calculations. The bottleneck is processor speed.
Examples: Image processing, machine learning training, cryptographic operations, scientific computing
I/O-bound tasks spend most of their time waiting for input/output to complete. The bottleneck is the latency of disks, networks, and databases, not the CPU.
Examples: Web scraping, API calls, database queries, reading and writing files
This distinction is crucial. The solution for one type of problem can actually make the other type worse. Async I/O helps I/O-bound programs by doing other work during wait times, but it won't speed up CPU-bound code at all.
For I/O-bound programs, use asynchronous I/O to utilize wait time. For CPU-bound programs, use multiprocessing to harness multiple cores.
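As a rough sketch of the difference, here are two toy functions (the names, numbers, and URL are made up for illustration): one is limited by how fast the processor can crunch, the other by how long it waits.

import time
import requests

def cpu_bound_example(n=10_000_000):
    # CPU-bound: the processor is busy the entire time
    return sum(i * i for i in range(n))

def io_bound_example(url="https://example.com"):
    # I/O-bound: almost all the time is spent waiting for the network
    return len(requests.get(url).text)

start = time.perf_counter()
cpu_bound_example()
print(f"CPU-bound: {time.perf_counter() - start:.2f}s of pure computation")

start = time.perf_counter()
io_bound_example()
print(f"I/O-bound: {time.perf_counter() - start:.2f}s, mostly waiting")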
Concurrency and parallelism often get confused, but they're fundamentally different:
Asynchronous programming gives us concurrency. When one task is waiting for I/O, we switch to another task. All on a single thread, sharing the same memory space. This is simpler than true parallelism because we don't have to worry about two pieces of code modifying the same data at the exact same time.
Since concurrent code runs on a single thread, you avoid many of the headaches of multi-threaded programming: race conditions, deadlocks, and the need for locks. All your async functions share the same memory space, so passing data between them works exactly as you'd expect.
However, you still need to be careful about the order of operations. You can't be sure which lines of code run when, only that they won't run at the exact same time.
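As a small illustration (a made-up counter, not from any library): two coroutines read a shared value, hit an await, and then write it back. Because a task can be suspended at every await, the updates interleave and one increment is lost - the order of operations still matters even without threads.

import asyncio

counter = 0

async def increment():
    global counter
    current = counter       # read
    await asyncio.sleep(0)  # suspension point - the other task may run here
    counter = current + 1   # write back a stale value

async def main():
    await asyncio.gather(increment(), increment())
    print(counter)  # Prints 1, not 2: the tasks interleaved at the await

asyncio.run(main())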
At the heart of async programming is the event loop. It's a simple concept: a loop that maintains a queue of tasks and runs them one at a time. When a task needs to wait for I/O, it yields control back to the loop, which picks up the next task.
Here's a toy implementation to illustrate the concept:
from queue import Queue

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()  # Get next task
            function()             # Run it

def do_hello():
    global eventloop
    print("Hello")
    eventloop.put(do_world)  # Schedule next task

def do_world():
    global eventloop
    print("world")
    eventloop.put(do_hello)  # Schedule next task

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()

Each function, when it finishes or needs to wait, puts the next function onto the queue. The loop keeps pulling functions off the queue and running them; in this toy example each task schedules the next, so it ping-pongs between "Hello" and "world" forever until you interrupt it.
Before Python 3.4, asynchronous programming used callbacks extensively. When you started an async operation, you'd pass a function to be called when the result was ready. This led to deeply nested code known as "callback hell."
With callbacks, the result is only available inside the callback function, so sequential operations force you to nest one callback inside another.
def save_value(value, callback):
    result = f"Hello {value}"
    print(f"Saving {result} to database")
    save_result_to_db(result, callback)  # Async database helper: invokes `callback` when the write finishes

def print_response(db_response):
    print(f"Response: {db_response}")

# Usage - callback hell begins...
save_value("World", print_response)

Python 3.4 introduced the asyncio module, and Python 3.5 added the async and await keywords. Now async code looks almost like regular synchronous code:
import asyncio

async def fetch_data(url):
    print(f"Fetching {url}...")
    await asyncio.sleep(1)  # Simulates network delay
    return f"Data from {url}"

async def main():
    # These run concurrently, not sequentially!
    results = await asyncio.gather(
        fetch_data("https://api.example.com/users"),
        fetch_data("https://api.example.com/posts"),
        fetch_data("https://api.example.com/comments"),
    )
    for result in results:
        print(result)

asyncio.run(main())

An async function (a coroutine) is built on Python's generator machinery. When you call an async function, it doesn't run immediately; it returns a coroutine object. The code only runs when you await it or when the event loop schedules it.
Calling an async function returns a coroutine object:
coro = fetch_data("url")
# Nothing runs yet!
# coro is just a promise of a future result

You can't just call an async function like a normal function. You need an event loop to run it. asyncio.run() creates an event loop, runs your coroutine until completion, and cleans up.
Inside an async function, you can await other coroutines directly. But at the top level of your program, you need asyncio.run() to start the whole thing.
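A minimal sketch of both rules, reusing fetch_data and asyncio from the example above (the exact warning text may vary by Python version):

async def caller():
    data = await fetch_data("https://api.example.com/users")  # OK: await inside an async function
    print(data)

fetch_data("https://api.example.com/users")  # Wrong: nothing runs, Python warns "coroutine ... was never awaited"

asyncio.run(caller())  # Right: the event loop drives everything from the top level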
Let's compare serial and async approaches to fetching multiple URLs. First, the serial version using the popular requests library:
import requests

def fetch_url(session, url):
    response = session.get(url)
    return len(response.text)

def run_serial(urls):
    with requests.Session() as session:
        results = []
        for url in urls:
            results.append(fetch_url(session, url))
        return results

# With 10 URLs taking 100ms each = 1000ms total

Each request waits for the previous one to complete. Now the async version using aiohttp:
import aiohttp
import asyncio

async def fetch_url(session, url):
    async with session.get(url) as response:
        text = await response.text()
        return len(text)

async def run_async(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

# With 10 URLs taking 100ms each = ~100ms total!

The speedup is dramatic. Instead of waiting for each request sequentially, we fire them all at once and wait for all of them together. The total time is roughly the time of the slowest individual request.
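To see the difference yourself, time both versions against the same list of URLs. A rough harness, assuming the run_serial and run_async functions above live in separate modules (or that you've renamed one of the fetch_url helpers) and that you supply real URLs:

import time
import asyncio

urls = ["https://example.com"] * 10  # Replace with URLs you actually want to fetch

start = time.perf_counter()
run_serial(urls)
print(f"Serial: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
asyncio.run(run_async(urls))
print(f"Async:  {time.perf_counter() - start:.2f}s")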
Python 3.11 introduced asyncio.TaskGroup, a cleaner way to manage multiple concurrent tasks with better error handling:
async def run_with_taskgroup(urls):
    async with aiohttp.ClientSession() as session:
        async with asyncio.TaskGroup() as tg:
            tasks = []
            for url in urls:
                task = tg.create_task(fetch_url(session, url))
                tasks.append(task)
        # All tasks complete when we exit the TaskGroup
        return [task.result() for task in tasks]

Tasks don't start immediately when created. They run when the event loop gives them time. In the example above, none of the fetch_url calls start until we reach a point where the current coroutine awaits something.
The asyncio.TaskGroup context manager ensures all tasks complete (or are cancelled on error) before exiting.
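If one task raises, the TaskGroup cancels any still-running siblings and re-raises the failures wrapped in an ExceptionGroup, which you can handle with the except* syntax (also new in Python 3.11). A small sketch with a made-up coroutine:

import asyncio

async def might_fail(n):
    await asyncio.sleep(0.1)
    if n == 2:
        raise ValueError(f"task {n} failed")
    return n

async def main():
    try:
        async with asyncio.TaskGroup() as tg:
            for n in range(4):
                tg.create_task(might_fail(n))
    except* ValueError as eg:
        # Remaining tasks were cancelled; eg.exceptions holds the failure(s)
        print(f"Caught {len(eg.exceptions)} error(s)")

asyncio.run(main())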
Real applications often combine CPU work with I/O operations. Consider a pipeline where you compute something and then save the result to a database. How should you structure this?
Done naively, each I/O operation blocks until it completes, and nothing overlaps with the CPU work.
When your async function does CPU-intensive work, it blocks the event loop. No other tasks can run until it yields. The solution? Periodically yield control with await asyncio.sleep(0):
async def cpu_intensive_with_io():
    for i in range(1000):
        result = do_heavy_computation(i)
        asyncio.create_task(save_to_db(result))  # Schedule the save as a task; don't await it here
        await asyncio.sleep(0)  # Let other tasks run!

This is arguably the most important line for CPU-heavy async code. Without it, none of your queued I/O tasks would run until the CPU work finishes.
Python has a peculiarity that surprises many developers: the Global Interpreter Lock. The GIL ensures that only one thread executes Python bytecode at a time, even on multi-core machines.
Why does Python have this? The GIL simplifies memory management by preventing race conditions on Python's internal data structures. It makes the interpreter simpler and single-threaded code faster. But it means threading doesn't help for CPU-bound work.
If threads can't run Python code in parallel, how do we utilize multiple CPU cores for CPU-bound work?
Use multiprocessing instead of threading. Each process has its own Python interpreter with its own GIL.
The GIL is released during I/O operations (file reads, network calls) and by certain C extensions (like NumPy). This is why threading still works well for I/O-bound tasks: while one thread waits for I/O, another can run.
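A small experiment you can run to see this for yourself (numbers are illustrative, and time.sleep stands in for I/O because it releases the GIL just like a real network wait):

import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(_):
    return sum(i * i for i in range(5_000_000))  # Pure Python: holds the GIL

def io_work(_):
    time.sleep(0.5)  # Releases the GIL while "waiting"

for name, fn in [("CPU-bound", cpu_work), ("I/O-bound", io_work)]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(fn, range(4)))
    print(f"{name} with 4 threads: {time.perf_counter() - start:.2f}s")

# The I/O-bound batch finishes in roughly 0.5s because the sleeps overlap;
# the CPU-bound batch takes about four times one task, as if run one after another.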
Python 3.13 ships an experimental free-threaded build without the GIL (PEP 703), but it's not production-ready yet.
The multiprocessing module spawns separate Python processes. Each process has its own memory space and its own GIL, allowing true parallel execution.
from multiprocessing import Pool

def cpu_bound_task(n):
    """Simulate CPU-intensive work"""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

def run_serial(numbers):
    return [cpu_bound_task(n) for n in numbers]

def run_parallel(numbers):
    with Pool() as pool:
        return pool.map(cpu_bound_task, numbers)

# Serial: processes one at a time
# Parallel: uses all CPU cores

Multiprocessing isn't free. There are important costs to consider: each worker process takes time to start, carries its own interpreter and memory, and any data passed between processes must be serialized (pickled) and copied. A common pattern is to hand work to processes through queues, which makes that serialization explicit:
from multiprocessing import Queue

def worker(task_queue, result_queue):
    while True:
        task = task_queue.get()
        if task is None:  # Sentinel value: shut the worker down
            break
        result = process(task)
        result_queue.put(result)

# Data must be serializable to pass between processes
# Large numpy arrays? Use shared memory instead.

For maximum performance, you can combine both approaches: use multiprocessing for CPU-bound work, with each process using async I/O for network operations.
import asyncio
from concurrent.futures import ProcessPoolExecutor

# heavy_computation, save_results, and data stand in for your own functions and inputs

def cpu_task(data):
    """CPU-bound work - runs in separate process"""
    return heavy_computation(data)

async def main():
    loop = asyncio.get_running_loop()
    executor = ProcessPoolExecutor(max_workers=4)
    # Run CPU work in process pool
    results = await loop.run_in_executor(
        executor,
        cpu_task,
        data
    )
    # Continue with async I/O
    await save_results(results)

asyncio.run(main())

We've covered a lot of ground. Here are the key takeaways:
I/O-bound programs spend most of their time waiting. Async I/O keeps a single thread busy by switching between tasks during those waits.
CPU-bound programs are limited by the processor. The GIL means threads won't run Python code in parallel, so multiprocessing is the way to use multiple cores.
Async tasks only switch at await points, so long stretches of CPU work block the event loop; yield with await asyncio.sleep(0) or move the work to a process pool.
The two approaches combine well: run CPU-heavy work in a ProcessPoolExecutor via run_in_executor while the event loop handles the I/O.
Next steps: Try refactoring a slow I/O-bound script to use async. Measure the speedup. Then identify any CPU-bound bottlenecks and consider multiprocessing for those.
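A tiny helper you might use for that before-and-after measurement (just a sketch using the standard library; run_serial, run_async, and urls are whatever you're benchmarking):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Usage:
# with timed("before"):
#     run_serial(urls)
# with timed("after"):
#     asyncio.run(run_async(urls))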