Your program reads a file. Makes an HTTP request. Queries a database. Each time, something strange happens: the CPU sits idle, doing nothing, while your program waits for data to arrive.
In the time it takes to make a single network request (about 1 millisecond), a 2.4 GHz CPU could have completed 2,400,000 instructions. That's a lot of wasted potential. This wasted time is called I/O wait, and it's often the real bottleneck in modern applications.
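To make that concrete, here is the back-of-the-envelope arithmetic (the 2.4 GHz clock and 1 ms latency are just the assumptions above, and one instruction per cycle is a simplification):

CLOCK_HZ = 2.4e9        # 2.4 GHz CPU, roughly one instruction per cycle
IO_WAIT_SECONDS = 1e-3  # ~1 ms spent waiting on the network

wasted_instructions = CLOCK_HZ * IO_WAIT_SECONDS
print(f"{wasted_instructions:,.0f} instructions sit idle per request")  # 2,400,000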
In this article, we'll explore two powerful techniques for making Python programs faster: asynchronous I/O for I/O-bound tasks, and multiprocessing for CPU-bound tasks. Along the way, you'll understand why Python's threading has limitations, and when to use each approach.
Every time your code reads from a file or writes to a network socket, it must pause and ask the operating system kernel to perform the actual operation. Your program doesn't read from disk directly; the kernel does. And while the kernel is busy fetching data from those slow devices, your program is stuck waiting.
This is called a context switch. Your program's state gets saved, the CPU moves on to other work, and later your program resumes once the data is ready. But here's the thing: during all that waiting time, your program could have been doing something useful.
How do we keep the CPU busy during I/O wait instead of just sitting idle?
Before we dive into solutions, we need to understand what kind of bottleneck we're dealing with. Programs fall into two categories:
CPU-bound tasks spend most of their time doing calculations. The bottleneck is processor speed.
Examples: Image processing, machine learning training, cryptographic operations, scientific computing
I/O-bound tasks spend most of their time waiting for input/output to complete. The bottleneck is the latency of disks, networks, and databases, not the CPU.
Examples: Web scraping, API calls, database queries, reading and writing files
This distinction is crucial. The solution for one type of problem can actually make the other type worse. Async I/O helps I/O-bound programs by doing other work during wait times, but it won't speed up CPU-bound code at all.
For I/O-bound programs, use asynchronous I/O to utilize wait time. For CPU-bound programs, use multiprocessing to harness multiple cores.
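As a rough sketch of the difference, here are two toy functions (the names, numbers, and URL are made up for illustration): one is limited by how fast the processor can crunch, the other by how long it waits.

import time
import requests

def cpu_bound_example(n=10_000_000):
    # CPU-bound: the processor is busy the entire time
    return sum(i * i for i in range(n))

def io_bound_example(url="https://example.com"):
    # I/O-bound: almost all the time is spent waiting for the network
    return len(requests.get(url).text)

start = time.perf_counter()
cpu_bound_example()
print(f"CPU-bound: {time.perf_counter() - start:.2f}s of pure computation")

start = time.perf_counter()
io_bound_example()
print(f"I/O-bound: {time.perf_counter() - start:.2f}s, mostly waiting")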
Concurrency and parallelism often get confused, but they're fundamentally different:
Asynchronous programming gives us concurrency. When one task is waiting for I/O, we switch to another task. All on a single thread, sharing the same memory space. This is simpler than true parallelism because we don't have to worry about two pieces of code modifying the same data at the exact same time.
Since concurrent code runs on a single thread, you avoid many of the headaches of multi-threaded programming: race conditions, deadlocks, and the need for locks. All your async functions share the same memory space, so passing data between them works exactly as you'd expect.
However, you still need to be careful about the order of operations. You can't be sure which lines of code run when, only that they won't run at the exact same time.
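As a small illustration (a made-up counter, not from any library): two coroutines read a shared value, hit an await, and then write it back. Because a task can be suspended at every await, the updates interleave and one increment is lost - the order of operations still matters even without threads.

import asyncio

counter = 0

async def increment():
    global counter
    current = counter       # read
    await asyncio.sleep(0)  # suspension point - the other task may run here
    counter = current + 1   # write back a stale value

async def main():
    await asyncio.gather(increment(), increment())
    print(counter)  # Prints 1, not 2: the tasks interleaved at the await

asyncio.run(main())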
At the heart of async programming is the event loop. It's a simple concept: a loop that maintains a queue of tasks and runs them one at a time. When a task needs to wait for I/O, it yields control back to the loop, which picks up the next task.
Here's a toy implementation to illustrate the concept:
from queue import Queue

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()  # Get next task
            function()             # Run it

def do_hello():
    global eventloop
    print("Hello")
    eventloop.put(do_world)  # Schedule next task

def do_world():
    global eventloop
    print("world")
    eventloop.put(do_hello)  # Schedule next task

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()

Each function, when it finishes or needs to wait, puts the next function onto the queue. The loop keeps pulling functions off the queue and running them; in this toy example each task schedules the next, so it ping-pongs between "Hello" and "world" forever until you interrupt it.
Before Python 3.4, asynchronous programming used callbacks extensively. When you started an async operation, you'd pass a function to be called when the result was ready. This led to deeply nested code known as "callback hell."
With callbacks, the result is only available inside the callback function, so sequential operations force you to nest one callback inside another.
def save_value(value, callback):
    result = f"Hello {value}"
    print(f"Saving {result} to database")
    save_result_to_db(result, callback)  # Async database helper: invokes `callback` when the write finishes

def print_response(db_response):
    print(f"Response: {db_response}")

# Usage - callback hell begins...
save_value("World", print_response)

Python 3.4 introduced the asyncio module, and Python 3.5 added the async and await keywords. Now async code looks almost like regular synchronous code:
import asyncio

async def fetch_data(url):
    print(f"Fetching {url}...")
    await asyncio.sleep(1)  # Simulates network delay
    return f"Data from {url}"

async def main():
    # These run concurrently, not sequentially!
    results = await asyncio.gather(
        fetch_data("https://api.example.com/users"),
        fetch_data("https://api.example.com/posts"),
        fetch_data("https://api.example.com/comments"),
    )
    for result in results:
        print(result)

asyncio.run(main())

An async function (a coroutine) is built on Python's generator machinery. When you call an async function, it doesn't run immediately; it returns a coroutine object. The code only runs when you await it or when the event loop schedules it.
Calling an async function returns a coroutine object:
coro = fetch_data("url")
# Nothing runs yet!
# coro is just a promise of a future result

You can't just call an async function like a normal function. You need an event loop to run it. asyncio.run() creates an event loop, runs your coroutine until completion, and cleans up.
Inside an async function, you can await other coroutines directly. But at the top level of your program, you need asyncio.run() to start the whole thing.
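A minimal sketch of both rules, reusing fetch_data and asyncio from the example above (the exact warning text may vary by Python version):

async def caller():
    data = await fetch_data("https://api.example.com/users")  # OK: await inside an async function
    print(data)

fetch_data("https://api.example.com/users")  # Wrong: nothing runs, Python warns "coroutine ... was never awaited"

asyncio.run(caller())  # Right: the event loop drives everything from the top level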
Let's compare serial and async approaches to fetching multiple URLs. First, the serial version using the popular requests library:
import requests

def fetch_url(session, url):
    response = session.get(url)
    return len(response.text)

def run_serial(urls):
    with requests.Session() as session:
        results = []
        for url in urls:
            results.append(fetch_url(session, url))
        return results

# With 10 URLs taking 100ms each = 1000ms total

Each request waits for the previous one to complete. Now the async version using aiohttp:
import aiohttp
import asyncio

async def fetch_url(session, url):
    async with session.get(url) as response:
        text = await response.text()
        return len(text)

async def run_async(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

# With 10 URLs taking 100ms each = ~100ms total!

The speedup is dramatic. Instead of waiting for each request sequentially, we fire them all at once and wait for all of them together. The total time is roughly the time of the slowest individual request.
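To see the difference yourself, time both versions against the same list of URLs. A rough harness, assuming the run_serial and run_async functions above live in separate modules (or that you've renamed one of the fetch_url helpers) and that you supply real URLs:

import time
import asyncio

urls = ["https://example.com"] * 10  # Replace with URLs you actually want to fetch

start = time.perf_counter()
run_serial(urls)
print(f"Serial: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
asyncio.run(run_async(urls))
print(f"Async:  {time.perf_counter() - start:.2f}s")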
Python 3.11 introduced asyncio.TaskGroup, a cleaner way to manage multiple concurrent tasks with better error handling:
async def run_with_taskgroup(urls):
    async with aiohttp.ClientSession() as session:
        async with asyncio.TaskGroup() as tg:
            tasks = []
            for url in urls:
                task = tg.create_task(fetch_url(session, url))
                tasks.append(task)
        # All tasks complete when we exit the TaskGroup
        return [task.result() for task in tasks]

Tasks don't start immediately when created. They run when the event loop gives them time. In the example above, none of the fetch_url calls start until we reach a point where the current coroutine awaits something.
The asyncio.TaskGroup context manager ensures all tasks complete (or are cancelled on error) before exiting.
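If one task raises, the TaskGroup cancels any still-running siblings and re-raises the failures wrapped in an ExceptionGroup, which you can handle with the except* syntax (also new in Python 3.11). A small sketch with a made-up coroutine:

import asyncio

async def might_fail(n):
    await asyncio.sleep(0.1)
    if n == 2:
        raise ValueError(f"task {n} failed")
    return n

async def main():
    try:
        async with asyncio.TaskGroup() as tg:
            for n in range(4):
                tg.create_task(might_fail(n))
    except* ValueError as eg:
        # Remaining tasks were cancelled; eg.exceptions holds the failure(s)
        print(f"Caught {len(eg.exceptions)} error(s)")

asyncio.run(main())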
Real applications often combine CPU work with I/O operations. Consider a pipeline where you compute something and then save the result to a database. How should you structure this?
Done naively, each I/O operation blocks until it completes, and nothing overlaps with the CPU work.
When your async function does CPU-intensive work, it blocks the event loop. No other tasks can run until it yields. The solution? Periodically yield control with await asyncio.sleep(0):
async def cpu_intensive_with_io():
    for i in range(1000):
        result = do_heavy_computation(i)
        asyncio.create_task(save_to_db(result))  # Schedule the save as a task; don't await it here
        await asyncio.sleep(0)  # Let other tasks run!

This is arguably the most important line for CPU-heavy async code. Without it, none of your queued I/O tasks would run until the CPU work finishes.
Python has a peculiarity that surprises many developers: the Global Interpreter Lock. The GIL ensures that only one thread executes Python bytecode at a time, even on multi-core machines.
Why does Python have this? The GIL simplifies memory management by preventing race conditions on Python's internal data structures. It makes the interpreter simpler and single-threaded code faster. But it means threading doesn't help for CPU-bound work.
If threads can't run Python code in parallel, how do we utilize multiple CPU cores for CPU-bound work?
Use multiprocessing instead of threading. Each process has its own Python interpreter with its own GIL.
The GIL is released during I/O operations (file reads, network calls) and by certain C extensions (like NumPy). This is why threading still works well for I/O-bound tasks: while one thread waits for I/O, another can run.
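A small experiment you can run to see this for yourself (numbers are illustrative, and time.sleep stands in for I/O because it releases the GIL just like a real network wait):

import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(_):
    return sum(i * i for i in range(5_000_000))  # Pure Python: holds the GIL

def io_work(_):
    time.sleep(0.5)  # Releases the GIL while "waiting"

for name, fn in [("CPU-bound", cpu_work), ("I/O-bound", io_work)]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(fn, range(4)))
    print(f"{name} with 4 threads: {time.perf_counter() - start:.2f}s")

# The I/O-bound batch finishes in roughly 0.5s because the sleeps overlap;
# the CPU-bound batch takes about four times one task, as if run one after another.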
Python 3.13 ships an experimental free-threaded build without the GIL (PEP 703), but it's not production-ready yet.
The multiprocessing module spawns separate Python processes. Each process has its own memory space and its own GIL, allowing true parallel execution.
from multiprocessing import Pool

def cpu_bound_task(n):
    """Simulate CPU-intensive work"""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

def run_serial(numbers):
    return [cpu_bound_task(n) for n in numbers]

def run_parallel(numbers):
    with Pool() as pool:
        return pool.map(cpu_bound_task, numbers)

# Serial: processes one at a time
# Parallel: uses all CPU cores

Multiprocessing isn't free. There are important costs to consider: each worker process takes time to start, carries its own interpreter and memory, and any data passed between processes must be serialized (pickled) and copied. A common pattern is to hand work to processes through queues, which makes that serialization explicit:
from multiprocessing import Queue

def worker(task_queue, result_queue):
    while True:
        task = task_queue.get()
        if task is None:  # Sentinel value: shut the worker down
            break
        result = process(task)
        result_queue.put(result)

# Data must be serializable to pass between processes
# Large numpy arrays? Use shared memory instead.

For maximum performance, you can combine both approaches: use multiprocessing for CPU-bound work, with each process using async I/O for network operations.
import asyncio
from concurrent.futures import ProcessPoolExecutor

# heavy_computation, save_results, and data stand in for your own functions and inputs

def cpu_task(data):
    """CPU-bound work - runs in separate process"""
    return heavy_computation(data)

async def main():
    loop = asyncio.get_running_loop()
    executor = ProcessPoolExecutor(max_workers=4)
    # Run CPU work in process pool
    results = await loop.run_in_executor(
        executor,
        cpu_task,
        data
    )
    # Continue with async I/O
    await save_results(results)

asyncio.run(main())

We've covered a lot of ground. Here are the key takeaways:
I/O-bound programs spend most of their time waiting. Async I/O keeps a single thread busy by switching between tasks during those waits.
CPU-bound programs are limited by the processor. The GIL means threads won't run Python code in parallel, so multiprocessing is the way to use multiple cores.
Async tasks only switch at await points, so long stretches of CPU work block the event loop; yield with await asyncio.sleep(0) or move the work to a process pool.
The two approaches combine well: run CPU-heavy work in a ProcessPoolExecutor via run_in_executor while the event loop handles the I/O.
Next steps: Try refactoring a slow I/O-bound script to use async. Measure the speedup. Then identify any CPU-bound bottlenecks and consider multiprocessing for those.
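A tiny helper you might use for that before-and-after measurement (just a sketch using the standard library; run_serial, run_async, and urls are whatever you're benchmarking):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Usage:
# with timed("before"):
#     run_serial(urls)
# with timed("after"):
#     asyncio.run(run_async(urls))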