
Hello, Streaming Systems!

Picture this: 10,000 sensors across a city, each sending temperature, traffic, and air quality data every second. Your dashboard needs to show real-time averages and alert when pollution spikes above safe levels.

Traditional web backends struggle here. Why? Let's find out by building a streaming system from scratch.


The Synchronous Bottleneck

You start with what you know: an HTTP backend. Each sensor POSTs data to your API endpoint. The server processes it, updates the database, and sends back "OK". Simple, right?

Try it yourself. Click "Send Request" rapidly and watch what happens:

[Interactive demo: Backend Service vs Streaming. A Client sends synchronous requests to a Server while a stats panel tracks total requests and status. Synchronous: each request waits for its response.]

Notice how backend requests block. Each sensor must wait for the server to finish before sending the next reading. During peak traffic (rush hour, heatwaves, city events), the queue grows faster than you can process it.

Problem

The blocking problem: When 10,000 sensors send data every second, waiting for each acknowledgment creates a traffic jam. Response times climb from milliseconds to seconds. Your "real-time" dashboard falls minutes behind reality. Critical pollution alerts arrive too late.

The issue isn't your code; it's the synchronous request/response model itself. The client sends, then waits. The server processes, then responds. This coupling creates latency and limits throughput.

Solution

Streaming decouples producers from consumers. Sensors publish data continuously without waiting. Multiple processors consume events in parallel, each at their own pace. The result: consistent low latency even under peak load.


Understanding Queues: The Foundation

Before we build a streaming system, we need one fundamental data structure: the queue.

Think of a queue like a line at a coffee shop. First person in line gets served first. No cutting, no chaos, just orderly, predictable processing.

[Interactive demo: Queue Data Structure (FIFO). Events line up between the front and the back of the queue. FIFO: First In, First Out; the first event added is the first one removed.]

Try adding and removing events. New events always join the back of the line. Processing always takes from the front. Order is preserved: Event 1 gets processed before Event 2, always.

This FIFO (First In, First Out) property matters for streaming systems. Producers can add events at their own pace while consumers process at theirs. Events get processed in the sequence they arrived. And if producers temporarily outpace consumers, the queue absorbs the difference instead of dropping data.

Why not just use arrays?

While arrays can act as queues, distributed streaming systems use specialized queue implementations optimized for:

  • Concurrent access (multiple producers and consumers)
  • Persistence (surviving crashes)
  • Backpressure (signaling when queues get too full)
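Python's standard library actually ships such a queue. Here's a minimal sketch using queue.Queue, which is thread-safe and whose maxsize parameter gives a crude form of backpressure (the event names are illustrative):

```python
import queue

# A bounded, thread-safe FIFO queue: maxsize gives us crude backpressure
events = queue.Queue(maxsize=3)

# Producer side: events join the back of the line
events.put("Event 1")
events.put("Event 2")
events.put("Event 3")

# put() on a full queue would now block -- that blocking IS backpressure:
# the producer is forced to slow down until the consumer catches up.
print(events.full())  # True

# Consumer side: events leave from the front, in arrival order
print(events.get())  # Event 1
print(events.get())  # Event 2
```

Arrays give you FIFO order, but not the blocking-on-full and blocking-on-empty behavior that makes concurrent producers and consumers safe.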

Your First Stream: Producer to Consumer

Now let's see queues in action. A stream is just continuous data flow using queues to connect components.

[Interactive demo: Event Flow Through System. A stream of events flows from a Source to an Operator.]

Press "Start Event Flow" and watch events travel from the Source (producer) through the Stream (queue) to the Operator (consumer).

Each component runs independently. The source generates events continuously. Events queue up in the stream. The operator processes events as fast as it can.

Here's what makes this work: if the operator slows down, events queue up, but the source keeps producing. If the operator speeds up, it drains the queue. Neither component blocks the other.
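You can see this decoupling with plain threads and a queue.Queue, no framework needed. In this sketch (all names are illustrative, not StreamKit APIs), the producer finishes quickly while the slow consumer works through the backlog the queue has absorbed:

```python
import queue
import threading
import time

stream = queue.Queue()
processed = []

def source():
    # Produces 5 events quickly, never waiting on the consumer
    for i in range(5):
        stream.put(f"event-{i}")
        time.sleep(0.01)

def operator():
    # Consumes slowly: the queue absorbs the backlog in the meantime
    for _ in range(5):
        processed.append(stream.get())
        time.sleep(0.05)

producer = threading.Thread(target=source)
consumer = threading.Thread(target=operator)
producer.start()
consumer.start()
producer.join()           # Producer finishes long before the consumer...
backlog = stream.qsize()  # ...so unprocessed events are waiting in the queue
consumer.join()

print(f"backlog after producer finished: {backlog}")
print(processed)  # FIFO order preserved: event-0 ... event-4
```

The producer never blocked on the consumer, yet no event was lost and order was preserved.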


The Five Building Blocks

Every streaming system, from simple scripts to Apache Flink, uses these five concepts:

[Interactive demo: Streaming System Components. Each card shows one component; for example, Source: brings data into the streaming system from external sources (example: ClickReader).]
Click each component to explore its role.

Event

A single piece of data flowing through the system. Events are immutable records that represent something that happened: "Temperature = 72Β°F at 2:35pm" or "User clicked /checkout button".

Source

The entry point. Sources bring data from the outside world (sensor readings, log files, message queues, databases) into your streaming system.

Stream

The connection. A continuous flow of events from one component to another, implemented using queues under the hood.

Operator

The processor. Where your business logic lives. Operators receive events, transform them, aggregate them, filter them, whatever you need.

Job

The complete application. A job connects sources, streams, and operators into a directed graph that processes events continuously.


Building Your First Streaming Job

Let's build something real: a clickstream analytics engine that counts page views by URL in real-time.

We'll use the StreamKit framework, which I built specifically for teaching. It's simplified on purpose: no distributed coordination, no fault tolerance, none of the complexity that makes production frameworks hard to understand. Just the core ideas.

Step 1: Define Your Event

Every streaming system starts by defining what an "event" means. For clickstream analytics, each event represents one page view:

click_event.py
class ClickEvent(Event):
    def __init__(self, url):
        self.url = url  # The page that was viewed

    def get_data(self):
        return self.url  # Return the URL for processing

An event is just a wrapper around your data. Keep it simple, events should be immutable records, not complex objects.
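One way to enforce that immutability in Python is a frozen dataclass. This is a sketch of the idea only (it omits the Event base class, and frozen dataclasses are my suggestion, not part of StreamKit):

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class ClickEvent:
    url: str

    def get_data(self):
        return self.url

event = ClickEvent("/checkout")
print(event.get_data())  # /checkout

# frozen=True makes any mutation raise, keeping the record immutable
try:
    event.url = "/home"
except FrozenInstanceError:
    print("events cannot be modified after creation")
```

Immutable events mean any number of operators can read the same event concurrently without coordination.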

Step 2: Build the Source

A source brings data into the streaming system. Our ClickReader listens on a network port for incoming page view events:

click_reader.py
class ClickReader(Source):
    def __init__(self, name, port):
        super().__init__(name)
        self.reader = self._setup_socket_reader(port)

    def get_events(self, event_collector):
        # Framework calls this method continuously
        url = self.reader.readline().strip()
        event_collector.append(ClickEvent(url))
        print(f"ClickReader --> {url}")

Lifecycle Hooks Explained

The get_events() method is a lifecycle hook: the framework calls it repeatedly in an infinite loop. You write the logic (read data, create events), the framework handles the heavy lifting (threading, queues, scheduling).

This is how all streaming frameworks work: you implement hooks, the framework orchestrates execution.

The source doesn't know or care what happens to events after it produces them. It just keeps reading and emitting.

Step 3: Build the Operator

Operators process events. Our PageViewCounter maintains a running count for each URL:

page_view_counter.py
class PageViewCounter(Operator):
    def __init__(self, name):
        super().__init__(name)
        self.count_map = {}  # URL -> count

    def apply(self, event, collector):
        # Framework calls this for each event
        url = event.get_data()
        count = self.count_map.get(url, 0)
        count += 1
        self.count_map[url] = count

        print("PageViewCounter -->")
        self._print_count_map()

    def _print_count_map(self):
        for url, count in self.count_map.items():
            print(f"  {url}: {count}")

The operator receives one event at a time through the apply() hook. It updates state (the count map) and outputs results. Simple, focused, testable.
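"Testable" is worth demonstrating: because apply() is a plain method, you can drive the operator directly, with no threads or queues. Here's a self-contained sketch that stubs out the Event and Operator base classes (the stubs are assumptions, not the real StreamKit code) and exercises the counter:

```python
# Minimal stand-ins so the example is self-contained (not the real StreamKit classes)
class Event:
    pass

class Operator:
    def __init__(self, name):
        self.name = name

class ClickEvent(Event):
    def __init__(self, url):
        self.url = url
    def get_data(self):
        return self.url

class PageViewCounter(Operator):
    def __init__(self, name):
        super().__init__(name)
        self.count_map = {}
    def apply(self, event, collector):
        url = event.get_data()
        self.count_map[url] = self.count_map.get(url, 0) + 1

# Drive the operator directly -- no threads, no queues, no framework
counter = PageViewCounter("test-counter")
for url in ["/home", "/checkout", "/home"]:
    counter.apply(ClickEvent(url), collector=[])

print(counter.count_map)  # {'/home': 2, '/checkout': 1}
```

This is a good habit for real frameworks too: keep business logic in hooks you can call synchronously from a unit test.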

Step 4: Connect the Components

Now we assemble the job by connecting source β†’ stream β†’ operator:

clickstream_job.py
if __name__ == "__main__":
    # 1. Create the job
    job = Job()

    # 2. Add the source, which returns a stream
    click_stream = job.add_source(ClickReader("click-reader", 9990))

    # 3. Apply the operator to the stream
    click_stream.apply_operator(PageViewCounter("page-view-counter"))

    # 4. Start execution
    starter = JobStarter(job)
    starter.start()

A few lines of configuration create a complete streaming pipeline. The framework handles threading, queues, and scheduling automatically.


See It Run: Live Demo

Try the streaming job yourself. Type URLs into the input terminal and watch the counter update in real-time:

[Interactive demo: Live Streaming Job, Clickstream Analytics. An input terminal (nc -lk 9990) sends click events; a job output terminal (python clickstream_job.py) shows the PageViewCounter waiting for events and printing counts.]

Notice a few things here. Counts update instantly as you type; there's no batching or delay. The operator maintains running totals across all events, keeping state in memory. And the system never stops, just keeps processing events as they arrive.

That's streaming in a nutshell: data flows continuously, processing happens immediately, state accumulates over time.


Under the Hood: How Execution Works

You've seen the job run, typing URLs and watching counts update. But how does the framework actually execute your code?

The answer is simpler than you might think. The StreamKit framework has three moving parts, and they're all just infinite loops pulling from queues:

1. Source Executor: The Producer Loop

Source executors run sources in infinite loops, pulling data from the outside world:

source_executor.py
import threading

class SourceExecutor(threading.Thread):
    def __init__(self, source, outgoing_queue):
        super().__init__()
        self.source = source
        self.outgoing_queue = outgoing_queue

    def run(self):
        while True:  # Infinite loop!
            events = []

            # Call user's lifecycle hook
            self.source.get_events(events)

            # Add events to queue for downstream
            for event in events:
                self.outgoing_queue.put(event)

Your get_events() hook gets called continuously. The executor handles queue management and threading.

2. Operator Executor: The Consumer Loop

Operator executors pull events from incoming queues and apply user logic:

operator_executor.py
import queue
import threading

class OperatorExecutor(threading.Thread):
    def __init__(self, operator, incoming_queue, outgoing_queue):
        super().__init__()
        self.operator = operator
        self.incoming_queue = incoming_queue
        self.outgoing_queue = outgoing_queue

    def run(self):
        while True:  # Infinite loop!
            try:
                # Block until an event arrives
                event = self.incoming_queue.get(timeout=0.1)
                output = []

                # Call user's lifecycle hook
                self.operator.apply(event, output)

                # Pass results downstream
                for e in output:
                    self.outgoing_queue.put(e)
            except queue.Empty:
                continue  # No events, keep looping

Notice the pattern: infinite loop + user hook + queue management. This is the core of every streaming framework.

3. Job Starter: The Orchestrator

The job starter wires everything together:

job_starter.py
class JobStarter:
    def __init__(self, job):
        self.job = job

    def start(self):
        # 1. Create and start source executors
        for source in self.job.get_sources():
            executor = SourceExecutor(source, source.outgoing_queue)
            executor.start()  # Starts the thread

        # 2. Create and start operator executors
        for operator in self.job.get_operators():
            executor = OperatorExecutor(
                operator,
                operator.incoming_queue,
                operator.outgoing_queue
            )
            executor.start()  # Starts the thread

        # 3. Connect queues between components
        self._setup_queues()

Each component runs in its own thread. Queues connect the threads. The framework manages concurrency so you don't have to.
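The whole pattern fits in one self-contained sketch: a source thread, an operator thread, and a queue between them. The names are illustrative (not StreamKit's), and a sentinel value replaces the infinite loops so the example terminates:

```python
import queue
import threading

stream = queue.Queue()
counts = {}
DONE = object()  # Sentinel so this demo terminates (real executors loop forever)

def source_executor():
    # Producer loop: emit a fixed batch of events, then signal completion
    for url in ["/home", "/products", "/home", "/home"]:
        stream.put(url)
    stream.put(DONE)

def operator_executor():
    # Consumer loop: block on the queue, apply the "operator" to each event
    while True:
        event = stream.get()
        if event is DONE:
            break
        counts[event] = counts.get(event, 0) + 1

threads = [threading.Thread(target=source_executor),
           threading.Thread(target=operator_executor)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counts)  # {'/home': 3, '/products': 1}
```

Strip away the class hierarchy and a streaming framework is, at its core, exactly this: threads, queues, and user hooks.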

Watch the Executor Lifecycle

[Interactive demo: Executor Lifecycle. The loop steps: 1) check if there are events available, 2) read event from input port, 3) create Event object, 4) add to outgoing queue, 5) repeat (infinite loop).]

Press "Start" and watch the executor loop through its steps continuously. This is what's happening inside the framework while your job runs.


Tracing an Event's Journey

Let's follow a single event through the entire system. You type /home and press enter. Here's what happens:

  1. Input arrives: The string /home is received by the socket server on port 9990
  2. Source lifecycle triggered: Framework calls ClickReader.get_events() in its infinite loop
  3. Event created: A new ClickEvent("/home") object is instantiated
  4. Queued: Event is added to the source's outgoing queue
  5. Dequeued: Operator executor pulls event from its incoming queue (blocking if empty)
  6. Operator lifecycle triggered: Framework calls PageViewCounter.apply(event, collector)
  7. State updated: Count map becomes {"/home": 1}
  8. Output: Results are printed to the terminal

All of this happens in milliseconds. The system is immediately ready for the next event.


From Theory to Practice

You've now built a complete streaming system. You defined events and components, connected them into a job, watched execution happen in real-time, and understood how the framework works internally.

But this is the simple version. Real streaming systems face harder problems:

Advanced streaming concepts

  • Parallelization: running multiple operator instances to increase throughput.
  • Windowing: grouping events by time (last 5 minutes) or count (every 100 events).
  • Delivery semantics: guaranteeing events are processed exactly once, even across failures.
  • Backpressure: slowing down producers when consumers can't keep up.
  • Stateful computation: checkpointing state so it survives crashes.
  • Event time vs processing time: handling out-of-order events correctly.

We'll cover these in future articles. For now, you understand the foundation. Queues connect components. Executors run infinite loops. Events flow continuously. That's enough to build real systems.


Going to Production

StreamKit is educational. For production systems, use battle-tested frameworks.

Apache Flink handles exactly-once processing, advanced windowing, and has SQL support. Apache Kafka Streams integrates tightly with Kafka and deploys simply. Apache Pulsar does multi-tenancy, geo-replication, and tiered storage. Apache Storm optimizes for low latency and supports many languages.

All of these implement the same concepts you learned (sources, operators, streams, executors) but add fault tolerance, distributed execution, and operational tooling.


Key Takeaways

Traditional backends block on each request. Streaming systems decouple producers from consumers using queues, which enables continuous, low-latency processing.

The building blocks are simple: Events carry data, Sources produce events, Streams connect components via queues, Operators process events, and Jobs tie everything together.

Under the hood, executors run components in infinite loops, calling your lifecycle hooks continuously and managing queues automatically.

Use streaming when you need real-time processing of continuous data: analytics, monitoring, fraud detection, recommendations, IoT sensors, log processing. If you can batch it and wait, batch it and wait. But if milliseconds matter, streaming is how you get there.


Exercises

Want to go deeper? Try these:

  1. Create an operator that only passes events where the URL starts with /products/. Place it between the source and counter.

  2. Modify the counter to track views per URL per minute, resetting counts every 60 seconds.

  3. Add a second source that generates synthetic events automatically. Watch both sources feed the same counter.

  4. Send events from one source to two different operators, one that counts, one that logs to a file. Implement this using two streams.

  5. Make the counter write its state to disk every 10 seconds. On restart, load the state so counts don't reset to zero.

  6. Add a slow operator that takes 5 seconds to process each event. Watch the queue grow. How would you prevent memory overflow?


Next: In the next article, we'll explore parallelization and data grouping, how to run multiple instances of operators to handle higher throughput, and why ordering matters when you do.