Updating Tables

Arroyo supports two semantics for streaming SQL, which we call Dataflow semantics and updating semantics. Both address the core problem of executing a SQL query on an unbounded, streaming source: how do we know when to compute aggregates and joins, given that we will always see more data in the future?

In Dataflow semantics, which are introduced with the use of time-oriented windows like HOP and TUMBLE, we compute aggregates for a window once the watermark passes it. This is a powerful model, as it allows the streaming system to signal completeness to its consumers, so they don't need to reason about when they are able to trust the results. However, the requirement that all computations are windowed can be limiting.
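For example, a windowed aggregate under Dataflow semantics might look like the following sketch (the `events` table and column names are illustrative, not part of any real schema):

```sql
-- Count events per one-minute tumbling window. The result for each
-- window is emitted once the watermark passes the end of that window,
-- so downstream consumers know the count is final.
SELECT tumble(interval '1 minute') AS window,
       count(*) AS event_count
FROM events
GROUP BY 1;
```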

Updating semantics, on the other hand, allow for more flexibility in the types of queries that can be expressed, including most of normal batch SQL. They work by treating the input stream as a table that is constantly being updated, and the output as updates to a materialized view of the query.
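A minimal sketch of an updating query, assuming an illustrative `events` table with a `user_id` column:

```sql
-- With no time window, this query can never "complete." Instead, Arroyo
-- maintains a running count per user and emits an update to the
-- materialized view each time a new event arrives.
SELECT user_id, count(*) AS total
FROM events
GROUP BY user_id;
```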

When writing to a sink, the output is a stream of updates (in Debezium format), representing additions, updates, and deletes to the materialized view.
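As a rough illustration of that format, a single update to the view is encoded as a Debezium-style envelope with `before` and `after` states and an `op` code (`c` for create, `u` for update, `d` for delete); the field values here are made up:

```json
{
  "before": { "user_id": 42, "total": 2 },
  "after":  { "user_id": 42, "total": 3 },
  "op": "u"
}
```

An insert carries `"before": null`, and a delete carries `"after": null`.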

Source connectors such as Kafka can specify the format as `debezium_json` to read Debezium-formatted messages.

Updating sources need at least one primary key, which tells Arroyo which rows are logically the same. The primary key is specified in the DDL, like this:

CREATE TABLE debezium_source (
id INT PRIMARY KEY,