Releases: ArroyoSystems/arroyo
v0.5.0
Arroyo 0.5 brings a number of new features and improvements to the Arroyo platform. The biggest of these is the new FileSystem connector, which is a high-performance, transactional sink for writing data to filesystems and object stores like S3. This allows Arroyo to write into data lakes and data warehouses. We've also added exactly-once support for Kafka sinks, a new Kinesis connector, expanded our SQL support, and made a number of improvements to the Web UI and REST API.
Read on for more details, and check out our docs for full details on existing and new features.
Thanks to all our contributors for this release:
Features
FileSystem connector
Columnar files (like Parquet) on S3 have become the de facto standard for storing data at rest, combining low cost of storage with decent query performance. Modern query engines like Trino, ClickHouse, and DuckDB can operate directly on these files, as can many data warehouses like Snowflake and Redshift.
And with the new FileSystem connector, Arroyo can efficiently perform real-time ETL into these S3-backed systems.
The FileSystem connector is a high-performance, transactional sink for writing data (as Parquet or JSON files) to file systems and object stores like S3.
It's deeply integrated with Arroyo's checkpoint system for exactly-once processing. This means that even if a machine is lost or a job is restarted, the data written to S3 will be consistent and correct. Unlike other systems like Flink, it's even able to perform consistent checkpointing while in the process of writing a single Parquet file. This means that you can write larger files for better query performance while still performing frequent checkpoints.
Look out for a blog post in the near future with more details on how all of this works.
FileSystem sinks can be created in SQL via a CREATE TABLE statement like this:
CREATE TABLE bids (
  time timestamp,
  auction bigint,
  bidder bigint,
  price bigint
) WITH (
  connector = 'filesystem',
  path = 'https://s3.us-west-2.amazonaws.com/demo/s3-uri',
  format = 'parquet',
  parquet_compression = 'zstd',
  rollover_seconds = '60'
);
See the docs for all of the details and available options.
- Commits and parquet file sink by @jacksonrnewhouse in #197
- Add Parquet as a serialization format throughout by @jacksonrnewhouse in #216
Exactly-once Kafka sink
Arroyo has always supported exactly-once processing when reading from Kafka by integrating offset-tracking with its checkpoint system. In 0.5 we're adding exactly-once support for writing to Kafka as well. This enables end-to-end exactly-once processing when integrating with other systems via Kafka.
Exactly-once processing is achieved by leveraging Kafka's transactional API. When processing starts, Arroyo will begin a transaction which is used for all writes.
Once a checkpoint is completed successfully, the transaction is committed, allowing consumers to read the records. This ensures that records are only read once, even if a failure occurs.
If a failure does occur, the transaction will be rolled back and processing will restart from the last checkpoint.
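The commit-on-checkpoint behavior described above can be illustrated with a small in-memory model. This is only a sketch of the idea, not Arroyo's implementation or the Kafka client API: the TransactionalSink type and its method names are hypothetical, and a real sink would delegate to Kafka's transactional producer.

```rust
/// In-memory stand-in for a transactional topic.
/// Hypothetical type for illustration; not Arroyo's sink or the Kafka client API.
#[derive(Default)]
struct TransactionalSink {
    /// Records visible to consumers that read only committed data.
    committed: Vec<String>,
    /// Records written inside the currently open transaction.
    pending: Vec<String>,
}

impl TransactionalSink {
    /// Every write goes into the open transaction and is not yet visible.
    fn write(&mut self, record: &str) {
        self.pending.push(record.to_string());
    }

    /// Called once a checkpoint completes: commit the open transaction,
    /// making its records visible, and implicitly start a new one.
    fn checkpoint_completed(&mut self) {
        self.committed.append(&mut self.pending);
    }

    /// Called on failure: roll back the open transaction. Processing then
    /// restarts from the last checkpoint and re-produces the lost records.
    fn fail_and_restore(&mut self) {
        self.pending.clear();
    }
}
```

In this model, a record written after the last checkpoint and then lost to a failure is simply written again after the restart, so a committed reader sees it exactly once.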
Exactly-once Kafka sinks can be created in SQL via a CREATE TABLE statement by setting the new 'sink.commit_mode' = 'exactly_once' option, for example:
CREATE TABLE sink (
  time TIMESTAMP,
  user_id TEXT,
  count INT
) WITH (
  connector = 'kafka',
  topic = 'results',
  bootstrap_servers = 'localhost:9092',
  type = 'sink',
  format = 'json',
  'sink.commit_mode' = 'exactly_once'
);
There is also now a corresponding source.read_mode option for Kafka sources, which can be set to read_committed to read only committed records produced by a transactional producer.
See the Kafka connector docs for more details.
- implement exactly-once commits to Kafka sinks and read_committed reads to Kafka sources by @jacksonrnewhouse in #218
Kinesis connector
Arroyo now supports reading from and writing to AWS Kinesis data streams via the new Kinesis connector. Like the existing Kafka connector, the Kinesis connector supports exactly-once processing of records.
Kinesis sources and sinks can be created in the Web UI or via SQL, for example:
CREATE TABLE kinesis_source (
  time TIMESTAMP,
  user_id TEXT,
  count INT
) WITH (
  connector = 'kinesis',
  stream_name = 'my-source',
  type = 'source',
  format = 'json'
);
CREATE TABLE kinesis_sink (
  time TIMESTAMP,
  user_id TEXT,
  count INT
) WITH (
  connector = 'kinesis',
  stream_name = 'my-sink',
  type = 'sink',
  format = 'json'
);
INSERT INTO kinesis_sink
SELECT * FROM kinesis_source;
See the Kinesis connector docs for all the available options.
- Add Kinesis Source and Sink by @jacksonrnewhouse in #234
Postgres sink via Debezium
Arroyo now supports writing to relational databases (including Postgres and MySQL) via Debezium.
As part of this work, we've added support for embedding JSON schemas in outputs in Kafka Connect format. This allows integration with Kafka Connect connectors that, like Debezium, require a schema.
See the Postgres connector docs for the details.
We've also improved our format system to allow for more control over how data is serialized and deserialized, for example allowing for custom date and timestamp formats. Refer to the new format docs.
Session windows
Arroyo 0.5 adds support for session windows.
Unlike sliding and tumbling windows which divide time up into fixed intervals, session windows are defined by a gap in time between records. This is often useful for determining when some period of activity has finished and can be analyzed.
For example, let's take a query over user events on an ecommerce site. A user may arrive on the site, browse around, add some items to their cart, then disappear. A day later they may return and complete their purchase. With session windows we can independently (and efficiently) analyze each of these sessions.
We can create a session window using the session function, which takes the gap time as its argument:
SELECT
session(INTERVAL '1 hour') as window,
user_id,
count(*)
FROM clickstream
GROUP BY window, user_id;
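To make the gap-based definition concrete, here is a minimal sketch of how one user's events could be grouped into sessions. It is an illustration of the concept only, not Arroyo's implementation; the Session type and sessionize function are hypothetical, and treating a gap exactly equal to the limit as part of the same session is an assumption.

```rust
/// One session of activity for a single user (hypothetical type).
#[derive(Debug, PartialEq)]
struct Session {
    start: u64,   // timestamp of the first event, in seconds
    end: u64,     // timestamp of the last event, in seconds
    count: usize, // number of events in the session
}

/// Groups one user's event timestamps into sessions. A new session starts
/// whenever the time since the previous event exceeds `gap_secs`.
fn sessionize(mut timestamps: Vec<u64>, gap_secs: u64) -> Vec<Session> {
    // Sort so that each event is compared against its predecessor in time.
    timestamps.sort_unstable();
    let mut sessions: Vec<Session> = Vec::new();
    for ts in timestamps {
        if let Some(last) = sessions.last_mut() {
            // Within the gap: extend the current session.
            if ts - last.end <= gap_secs {
                last.end = ts;
                last.count += 1;
                continue;
            }
        }
        // First event, or the gap was exceeded: open a new session.
        sessions.push(Session { start: ts, end: ts, count: 1 });
    }
    sessions
}
```

With a one-hour gap (3600 seconds), events at 0, 1000, and 2000 seconds fall into one session, while events at 10000 and 10500 seconds form a second one.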
Idle watermarks
Partitioned sources (like Kafka or Kinesis) may experience periods when some partitions are active but others are idle due to the way that they are keyed. This can cause delayed processing due to how watermarks are calculated: as the minimum of the watermarks of all partitions.
If some partitions are idle, the watermark will not advance, and queries that depend on it will not make progress. To address this, sources now support a concept of idleness, which allows them to mark partitions as idle after a period of inactivity. Idle partitions, meanwhile, are ignored for the purpose of calculating watermarks and so allow queries to advance.
Idleness is now enabled by default for all sources with a period of 5 minutes. It can be configured when creating a source in SQL by setting the idle_micros option, or disabled by setting it to -1.
A special case of idleness occurs when there are more Arroyo source tasks than partitions (for example, a Kafka topic with 4 partitions read by 8 Arroyo tasks). This means that some tasks will never receive data, and so will never advance their watermarks. This can occur as well for non-partitioned sources like WebSocket, where only a single task is able to read data. Now sources will immediately set inactive tasks to idle.
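The effect of idleness on the watermark calculation can be sketched as follows. This is a simplified model under stated assumptions rather than Arroyo's actual code: the PartitionWatermark type and combined_watermark function are hypothetical, and returning None when every partition is idle is a choice made for this example.

```rust
/// Watermark state for one source partition or task (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum PartitionWatermark {
    /// The partition is active; holds its latest watermark in microseconds.
    Active(u64),
    /// The partition has been quiet for longer than the idle period.
    Idle,
}

/// Combined watermark: the minimum over all active partitions.
/// Idle partitions are skipped so they cannot hold the watermark back.
/// Returns None if every partition is idle.
fn combined_watermark(partitions: &[PartitionWatermark]) -> Option<u64> {
    partitions
        .iter()
        .filter_map(|p| match p {
            PartitionWatermark::Active(ts) => Some(*ts),
            PartitionWatermark::Idle => None,
        })
        .min()
}
```

For two active partitions with watermarks 100 and 40 the result is 40; once the slower partition is marked idle, the result advances to 100.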
REST API
Continuing the work started in 0.4, we are migrating our API from gRPC to REST. This release includes a number of new endpoints, and the REST API can now be used to fully manage pipelines and jobs.
For example, let's walk through creating a new pipeline:
curl http://localhost:8000/api/v1/pipelines \
-X POST -H "Content-Type: application/json" \
--data @- << EOF
{
"name": "my_pipeline",
"parallelism": 1,
"query": "
CREATE TABLE impulse (
counter BIGINT UNSIGNED NOT NULL,
subtask_index BIGINT UNSIGNED NOT NULL
)
WITH (
connector = 'impulse',
event_rate = '100'
);
SELECT * from impulse;",
"udfs": []
}
EOF
{
"id": "pl_W2UjDI6Iud",
"name": "my_pipeline",
"stop": "none",
"createdAt": 1692054789252281,
...
}
Each pipeline has one or more jobs, whic...
v0.4.1
v0.4.0
Overview
Arroyo 0.4 brings some big new features like update tables, Debezium support, and a major redesign of the connectors system that makes it much easier to build new connectors. Leveraging that, we've added websocket and fluvio connectors. We're also releasing the initial endpoints for our new REST API, which makes it easier to build automations around Arroyo.
Read on for more details, and check out our docs for full details on existing and new features.
Thanks to all our contributors for this release:
What's next
With 0.4 out, we're already looking ahead to Arroyo 0.5, to be released in early August. The headline feature of 0.5 will be the new Filesystem connector, which will support high throughput, transactional writes from Arroyo into data warehouses and data lakes backed by object stores like S3. We'll also be finishing the transition to the new REST API, adding Redis and Kinesis connectors, and adding a transactional Kafka sink. On the SQL side we'll be working on session windows and support for joining on external tables.
Anything else you'd like to see? Let us know on Discord!
Now on to the release notes.
Features
Update Tables
Arroyo 0.4 brings support for update tables. Exactly what that means is a bit complicated (and we'll dive into it below) but the short version is that you can now use Arroyo to efficiently read and write data from databases like Postgres and MySQL via Debezium, and many queries that were previously unsupported are now supported.
So what are update tables? Let's talk through the semantics of Arroyo tables today, which we'll call append tables going forward.
Take this query:
SELECT store_id, status from orders;
which produces this output stream:
Time store status
7/10/23, 11:34:34 AM PDT 1142 "accepted"
7/10/23, 11:34:34 AM PDT 1737 "accepted"
7/10/23, 11:34:34 AM PDT 1149 "accepted"
This query will output one row for every record that comes in on the orders stream (let's say that's a Kafka topic that receives every order). You can think of this as modeling a virtual table with three columns (time, store, and status). Each new order that comes in produces a new row in that table, or in other words is appended.
But what if we have a query that needs other operations besides appends? For example, consider this query:
SELECT store_id, count(*) AS count
FROM orders
GROUP BY store_id;
which models a table with one row per store. When a new order comes in, we may append a new row if it's a new store, or we may need to update an existing row if we've already seen that store. In other words, we need to support updates.
In Arroyo 0.3 that query is not supported, but in 0.4 it will produce an update stream that looks like this:
Time previous current op
7/10/23, 4:03:42 PM PDT { "orders_store_id": 3, "count": 1 } { "orders_store_id": 3, "count": 2 } "u"
7/10/23, 4:03:40 PM PDT null { "orders_store_id": 1, "count": 1 } "c"
7/10/23, 4:03:40 PM PDT null { "orders_store_id": 3, "count": 1 } "c"
Each output records an update of some kind, either a [c]reate, [u]pdate, or [d]elete. This stream can be used directly, or it can be used to materialize the output into another database like Postgres or MySQL via Debezium, which natively supports this kind of update stream.
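As a rough illustration of how a grouped count turns into such an update stream, the sketch below replays a list of store ids and emits a create or update record for each one. It is not Arroyo's implementation; the Update type and count_updates function are hypothetical names, and deletes are left out because a running count never removes a key.

```rust
use std::collections::HashMap;

/// One record of the update stream (hypothetical type).
/// Rows are modeled as (store_id, count) pairs.
#[derive(Debug, PartialEq)]
struct Update {
    previous: Option<(u64, u64)>, // row before the change, None for a create
    current: (u64, u64),          // row after the change
    op: char,                     // 'c' for create, 'u' for update
}

/// Replays incoming orders (by store id) and emits one update per order.
fn count_updates(store_ids: &[u64]) -> Vec<Update> {
    let mut counts: HashMap<u64, u64> = HashMap::new();
    let mut out = Vec::new();
    for &store_id in store_ids {
        let count = counts.entry(store_id).or_insert(0);
        // A count of zero means the key has not been seen before.
        let previous = if *count == 0 { None } else { Some((store_id, *count)) };
        *count += 1;
        let op = if previous.is_none() { 'c' } else { 'u' };
        out.push(Update { previous, current: (store_id, *count), op });
    }
    out
}
```

Feeding in orders for stores 3, 1, and 3 yields two creates followed by an update of store 3 from a count of 1 to 2, the same shape as the stream shown above.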
Update tables can also be used with Debezium to write to Arroyo from a SQL database CDC source. See the new Debezium tutorial for more details on how to set this up.
Beyond use with Debezium, update tables can also be very useful for efficiently implementing queries where it's important to know when some key enters or leaves a set. For example, for a fraud detection system you may have a set of rules that indicate possibly-fraudulent activity, like this query which looks for sites with suspiciously high click-through rates:
SELECT site as suspicious_site
FROM (
SELECT site, clicks / impressions as click_through_rate
FROM (SELECT site,
SUM(CASE
WHEN imp_type = 'click' THEN 1 ELSE 0 END) as clicks,
SUM(CASE
WHEN imp_type = 'impression' THEN 1 ELSE 0 END) as impressions
FROM event_stream
GROUP BY 1)
) WHERE click_through_rate > 0.02;
This query will produce a record with "op": "c" whenever a site first exceeds the threshold, and "op": "d" whenever a site falls below the threshold.
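The enter/leave behavior can be modeled with a set that tracks which keys currently satisfy the filter. The sketch below is a conceptual illustration only, not how Arroyo plans this query; membership_ops is a hypothetical function that takes a sequence of (site, current click-through rate) observations.

```rust
use std::collections::HashSet;

/// Emits ('c', site) when a site's rate first exceeds `threshold`, and
/// ('d', site) when a site that was in the set no longer exceeds it.
/// Observations that do not change membership emit nothing.
fn membership_ops(observations: &[(&str, f64)], threshold: f64) -> Vec<(char, String)> {
    let mut suspicious: HashSet<String> = HashSet::new();
    let mut ops = Vec::new();
    for &(site, rate) in observations {
        let over = rate > threshold;
        if over && suspicious.insert(site.to_string()) {
            // insert returns true only if the site was not already present.
            ops.push(('c', site.to_string()));
        } else if !over && suspicious.remove(site) {
            // remove returns true only if the site was present.
            ops.push(('d', site.to_string()));
        }
    }
    ops
}
```

A site whose rate goes 0.01, 0.05, 0.06, 0.01 against a threshold of 0.02 produces exactly one create and one delete.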
- Updating SQL Queries by @jacksonrnewhouse in #138
Connector redesign
Connectors integrate Arroyo with external systems. They implement sources that read data from external systems and sinks that write data to external systems.
Arroyo 0.4 brings a major redesign of the connectors system, making it much easier to build new connectors. In previous releases of Arroyo, connectors were deeply integrated with the various Arroyo sub-systems (the console, api, database, sql planner, compiler, etc.) and adding or modifying a connector required changes to all of those systems.
In 0.4, connector implementations are cleanly separated out into the new arroyo-connectors crate. New connectors can be created by implementing a simple trait.
This redesign has allowed us to add a number of new connectors in 0.4 (detailed below), and will accelerate our connector development going forward.
We've also revamped the UI experience around creating sources and sinks, which are now jointly managed in the new Connections tab in the console. This provides a more straightforward experience for creating and managing connections.
Finally, DDL for creating sources and sinks has also been updated to be more consistent and easier to use. For example, a Kafka source can be created with the following SQL:
CREATE TABLE orders (
customer_id INT,
order_id INT
) WITH (
connector = 'kafka',
format = 'json',
bootstrap_servers = 'broker-1.cluster:9092,broker-2.cluster:9092',
topic = 'order_topic',
type = 'source',
'source.offset' = 'earliest'
);
New connectors
Arroyo 0.4 includes a number of new connectors leveraging the connector redesign. See the connector docs for the full list of supported connectors.
Websocket sources
Arroyo 0.4 adds a new Websocket source, which allows Arroyo to read data from the many available websocket APIs.
For example, Coinbase provides a websocket API that streams the full orderbook for various cryptocurrencies. We can use the new Websocket source to stream that data into Arroyo, and perform real-time analytics on it.
As a simple example, this query computes the average price of Bitcoin in USD over the last minute:
CREATE TABLE coinbase (
type TEXT,
price TEXT
) WITH (
connector = 'websocket',
endpoint = 'wss://ws-feed.exchange.coinbase.com',
subscription_message = '{
"type": "subscribe",
"product_ids": [
"BTC-USD"
],
"channels": ["ticker"]
}',
format = 'json'
);
SELECT avg(CAST(price as FLOAT)) from coinbase
WHERE type = 'ticker'
GROUP BY hop(interval '5' second, interval '1 minute');
Fluvio source/sink
Arroyo 0.4 adds a new Fluvio source and sink, which allow Arroyo to read data from and write data to Fluvio, a high-performance distributed streaming platform built on top of Rust and Kubernetes.
Fluvio has support for simple, stateless processing, but with Arroyo it can be extended to perform complex, stateful processing and analytics.
REST API
Today Arroyo is primarily used through the web console, which is great for individual users and small teams. But for more advanced use cases and larger orgs it's important to build automation and integrate Arroyo with internal infrastructure.
Arroyo has always provided a gRPC API that controls all aspects of the system. This is the API that powers the console. But gRPC can be difficult to work with, and it isn't widely supported by existing tools and libraries. We also haven't treated the gRPC API as a stable interface and have made regular breaking changes.
So with this release, we're starting the process of migrating the API to REST, and making it a first-class, stable interface for Arroyo. Arroyo 0.4 adds the first REST endpoints that support pipeline creation, management, and inspection. For example, a SQL pipeline can be created with the following curl command:
curl -XPOST http://localhost:8003/v1/pipelines \
  -H "Content-Type: application/json" \
  -d '{
    "name": "orders",
    "query": "SELECT * FROM orders;",
    "udfs": [],
    "parallelism": 1
  }'
See the [REST API ...
v0.3.0
We're thrilled to announce the 0.3.0 release of Arroyo, our second minor release as an open-source project. Arroyo is a state-of-the-art stream processing engine designed to allow anyone to build complex, stateful real-time data pipelines with SQL.
Overview
The Arroyo 0.3 release focused on improving the flexibility of the system and completeness of SQL support, with the MVP for UDF support, DDL statements, and custom event time and watermarks. There have also been many substantial improvements to the Web UI, including error reporting, backpressure monitoring, and under-the-hood infrastructure improvements.
We've also greatly expanded our docs since the last release. Check them out at https://doc.arroyo.dev.
New contributors
We are excited to welcome three new contributors to the project with this release:
- @haoxins made their first contribution in #100
- @edmondop made their first contribution in #122
- @chenquan made their first contribution in #147
Thanks to all new and existing contributors!
What's next
Looking forward to the 0.4 release, we have a lot of exciting changes planned. We're adding the ability to create updating tables with native support for Debezium, allowing users to connect Arroyo to relational databases like MySQL and Postgres. Other planned features include external joins, session windows, and Delta Lake integration.
Excited to be part of the future of stream processing? Come chat with the team on our Discord, check out a starter issue and submit a PR, and let us know what you'd like to see next in Arroyo!
Features
UDFs
With this release we are shipping initial support for writing user-defined functions (UDFs) in Rust, allowing users to extend SQL with custom business logic. See the UDF docs for full details.
For example, we can register a Rust function:
// Returns the great-circle distance (in kilometers) between two coordinates
// given in degrees, using the haversine formula
fn gcd(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    // Mean radius of the Earth in kilometers
    let radius = 6371.0;
    let dlat = (lat2 - lat1).to_radians();
    let dlon = (lon2 - lon1).to_radians();
    let a = (dlat / 2.0).sin().powi(2) +
        lat1.to_radians().cos() *
        lat2.to_radians().cos() *
        (dlon / 2.0).sin().powi(2);
    let c = 2.0 * a.sqrt().atan2((1.0 - a).sqrt());
    radius * c
}
and call it from SQL:
SELECT gcd(src_lat, src_long, dst_lat, dst_long)
FROM orders;
SQL DDL statements
It's now possible to define sources and sinks directly in SQL via CREATE TABLE statements:
CREATE TABLE orders (
customer_id INT,
order_id INT,
date_string TEXT
) WITH (
connection = 'my_kafka',
topic = 'order_topic',
serialization_mode = 'json'
);
These tables can then be selected from and inserted into to read and write from those systems. For example, we can duplicate the orders topic by inserting from it into a new table:
CREATE TABLE orders_copy (
  customer_id INT,
  order_id INT,
  date_string TEXT
) WITH (
  connection = 'my_kafka',
  -- a separate destination topic (the name is illustrative)
  topic = 'order_topic_copy',
  serialization_mode = 'json'
);
INSERT INTO orders_copy SELECT * FROM orders;
In addition to connection tables, this release also adds support for views and virtual tables, which are helpful for splitting up complex queries into smaller components.
- Feature/inline create table by @jacksonrnewhouse in #101
- Rework sources and sinks to allow for creating tables/views in SQL queries by @jacksonrnewhouse in #107
Custom event time and watermarks
Arroyo now supports custom event time fields and watermarks, allowing users to define their own event time fields and watermarks based on the data in their streams.
When creating a connection table in SQL, it is now possible to define a virtual field generated from the data in the stream and then assign that to be the event time. We can then generate a watermark from that event time field as well.
A complete example looks like this:
CREATE TABLE orders (
customer_id INT,
order_id INT,
date_string TEXT,
event_time TIMESTAMP GENERATED ALWAYS AS (CAST(date_string as TIMESTAMP)),
watermark TIMESTAMP GENERATED ALWAYS AS (event_time - INTERVAL '15' SECOND)
) WITH (
connection = 'my_kafka',
topic = 'order_topic',
serialization_mode = 'json',
event_time_field = 'event_time',
watermark_field = 'watermark'
);
For more on the underlying concepts of event times and watermarks, see the concept docs.
- Support virtual fields and overriding timestamp via event_time_field by @jacksonrnewhouse in #127
- Add ability to configure watermark by specifying a specific override column by @jacksonrnewhouse in #142
Additional SQL features
Beyond UDFs and DDL statements, we have continued to expand the completeness of our SQL support with the addition of CASE statements and regex functions:
- Allow filters on join computations by @jacksonrnewhouse in #131
- Implement CASE statements by @jacksonrnewhouse in #146
- Adding support for regex_replace and regex_match by @edmondop in #122
- Rework top N window functions by @jacksonrnewhouse in #136
Server-Sent Events source
We've added a new source which allows reading from Server-Sent Events APIs (also called EventSource). SSE is a simple protocol for streaming data from HTTP servers and is a common choice for web applications. See the SSE source documentation for more details, and take a look at the new Mastodon trends tutorial that makes use of it.
- Add event source source operator by @mwylde in #106
- Add HTTP connections and add support for event source tables in SQL by @mwylde in #119
Web UI
This release has seen a ton of improvements to the web UI.
- Show SQL names instead of primitive types in catalog by @jbeisen in #84
- Add backpressure metric by @jbeisen in #109
- Add backpressure graph and color pipeline nodes by @jbeisen in #110
- Add page not found page by @jbeisen in #130
- Use SWR for fetching data for job details page by @jbeisen in #129
- Show operator checkpoint sizes by @jbeisen in #139
- Write eventsource and kafka source errors to db by @jbeisen in #140
- Add Errors tab to job details page by @jbeisen in #149
Improvements
- Improvements to Kafka consumer/producer reliability and correctness by @mwylde in #132
- Implement full_pipeline_codegen proc macro to test pipeline codegen by @jacksonrnewhouse in #135
- Bump datafusion to 25.0.0 by @jacksonrnewhouse in #145
- Add docker.yaml to build and push docker images. by @jacksonrnewhouse in #150
- Add basic end-to-end integration test by @mwylde in #108
- Add event tracking by @mwylde in #144
- Helm: Create service account for Postgres deployment by @haoxins in #100
- Enforce prettier and eslint in the github pipeline by @jbeisen in #120
- Check formatting on PRs by[@jacksonrnewhouse](http...
v0.2.0
Arroyo 0.2.0
Arroyo is a new, state-of-the-art stream processing engine that makes it easy to build complex real-time data pipelines with SQL. This release marks our first versioned release of Arroyo since we open-sourced the engine in April.
We're excited to welcome three new contributors to the project:
- @rtyler made their first contribution in #8
- @akennedy4155 made their first contribution in #49
- @jbeisen made their first contribution in #77
With the 0.2.0 release, we are continuing to push forward on features, stability, and productionization. We’ve added native Kubernetes support and easy deployment via a Helm chart, expanded our SQL support with features like JSON functions and windowless joins, and made many more fixes and improvements detailed below.
Looking forward to the 0.3.0 release, we will continue to improve our SQL support with the ability to create sources and sinks directly as SQL tables, views, UDFs and external joins. We will also be adding a native Pulsar connector and making continued improvements in performance and reliability.
Excited to be part of the future of stream processing? Come chat with the team on our Discord, check out a starter issue and submit a PR, and let us know what you’d like to see next in Arroyo!
Features
Native Kubernetes support
As of release 0.2.0, Arroyo can natively target Kubernetes as a scheduler for running pipelines. We now also support easily running the Arroyo control plane on Kubernetes using our new Helm chart.
Getting started is as easy as
$ helm repo add arroyo https://arroyosystems.github.io/helm-repo
$ helm install arroyo arroyo/arroyo \
--set s3.bucket=my-bucket,s3.region=us-east-1
See the docs for all the details.
Nomad deployments
Arroyo has long had first-class support for Nomad as a scheduler, where we take advantage of the very low-latency and lightweight scheduling support. Now we also support Nomad as an easy deploy target for the control plane as well via a nomad pack.
See the docs for more details.
SQL features
With this release we are making big improvements in SQL completeness. Notably, we’ve made our JSON support much more flexible with the introduction of SQL JSON functions including get_json_objects, get_first_json_object, and extract_json_string.
We’ve also added support for windowless joins.
Here are some of the highlights:
- Initial JSON functions and raw Kafka Source by @jacksonrnewhouse in #86
- Windowless Joins by @jacksonrnewhouse in #61
- String functions by @jacksonrnewhouse in #17
- Hashing Functions by @akennedy4155 in #49
- Casting between numeric types and strings by @jacksonrnewhouse in #5
- Casting timestamps to text by @jacksonrnewhouse in #32
- String Concat Operator || in SQL by @akennedy4155 in #55
- Add COALESCE, NULLIF, MAKE_ARRAY by @jacksonrnewhouse in #89
Connectors, Web UI, and platform support
Arroyo now supports SASL authentication for Kafka connections and adds FreeBSD support:
- Add FreeBSD support by @rtyler in #8, #19
- SASL authentication support to kafka connections by @jacksonrnewhouse in #20
- Add support for changing pipeline parallelism in the Web UI by @jbeisen in #77
Fixes
- Fix filter on partition_by parsing. by @jacksonrnewhouse in #27
- Make parquet state management more reliable by @jacksonrnewhouse in #23
- Fix the quoting of types in the sql package by @jacksonrnewhouse in #64
Improvements
- SQL macro testing by @jacksonrnewhouse in #10
- Add a SQL IR and factor out optimizations by @jacksonrnewhouse in #80
- Multi-arch builds for Docker by @jacksonrnewhouse in #11
- Prometheus and pushgateway in the docker image for working metrics by @mwylde in #16
- Bump datafusion to 23.0, arrow to 37.0 by @jacksonrnewhouse in #92
- Run compiler service locally, compile in debug mode if DEBUG is set by @jacksonrnewhouse in #83
- Replace shelling out to rustfmt with prettyplease by @jacksonrnewhouse in #87
See the full changelog: https://github.com/ArroyoSystems/arroyo/commits/release-0.2.0