duckdb-behavioral

ClickHouse-style behavioral analytics, inside DuckDB

Rust · MIT license

Highlights

  • Seven functions covering sessionization, retention, funnels, and sequence analysis
  • Sessionize at ~830M elements/sec on a Ryzen 7840U (single core, reproducible in one `cargo bench` command, source linked below)
  • NFA-based pattern matcher supporting 2-32 boolean conditions (ClickHouse parity)
  • Bitmask event storage and O(1) combine operations
  • Mutation-tested in CI, installable from DuckDB's community extensions repository

Background

Behavioral analytics, the primitives that power product-analytics dashboards, growth experiments, and user-journey instrumentation, is a workload ClickHouse has historically owned. If you wanted windowFunnel, retention, or sequenceMatch in SQL, your choices were a dedicated columnar cluster, a cloud analytical warehouse, or hand-writing the logic in application code. None of those options fit a world where DuckDB has become the default way to run analytics wherever the data actually lives.

duckdb-behavioral is a community extension that ports the ClickHouse behavioral-analytics surface into DuckDB, with API parity where parity makes sense and deliberate divergence where DuckDB’s architecture offers a better fit.

The function set

Seven operators, each with a specific purpose:

  • sessionize assigns session IDs based on inactivity gaps, operating as a window function over event streams.
  • retention computes cohort retention matrices across arbitrary time periods.
  • window_funnel tracks user progress through ordered conversion steps within a time window.
  • sequence_match runs NFA-based pattern matching across event sequences.
  • sequence_count counts non-overlapping pattern occurrences.
  • sequence_match_events returns timestamps of matched pattern steps.
  • sequence_next_node identifies what happens immediately before or after a pattern match.

The NFA pattern matcher supports 2-32 boolean conditions per pattern, matching ClickHouse's ceiling, and the compile step is deterministic, so the same query plans the same way twice.
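The 2-32 ceiling falls naturally out of packing each event's satisfied conditions into the bits of a u32. As a hedged sketch of the idea (illustrative only, much simpler than the extension's NFA): tag each event with a bitmask of conditions, then scan for the pattern's conditions in order.

```rust
/// Illustrative bitmask-based sequence match (not the extension's NFA):
/// each event carries a u32 bitmask of which boolean conditions it
/// satisfies, and a pattern like (?1)(?2)(?3) matches if events satisfying
/// conditions 1, 2, 3 appear in that order. Greedy scanning is sufficient
/// for a plain ordered-subsequence pattern like this.
fn sequence_match(event_masks: &[u32], pattern: &[u8]) -> bool {
    let mut next = 0; // index of the next pattern step to satisfy
    for &mask in event_masks {
        if next < pattern.len() && mask & (1u32 << pattern[next]) != 0 {
            next += 1; // this event satisfies the current step: advance
        }
    }
    next == pattern.len()
}

fn main() {
    // Events: view (cond 0), add-to-cart (cond 1), view, purchase (cond 2)
    let events = [0b001, 0b010, 0b001, 0b100];
    // view → add-to-cart → purchase, in order
    println!("{}", sequence_match(&events, &[0, 1, 2])); // prints "true"
}
```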

Performance: reproduce it in one command

The headline number is ~830M elements/sec on a single core of an AMD Ryzen 7 7840U (32 GB RAM, Ubuntu 24.04, rustc stable at the time of the last benchmark run). That's roughly one billion events every 1.2 seconds per core, and it scales close to linearly across cores because the combine phase is O(1).

You don’t have to take my word for any of it. The recipe is:

git clone https://github.com/tomtom215/duckdb-behavioral
cd duckdb-behavioral
cargo bench --bench sessionize

Benchmark source: benches/sessionize.rs. The harness uses criterion, emits its own methodology header (input distribution, gap parameter, iteration count), and writes HTML reports to target/criterion/ so you can compare runs across commits. If you see a different number on different hardware, that’s the point: I’d rather have you reproduce the shape of the number than believe a single one I quoted.

A few design choices combine to get there. Bitmask event storage keeps the hot paths branch-free, so membership in a pattern becomes a bit-test rather than a string comparison or hash lookup. O(1) combine operations let parallel aggregation scale linearly with cores, because DuckDB’s parallel executor can shard the input freely and the combine phase doesn’t become a bottleneck. And NFA compilation is deferred until the pattern is first matched, so query-plan rebinds stay cheap.
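Here is a sketch of what O(1) combine means in practice, assuming a funnel-style partial state (my own illustrative struct, not the extension's actual aggregate): each parallel shard reduces its slice of events to a fixed-size state, so merging two shards' states costs the same regardless of how many events each one scanned.

```rust
/// Illustrative fixed-size partial-aggregate state: counts of events that
/// reached each of up to 32 funnel steps. Because the state size does not
/// depend on how many events a shard processed, combining two states is
/// O(1) in the input size.
#[derive(Clone, Copy, Debug)]
struct FunnelState {
    reached: [u64; 32], // reached[i] = events that hit step i in this shard
}

impl FunnelState {
    fn new() -> Self {
        FunnelState { reached: [0; 32] }
    }

    fn update(&mut self, step: usize) {
        self.reached[step] += 1; // per-event work during the parallel scan
    }

    fn combine(mut self, other: &FunnelState) -> FunnelState {
        // Constant work: 32 additions, independent of input cardinality.
        for i in 0..32 {
            self.reached[i] += other.reached[i];
        }
        self
    }
}

fn main() {
    let mut a = FunnelState::new();
    let mut b = FunnelState::new();
    a.update(0); a.update(0); a.update(1); // shard A's slice of events
    b.update(0); b.update(2);              // shard B's slice of events
    let total = a.combine(&b);
    println!("{:?}", &total.reached[..3]); // prints "[3, 1, 1]"
}
```

This is the property that lets DuckDB's executor shard the input freely: the scan phase dominates, and the reduction tree at the end is negligible.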

Engineering bar

Every pull request is mutation-tested in CI via cargo-mutants; the CI gate lives in the repository's .github/workflows/ directory alongside the criterion benchmark workflow. The extension is installable directly from DuckDB's community extensions repository with INSTALL behavioral FROM community;. No compilation, no toolchain, no Docker image.

What it enables downstream

Having these primitives inside DuckDB rather than in a separate warehouse changes what a small team can ship. Funnel analysis, retention cohorts, and sequence matching become things you can embed in a single-binary application (as in mallardmetrics) or drop onto a Raspberry Pi deployment (as in BirdNet-Behavior).

Business impact

For a small team, the tool that used to mean “adopt ClickHouse, run a cluster, learn its ops” now means INSTALL behavioral FROM community; and a DuckDB file. That is the difference between getting product analytics next quarter and having funnels by end of day.