duckdb-behavioral
ClickHouse-style behavioral analytics, inside DuckDB
Highlights
- Seven functions covering sessionization, retention, funnels, and sequence analysis
- Sessionize at ~830M elements/sec on a Ryzen 7840U (single core, reproducible in one `cargo bench` command, source linked below)
- NFA-based pattern matcher supporting 2-32 boolean conditions (ClickHouse parity)
- Bitmask event storage and O(1) combine operations
- Mutation-tested in CI, installable from DuckDB's community extensions repository
Background
Behavioral analytics, the primitives that power product-analytics
dashboards, growth experiments, and user-journey instrumentation,
is one of the workloads ClickHouse has historically owned. If you
wanted windowFunnel, retention, or sequenceMatch in SQL, your
choices were a dedicated columnar cluster, a cloud analytical
warehouse, or writing the logic by hand in application code. None
of those options match the world where DuckDB has become the
default analytics database for where the data actually is.
duckdb-behavioral is a community extension that ports the
ClickHouse behavioral-analytics surface into DuckDB, with API
parity where parity makes sense and deliberate divergence where
DuckDB’s architecture offers a better fit.
The function set
Seven operators, each with a specific purpose:
sessionizeassigns session IDs based on inactivity gaps, operating as a window function over event streams.retentioncomputes cohort retention matrices across arbitrary time periods.window_funneltracks user progress through ordered conversion steps within a time window.sequence_matchruns NFA-based pattern matching across event sequences.sequence_countcounts non-overlapping pattern occurrences.sequence_match_eventsreturns timestamps of matched pattern steps.sequence_next_nodeidentifies what happens immediately before or after a pattern match.
The NFA pattern matcher supports 2-32 boolean conditions per pattern, matching ClickHouse’s ceiling, and the compile step is deterministic so the same query plans the same way twice.
Performance: reproduce it in one command
The headline number is ~830M elements/sec on a single core
of an AMD Ryzen 7 7840U (32 GB RAM, Ubuntu 24.04, rustc stable
at the time of the last benchmark run). That’s roughly one
billion events per 1.2 seconds per core, and it scales close to
linearly across cores because the combine phase is O(1).
You don’t have to take my word for any of it. The recipe is:
git clone https://github.com/tomtom215/duckdb-behavioral
cd duckdb-behavioral
cargo bench --bench sessionize
Benchmark source:
benches/sessionize.rs.
The harness uses criterion,
emits its own methodology header (input distribution, gap
parameter, iteration count), and writes HTML reports to
target/criterion/ so you can compare runs across commits.
If you see a different number on different hardware, that’s
the point: I’d rather have you reproduce the shape of the
number than believe a single one I quoted.
A few design choices combine to get there. Bitmask event storage keeps the hot paths branch-free, so membership in a pattern becomes a bit-test rather than a string comparison or hash lookup. O(1) combine operations let parallel aggregation scale linearly with cores, because DuckDB’s parallel executor can shard the input freely and the combine phase doesn’t become a bottleneck. And NFA compilation is deferred until the pattern is first matched, so query-plan rebinds stay cheap.
Engineering bar
Mutation-tested in CI via
cargo-mutants on
every pull request; the CI gate lives in the repository’s
.github/workflows/ directory alongside the criterion benchmark
workflow. The extension is installable directly from DuckDB’s
community extensions repository with
INSTALL behavioral FROM community;. No compilation, no
toolchain, no Docker image.
What it enables downstream
Having these primitives inside DuckDB rather than in a separate warehouse changes what a small team can ship. Funnel analysis, retention cohorts, and sequence matching become things you can embed in a single-binary application (as in mallardmetrics) or drop onto a Raspberry Pi deployment (as in BirdNet-Behavior).
Business impact
For a small team, the tool that used to mean “adopt ClickHouse,
run a cluster, learn its ops” now means
INSTALL behavioral FROM community; and a DuckDB file. That is
the difference between getting product analytics next quarter and
having funnels by end of day.