
mallardmetrics

Privacy-first web analytics as a single binary

Language: Rust · License: AGPL-3.0 · View on GitHub →

Highlights

  • Single static binary with no external dependencies
  • IP addresses processed ephemerally in RAM, never persisted
  • HMAC-SHA256 visitor hashes (IP + UA + salt) whose salt rotates every 24 hours
  • Optional GDPR mode with country-only geolocation, hourly-bucketed timestamps, no fingerprints
  • Two-tier storage: hot DuckDB writes with cold Parquet (ZSTD) archival
  • Funnel analysis, cohort retention, and sequence matching via the behavioral extension

Background

Mallard Metrics started from a question: if the behavioral-analytics primitives from duckdb-behavioral let you do cohort retention and funnel analysis inside DuckDB, what’s the smallest possible analytics platform you can ship on top of them? The answer turned out to be one Rust binary, one DuckDB file, and zero external services.

Why build another analytics tool

The current generation of privacy-respecting web analytics has proved the market, but most of those tools still require a database server, a reverse proxy, and an ops burden that scales poorly for small teams and self-hosters. Mallard Metrics collapses the stack: the application, the database (DuckDB), the analytics engine (the behavioral extension), and the cold-storage writer all live in a single Rust process, deployed as a single static binary. There’s no network hop to the database, no separate query worker, no background job queue.

The collapse has a second-order benefit. Because everything runs inside one address space, the privacy model can be enforced in code rather than in configuration. There’s no connection string that could accidentally point at an IP-logging database, because the IPs never leave the process’s RAM in the first place.

The privacy architecture

The interesting property is not “we don’t store IPs.” Plenty of tools claim that. The property is that IP addresses are never written to disk at all. They enter the ingestion pipeline, get hashed into a daily-rotating visitor ID alongside the User-Agent and a rotating salt, and then the raw IP is dropped before the row is persisted. Because the salt rotates every 24 hours, the same visitor appears as a different ID tomorrow, so you get daily uniques without a persistent tracking primitive.
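The rotation logic can be sketched in a few lines of Rust. This is an illustration, not the project's actual code: `DefaultHasher` stands in for HMAC-SHA256, the salt here is derived from the day number rather than being a random in-memory secret, and `visitor_id` is a hypothetical helper name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a salt that changes every 24 hours.
/// (A real implementation would use a random salt held only in RAM.)
fn daily_salt(now_secs: u64) -> u64 {
    now_secs / 86_400 // UTC day number
}

/// Hash IP + User-Agent + daily salt into a visitor ID.
/// DefaultHasher stands in for HMAC-SHA256 in this sketch.
fn visitor_id(ip: &str, user_agent: &str, now_secs: u64) -> u64 {
    let mut h = DefaultHasher::new();
    ip.hash(&mut h);
    user_agent.hash(&mut h);
    daily_salt(now_secs).hash(&mut h);
    h.finish() // the raw IP is dropped after this point
}

fn main() {
    let now = 1_704_067_200; // some Unix timestamp
    let today = visitor_id("203.0.113.7", "Mozilla/5.0", now);
    let tomorrow = visitor_id("203.0.113.7", "Mozilla/5.0", now + 86_400);
    // Same visitor on different days yields different IDs.
    assert_ne!(today, tomorrow);
}
```

Because the salt is an input to the hash, yesterday's IDs cannot be linked to today's even by the operator, yet within a single day the same (IP, UA) pair maps to a stable ID, which is all that counting daily uniques requires.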

For teams under strict data-residency or GDPR regimes, a second switch enables country-level geolocation only, hourly-bucketed timestamps, and zero device fingerprints. The same codebase, different defaults.
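Hourly bucketing is simple to state precisely: truncate each event's timestamp to the start of its hour before it is stored, so no sub-hour timing survives. A minimal sketch (`bucket_to_hour` is a hypothetical helper, not from the codebase):

```rust
/// Truncate a Unix timestamp to the start of its hour, so stored
/// events carry no sub-hour timing information.
fn bucket_to_hour(ts_secs: u64) -> u64 {
    ts_secs - (ts_secs % 3_600)
}

fn main() {
    // 2024-01-01 00:17:45 UTC -> 2024-01-01 00:00:00 UTC
    let bucketed = bucket_to_hour(1_704_067_200 + 17 * 60 + 45);
    assert_eq!(bucketed, 1_704_067_200);
}
```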

The architecture

Three layers, each doing one thing well:

  1. Ingestion handles origin validation, bot filtering, GeoIP resolution, and in-memory buffering. IP addresses live in this layer, in RAM, for the milliseconds it takes to hash them.
  2. Storage is hot DuckDB writes for recent data, with cold Parquet archival using ZSTD compression. Old data ages into columnar files that stay queryable but take a fraction of the space.
  3. Query is TTL-cached DuckDB analytical functions, including funnels and retention from the behavioral extension. Dashboards hit the cache first, and only cold or uncached ranges pay the full query cost.
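The query layer's cache-first behavior can be sketched as a minimal TTL cache. This is an illustration under assumed names (`TtlCache`, `get_or_compute` are hypothetical), not the project's actual caching code:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal TTL cache sketch for query results: entries expire after
/// a fixed duration and are recomputed on the next lookup.
struct TtlCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached result if fresh, otherwise run `compute`
    /// (the full DuckDB query, in the real system) and cache it.
    fn get_or_compute(&mut self, key: &str, compute: impl FnOnce() -> String) -> String {
        if let Some((stored_at, value)) = self.entries.get(key) {
            if stored_at.elapsed() < self.ttl {
                return value.clone(); // cache hit: no query cost
            }
        }
        let value = compute(); // cache miss: pay the full query cost
        self.entries.insert(key.to_string(), (Instant::now(), value.clone()));
        value
    }
}

fn main() {
    let mut cache = TtlCache::new(Duration::from_secs(60));
    let first = cache.get_or_compute("pageviews:2024-01", || "12345".to_string());
    // A second call inside the TTL returns the cached value;
    // the closure is never invoked.
    let second = cache.get_or_compute("pageviews:2024-01", || "recomputed".to_string());
    assert_eq!(first, second);
}
```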

Design trade-offs

DuckDB over SQLite for the hot tier is the main deliberate choice. Behavioral-analytics queries (sessionize, retention, funnels) are columnar-shaped workloads, and DuckDB is a much better fit than a row-oriented store. The cost is a slightly larger runtime footprint, which is a reasonable trade for an analytics tool.

What it’s demonstrating

Mallard Metrics is, in part, a showcase for what the Rust plus DuckDB plus behavioral-extension stack can do as a platform. A full self-hosted analytics product with funnels and retention in a single static binary was genuinely hard to build in 2023 and became a weekend project once the pieces were in place.