Benchmarks | Codag

tl;dr · matches Drain3 grouping on 42k LogHub lines (GA 0.77 · purity 0.98) · +0.111 FTA, paired 95% CI excludes 0 · 168×/41× line/char compression · at 3k-line windows beats raw +0.146 & Drain3 +0.078 on agent diagnosis (p<0.01) at ~10% of raw tokens

01 · deterministic parser

Grouping quality on LogHub-2.0

3,000 lines × 14 systems = 42,000 oracle-labeled lines. No model calls, fully reproducible from the repo. CIs are non-parametric bootstrap over the 14 systems (macro-system mean), so correlated lines inside one system can't manufacture significance. There is no grouping metric for ungrouped raw logs, so the control appears only in compression and the agent eval below.
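To make the resampling unit concrete, here is a minimal sketch of a percentile bootstrap over systems. It assumes each system has already been reduced to one score; the function name, resample count, and seed are illustrative and are not taken from the repo.

```python
import random
from statistics import mean

def bootstrap_ci_over_systems(per_system_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the macro-system mean.

    per_system_scores holds one value per system (14 here), so the system,
    not the individual log line, is the unit that gets resampled.
    """
    rng = random.Random(seed)
    k = len(per_system_scores)
    means = []
    for _ in range(n_resamples):
        # Draw k systems with replacement and take their mean.
        resample = [per_system_scores[rng.randrange(k)] for _ in range(k)]
        means.append(mean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(per_system_scores), lo, hi
```

Because only 14 values are resampled, the interval reflects between-system spread, which is why the CIs here are wide relative to what 42,000 independent lines would give.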

codag-drain vs Drain3 · grouping metrics

bars = macro-system mean · whiskers = bootstrap 95% CI over 14 systems · higher is better

Honest read: GA, FGA and purity are identical, so codag-drain does not discover different member groups than Drain3. The separation is in FTA: codag-drain renders a more oracle-like template string for the same groups, giving a paired +0.111 FTA over Drain3 (95% CI [+0.027, +0.230]).
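A toy example shows how two parsers can tie on grouping and still differ on template accuracy. The sketch below follows the way GA and template F1 are commonly defined in the log-parsing literature; the repo's implementation may differ in detail, and the function names and sample templates are made up for illustration.

```python
from collections import defaultdict

def groups(labels):
    """Map each template string to the set of line indices carrying it."""
    out = defaultdict(set)
    for i, lab in enumerate(labels):
        out[lab].add(i)
    return out

def grouping_accuracy(pred, gold):
    """Fraction of lines whose predicted group has exactly the gold membership."""
    pred_groups, gold_groups = groups(pred), groups(gold)
    correct = sum(
        len(members) for members in pred_groups.values()
        if members == gold_groups[gold[next(iter(members))]]
    )
    return correct / len(gold)

def template_f1(pred, gold):
    """F1 over templates: a hit needs matching membership AND matching string."""
    pred_groups, gold_groups = groups(pred), groups(gold)
    hits = sum(1 for t, m in pred_groups.items() if gold_groups.get(t) == m)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred_groups), hits / len(gold_groups)
    return 2 * precision * recall / (precision + recall)

# Hypothetical data: both parsers group the lines identically,
# but parser_b renders one template string differently from the oracle.
gold     = ["open <*>", "open <*>", "close <*>"]
parser_a = ["open <*>", "open <*>", "close <*>"]
parser_b = ["open <*> <*>", "open <*> <*>", "close <*>"]
```

Here both parsers score a grouping accuracy of 1.0, while the template F1 is 1.0 for `parser_a` and 0.5 for `parser_b`. That is the same shape as the result above: identical groups, different rendered strings.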

168×

line compression
CI [68×, 321×] vs raw (1×)

41×

char compression
CI [17×, 80×] of rendered template

88k

lines/sec grouping · single core
0.84× Drain3, no GPU

34k

lines/sec full render
+ slot summaries & samples

codag-drain is slightly slower than bare Drain3 because the render path also derives the template string, captures slot summaries, and selects bounded raw samples — the artifact an agent actually reads. It is still sub-millisecond per line on one core.
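The sub-millisecond figure follows directly from the two throughput tiles. As a quick arithmetic check (the helper name is illustrative):

```python
def micros_per_line(lines_per_sec):
    """Convert single-core throughput into average microseconds per line."""
    return 1_000_000 / lines_per_sec

grouping_us = micros_per_line(88_000)  # about 11.4 microseconds per line
render_us = micros_per_line(34_000)    # about 29.4 microseconds per line
```

Both values sit well under 1,000 microseconds, so even the full render path is more than an order of magnitude inside the sub-millisecond bound.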

02 · agent-serving eval

Does the artifact help an agent diagnose?

80 labeled incident windows. For each, we build one blinded artifact per arm, ask gpt-5.5 to diagnose it with no gold labels, then a separate blind judge scores each diagnosis against the gold root cause. Raw is capped at the serving budget (80k chars ≈ 20k tokens) — this is a practical context limit, not an infinite-log oracle. Deltas are paired, with bootstrap CIs and one-sided permutation p-values.
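For readers unfamiliar with the paired statistics, the following is a minimal sketch of a one-sided sign-flip permutation test plus a win/loss/tie count over per-incident scores. It assumes two equal-length score lists for the same incidents; the function names and defaults are illustrative, and the harness in the repo may compute these differently.

```python
import random

def paired_permutation_p(a, b, n_perm=10_000, seed=0):
    """One-sided p-value for mean(a - b) > 0 under random sign flips.

    Under the null hypothesis the two arms are exchangeable within each
    incident, so each paired difference is equally likely to be +d or -d.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    at_least = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_perm + 1)

def win_loss_tie(a, b):
    """Count incidents where arm a scores above, below, or equal to arm b."""
    wins = sum(1 for x, y in zip(a, b) if x > y)
    losses = sum(1 for x, y in zip(a, b) if x < y)
    return wins, losses, len(a) - wins - losses
```

Pairing matters because incidents vary widely in difficulty. Comparing the two arms on the same incident removes that variance from the delta.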

Diagnosis score by window size

bars = mean blind-judge score (0–1) · whiskers = bootstrap 95% CI · n=80 incidents per arm per size

The crossover is the whole story. At 300 lines the raw control fits the budget and wins — we do not claim codag-drain beats raw on small windows. At 3,000 lines the raw log overflows the budget and gets truncated; codag-drain keeps the discriminating evidence and pulls ahead of both raw and Drain3.
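The budget arithmetic behind the crossover can be sketched as follows. This assumes the raw arm is cut from the head on whole-line boundaries, which is an assumption for illustration and not something the page specifies; the function name and the 99-character line length are likewise hypothetical.

```python
def cap_to_budget(lines, max_chars=80_000):
    """Keep whole lines from the start until the character budget is spent."""
    kept, used = [], 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return kept

# Hypothetical windows of 99-character lines (100 chars each with newline).
small_window = ["x" * 99] * 300    # 30,000 chars: fits the 80k budget
large_window = ["x" * 99] * 3000   # 300,000 chars: overflows the budget

small_kept = len(cap_to_budget(small_window))
large_kept = len(cap_to_budget(large_window))
```

With these illustrative line lengths the 300-line window survives intact, while the 3,000-line window is cut to 800 lines, so any evidence in the remaining 2,200 lines never reaches the model.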

Score vs artifact size — every incident, 3,000-line windows

x-axis · artifact tokens (log scale) · y-axis · blind-judge score · small dot = one incident, diamond = arm mean

Raw (red) is pinned to the right at the 20k-token budget cap (jittered horizontally for density) and its scores sag once truncation drops evidence. codag-drain (green) and Drain3 (gray) sit at ~1.5–2k tokens. Same context cost as Drain3, higher central score.

Paired comparisons · 3,000-line windows

paired over the same 80 incidents · n=80 · one-sided permutation test, α=0.05

comparison	Δ score	95% CI	p	W / L / T	Δ tokens

The defensible launch claim, verbatim from the eval doc: codag-drain improves agent diagnosis on large, noisy windows under a fixed artifact budget, while staying competitive with raw on small windows at a fraction of the token load. Drain3 vs raw at this size is positive but not significant (p=0.066); codag-drain vs raw is.

reproduce

Run the deterministic benchmark yourself.

The Section 01 numbers come straight from the repo — one cargo test harness, no model calls, no API key. Point it at LogHub-2.0 and it prints the grouping, compression, and timing tables with the same bootstrap CIs shown above.

$ git clone https://github.com/codag-megalith/codag-drain
$ cd codag-drain
$ export LOGHUB_DIR=/path/to/loghub2   # LogHub-2.0 structured CSV root
$ bash scripts/public_benchmarks.sh   # grouping + compression + timing

codag-drain on GitHub · MIT-licensed · Rust · CI-tested

Section 02 (agent diagnosis) needs a local labeled incident corpus and gpt-5.5 access, so it is not yet a one-command public benchmark. The harness, the per-incident scores plotted above, and the full methodology live in docs/AGENT_SERVING_EVAL.md.
