benchmarks
Operational decisions an on-call engineer can take from the output
% of 20 incidents where each output explicitly surfaces the answer
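The percentage above can be read as a simple ratio over the labeled incidents. A minimal sketch, assuming each incident is reduced to a boolean flag for "the output explicitly surfaces the answer"; the function name and the sample flags are hypothetical and are not taken from the benchmark repo or its results.

```python
def surfaced_rate(flags):
    """Return the percentage of incidents whose output surfaces the answer."""
    if not flags:
        raise ValueError("need at least one incident")
    return 100.0 * sum(1 for flag in flags if flag) / len(flags)


# Made-up flags for 20 incidents, for illustration only (not benchmark data).
example_flags = [True] * 13 + [False] * 7
print(f"{surfaced_rate(example_flags):.0f}% of {len(example_flags)} incidents")
```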
Cost per incident vs log size
x-axis · log size in lines (log scale) · y-axis · $ at gpt-5.5 published rates, log scale · hover for incident detail
End-to-end latency vs log size
x-axis · log size in lines (log scale) · y-axis · compactor + agent wall time, ms
Diagnostic accuracy vs log size
x-axis · log size in lines (log scale) · y-axis · LLM-judge overall_score (0–1)
reproduce
One repo, four open-source baselines + the codag CLI. LogHub-2.0 fetched on demand, ~20 hand-labeled incidents bundled. Results land in results/latest.json.
$ git clone https://github.com/codag/codag-log-bench
$ cd codag-log-bench && bash scripts/download_loghub.sh
$ codag auth login  # one-time browser sign-in
$ python -m codag_log_bench.run --baselines all
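Once a run finishes, results/latest.json can be inspected before building any charts. The schema of that file is not documented here, so this sketch assumes nothing about it beyond it being valid JSON; it only reports the top-level shape. The `summarize` helper is a hypothetical name, not part of the repo.

```python
import json
from pathlib import Path


def summarize(path):
    """Load a JSON file and describe its top-level shape without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Map each top-level key to the type name of its value.
        return {key: type(value).__name__ for key, value in data.items()}
    if isinstance(data, list):
        return {"list_length": len(data)}
    return {"value_type": type(data).__name__}


if __name__ == "__main__":
    results = Path("results/latest.json")
    if results.exists():
        for name, kind in summarize(results).items():
            print(f"{name}: {kind}")
    else:
        print(f"{results} not found; run the benchmark first")
```

Run it from the repo root so the relative path resolves.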