Each iteration is one hypothesis → one RTL implementation → 45+ formal checks → cosim against a Python ISS → 3-seed FPGA placement → CoreMark on Verilator. A single failed gate marks the iteration as broken. No surface-metric gaming.
The fitness score is Fmax × IPC measured on the same CoreMark workload, in iter/s.
The baseline V0 core scores 282.82 iter/s (Fmax = 127 MHz, LUT4 = 9,563). Every fitness number on the site is reported against this anchor.
If any gate fails, the iteration is marked broken and counted on the
leaderboard under broken_by_class. No score is awarded. The model gets a
new slot on the next round.
A rep is one independent tournament run with parameters
N=15 (rounds) and K=3 (slots per round).
Each round, the model produces 3 hypothesis YAMLs (in parallel, separate agent
invocations); each hypothesis is independently implemented as RTL, evaluated through the
three gates above, scored, and committed to the rep's log.jsonl. The
best-fitness implementation across all 3 slots becomes the new baseline for the next
round.
Three reps per model are run independently. They share no state. Each rep's final fitness
is published; the model's reported peak is the maximum across reps, and the mean is
averaged across completed reps (status = done).
The benchmark is reproducible from the source repository. Every per-iteration artifact is preserved:
bench/results.jsonl — one row per rep, structured.bench/<model>/rep<N>/log.jsonl — per-iteration journal with
fitness, LUT4, FF, Fmax, IPC, cycles, outcome class, and timestamp.bench/<model>/rep<N>/agent.log — full model transcript.
Every read, edit, bash, write
tool call the agent made.bench/<model>/rep<N>/summary.json — rolled-up summary with
cost, wall-clock, and the best-fitness entry's microarch metadata.
The hypothesis-implementation contract is in CLAUDE.md at the repo root.
The eval contract — wrapper.sv, checks.cfg, the Python ISS, the cosim harness — is in
formal/, fpga/, and test/cosim/. None of these
are modifiable by the agent; sandbox rolls back any iteration that touches them.