Methodology · v1

What HWE Bench measures, and how.

Each iteration is one hypothesis → one RTL implementation → 45+ formal checks → cosim against a Python ISS → 3-seed FPGA placement → CoreMark on Verilator. A single failed gate marks the iteration as broken. No surface-metric gaming.
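
The control flow is linear: correctness gates run first, and performance is measured only if all of them pass. A minimal Python sketch of that shape follows; the gate and measurement callables are stand-ins for the bench's real harness, not its actual API.

```python
from typing import Callable, Sequence

def evaluate_iteration(
    gates: Sequence[Callable[[], tuple[bool, str]]],
    measure: Callable[[], tuple[float, float, float]],
) -> dict:
    """Sketch of one iteration: gates first, then performance.

    `gates` stands in for lint/formal/cosim and `measure` for the
    place-and-route + CoreMark steps; both are assumptions about the
    harness's shape, not its real interfaces.
    """
    for gate in gates:
        ok, failure_class = gate()
        if not ok:
            # One failed gate breaks the iteration; no score is awarded.
            return {"outcome": "broken", "class": failure_class}
    fmax_mhz, ipc, iters_per_s = measure()
    return {"outcome": "ok", "fmax_mhz": fmax_mhz, "ipc": ipc, "fitness": iters_per_s}
```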

Score

Fitness = Fmax × IPC

The fitness score is Fmax × IPC measured on the same CoreMark workload, reported in iter/s. Because the workload is fixed, the instruction count per iteration is constant, so Fmax × IPC is directly proportional to CoreMark iterations per second.

  • Fmax — median operating frequency from 3 nextpnr seeds, placed on a Tang Nano 20K (Gowin GW2A-LV18QN88C8/I7).
  • IPC — instructions-per-cycle on CoreMark 2K with iStall+dStall backpressure, measured between start_time and stop_time markers.
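
As a concrete reading of these definitions, here is a small Python sketch. The input names are assumptions about what the harness reports (per-seed Fmax from nextpnr, instruction/cycle/iteration counts between the markers); the arithmetic is the point.

```python
from statistics import median

def score(seed_fmax_mhz: list[float], retired: int, cycles: int, iterations: int) -> dict:
    """Fitness per the definitions above; input names are illustrative."""
    fmax_hz = median(seed_fmax_mhz) * 1e6   # median over the 3 nextpnr seeds
    ipc = retired / cycles                   # IPC between start_time and stop_time
    # iter/s = iterations * Fmax / cycles; with a fixed workload this equals
    # Fmax * IPC divided by the (constant) instruction count per iteration.
    return {"fmax_mhz": fmax_hz / 1e6, "ipc": ipc, "fitness": iterations * fmax_hz / cycles}
```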

The baseline V0 core scores 282.82 iter/s (Fmax = 127 MHz, LUT4 = 9,563). Every fitness number on the site is reported against this anchor.

Correctness gates

Three gates per iteration

1. Verilator lint
RTL must pass verilator --lint-only -Wall. This gate is cheap and catches structural problems early.
2. riscv-formal
45+ .sby bounded model checks via SymbiYosys + bitwuzla, covering RV32IM instruction semantics, register-file forwarding, PC propagation, retirement uniqueness, liveness, and traps. The single-issue variant runs against formal/wrapper_si.sv; dual-issue cores run against formal/wrapper.sv. A single failed check fails the iteration.
3. Python ISS cosim
Every instruction retirement while running selftest.elf is diffed field-by-field against a Python instruction-set simulator that implements RV32IM by spec. Any divergence — wrong register write, missing trap, wrong PC — fails the iteration. CoreMark is checked separately via UART-CRC validation (the CRCs must match the canonical EEMBC values).

If any gate fails, the iteration is marked broken and counted on the leaderboard under broken_by_class. No score is awarded. The model gets a new slot on the next round.
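
To make the gate mechanics concrete, here is a minimal sketch of the first and third gates, assuming the harness shells out to Verilator and compares retirement traces as lists of dicts. The trace field names are assumptions; the real drivers and field set live in test/cosim/ and may differ.

```python
import subprocess

def lint_gate(rtl_files: list[str]) -> tuple[bool, str]:
    # Gate 1: Verilator lint with the exact flags named above.
    r = subprocess.run(["verilator", "--lint-only", "-Wall", *rtl_files],
                       capture_output=True, text=True)
    return r.returncode == 0, r.stderr

def cosim_gate(rtl_trace: list[dict], iss_trace: list[dict]) -> tuple[bool, str]:
    # Gate 3: field-by-field diff of every retirement against the Python ISS.
    # This field set is an assumption; see test/cosim/ for the real one.
    fields = ("pc", "insn", "rd", "rd_wdata", "trap")
    if len(rtl_trace) != len(iss_trace):
        return False, "retirement count mismatch"
    for i, (rtl, iss) in enumerate(zip(rtl_trace, iss_trace)):
        for f in fields:
            if rtl.get(f) != iss.get(f):
                return False, f"retirement {i}: {f} rtl={rtl.get(f)!r} iss={iss.get(f)!r}"
    return True, ""
```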

Tournament

How a single rep runs

A rep is one independent tournament run with parameters N=15 (rounds) and K=3 (slots per round). Each round, the model produces K=3 hypothesis YAMLs in parallel, as separate agent invocations; each hypothesis is independently implemented as RTL, evaluated through the three gates above, scored, and committed to the rep's log.jsonl. The best-fitness implementation across the K slots becomes the new baseline for the next round, as outlined in the sketch below.
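
The round loop, in outline. `propose` (the K parallel agent invocations) and `implement_and_eval` (the gate-and-score pipeline) are placeholders for the harness; how a round in which every slot is broken is handled is not spelled out above, so this sketch simply keeps the prior baseline in that case.

```python
def run_rep(propose, implement_and_eval, baseline: dict, n_rounds: int = 15, k: int = 3):
    """One rep: N rounds of K slots. The callables are stand-ins for the harness."""
    best, log = baseline, []
    for _ in range(n_rounds):
        hypotheses = propose(best, k)                       # K hypothesis YAMLs
        results = [implement_and_eval(best, h) for h in hypotheses]
        log.extend(results)                                 # every slot is journaled
        scored = [r for r in results if r["outcome"] == "ok"]
        if scored:
            best = max(scored, key=lambda r: r["fitness"])  # baseline for next round
        # All-broken rounds keep the previous baseline (an assumption; the
        # methodology text does not spell this case out).
    return best, log
```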

Three reps per model are run independently. They share no state. Each rep's final fitness is published; the model's reported peak is the maximum across reps, and the mean is taken over completed reps (status = done).
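
In code, the reporting rule reads roughly as follows; the row shape is an assumption about results.jsonl.

```python
def model_summary(rep_rows: list[dict]) -> dict:
    """Peak and mean fitness per the rule above; rep_rows shape is assumed."""
    fitnesses = [r["fitness"] for r in rep_rows if "fitness" in r]
    done = [r["fitness"] for r in rep_rows if r.get("status") == "done"]
    return {
        "peak": max(fitnesses),            # maximum across reps
        "mean": sum(done) / len(done),     # mean over completed reps only
    }
```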

Reproducibility

Re-run the whole bench from a fresh clone

The benchmark is reproducible from the source repository. Every per-iteration artifact is preserved:

  • bench/results.jsonl — one row per rep, structured.
  • bench/<model>/rep<N>/log.jsonl — per-iteration journal with fitness, LUT4, FF, Fmax, IPC, cycles, outcome class, and timestamp.
  • bench/<model>/rep<N>/agent.log — full model transcript: every read, edit, bash, and write tool call the agent made.
  • bench/<model>/rep<N>/summary.json — rolled-up summary with cost, wall-clock, and the best-fitness entry's microarch metadata.
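
For instance, recomputing a rep's best iteration from its journal takes a few lines. The key names below mirror the fields listed above but are assumptions; check them against a real row first.

```python
import json
from pathlib import Path

def best_from_log(path: str) -> dict | None:
    """Replay a rep journal and return its best-fitness iteration.

    Assumes one JSON object per line with "outcome" and "fitness" keys,
    per the field list above; the exact key names are an assumption.
    """
    rows = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    ok = [r for r in rows if r.get("outcome") == "ok"]
    return max(ok, key=lambda r: r["fitness"]) if ok else None
```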

The hypothesis-implementation contract is in CLAUDE.md at the repo root. The eval contract — wrapper.sv, checks.cfg, the Python ISS, the cosim harness — is in formal/, fpga/, and test/cosim/. None of these are modifiable by the agent; the sandbox rolls back any iteration that touches them.