RISC-V · RV32IM · single-issue · FPGA-grounded

HWE Bench

An unbounded benchmark for LLM hardware development. Models compete to design RISC-V CPU microarchitectures, measured by CoreMark fitness (Fmax × IPC) on a real Tang Nano 20K FPGA, gated by riscv-formal correctness and Python-ISS cosim.

Thesis: SWE-bench tops out at 100%. HWE Bench doesn't have a top.
The fitness number reflects an actual microarchitecture, and microarchitecture has headroom for as long as models keep finding improvements.
Fitness vs core size

Score vs LUT count

[Scatter plot: Fitness (CoreMark iter/s) vs LUT4 count. gpt-5_5_xhigh 525 · 5.5k LUT; gpt-5_4_xhigh 514 · 10.1k LUT; gpt-5_5_high 462 · 9.8k LUT; gpt-5_5_medium 432 · 7.8k LUT; kimi-k2_6 396 · 9.9k LUT; gemini-3_1-pro 355 · 10.2k LUT; VexRiscv human ref 370 · 4.0k LUT; baseline V0 283 · 9.6k LUT]
Fitness (CoreMark iter/s) on Y · LUT4 cost on X · one point per model's best rep. VexRiscv (3,957 LUT4 / fitness 370) is the human-engineered reference on the same FPGA. Up-and-left is the desirable direction: more fitness for less area.
Leaderboard

Peak fitness per model

Sorted by best single-rep peak fitness · 17 reps total · VexRiscv human reference and baseline V0 fixture included for comparison
#  Model                    Reps  Best (iter/s)  Δ% vs V0  Mean ± std (iter/s)  Best LUT4  Best Fmax (MHz)
1  gpt-5_5_xhigh            3/3   525.04         +85.6%    468.3 ± 52.8         5.5k       220
2  gpt-5_4_xhigh            2/2   513.84         +81.7%    505.0 ± 8.9          10.1k      203
3  gpt-5_5_high             3/3   461.87         +63.3%    430.2 ± 23.0         9.8k       187
4  gpt-5_5_medium           3/3   431.58         +52.6%    423.5 ± 11.2         7.8k       201
5  kimi-k2_6                2/3   396.13         +40.1%    339.5 ± 8.3          9.9k       166
6  VexRiscv (human ref)     —     370.00         +30.8%    —                    4.0k       129
7  gemini-3_1-pro           3/3   354.73         +25.4%    339.4 ± 12.6         10.2k      150
8  baseline V0 (fixture)    —     282.82         —         —                    9.6k       127

The VexRiscv row is the human-engineered reference: a well-known open-source RV32IM core, synthesized on the same Tang Nano 20K Gowin part used for the benchmark, with its fitness derived by scaling its CoreMark/MHz rating to the Fmax measured on this part. Five of the six LLM models produced designs that exceed it. Peak fitness includes reps that finalized with a failed status if their data was captured before the failure; the mean column excludes failed reps. Full details are on the methodology page.
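The scaling step for the reference row can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's actual harness: the function name and the ~2.87 CoreMark/MHz figure are assumptions chosen to be consistent with the 370 iter/s reading at the measured 129 MHz.

```python
def scaled_fitness(coremark_per_mhz: float, fmax_mhz: float) -> float:
    """Scale a CoreMark/MHz rating to CoreMark iterations per second
    at the Fmax measured on the target FPGA part."""
    return coremark_per_mhz * fmax_mhz

# ~2.87 CoreMark/MHz (assumed) at the 129 MHz measured on the
# Tang Nano 20K gives the reference fitness reported above.
print(round(scaled_fitness(2.87, 129.0), 1))  # → 370.2
```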

Why unbounded

SWE-bench saturates. HWE Bench doesn't.

Most LLM benchmarks have a fixed ceiling. SWE-bench tops out at 100% issue-resolution. Multiple-choice evals approach 99%. Once a model lands at the ceiling, every subsequent model gets the same score, and the benchmark stops being useful for tracking capability.

HWE Bench has no ceiling. The fitness score scales with Fmax × IPC (operating frequency times instructions per cycle) and is measured as CoreMark iterations per second on a real FPGA. There is no fixed maximum; better microarchitecture always scores higher. As long as models can find new tricks (deeper pipelines, smarter predictors, restructured ALUs), the leaderboard keeps moving.
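To make the arithmetic concrete, here is a minimal sketch of how Fmax and IPC translate into a CoreMark iter/s figure. The function name, the 0.7 IPC, and the ~293k dynamic instructions per CoreMark iteration are illustrative assumptions (the real count depends on compiler flags and the RV32IM code generated), not the benchmark's measured values.

```python
def coremark_fitness(fmax_mhz: float, ipc: float, insts_per_iter: float) -> float:
    """CoreMark iterations per second:
    (cycles/sec) × (instructions/cycle) ÷ (instructions/iteration)."""
    insts_per_sec = fmax_mhz * 1e6 * ipc
    return insts_per_sec / insts_per_iter

# A core at 220 MHz retiring 0.7 IPC, assuming ~293k instructions per
# CoreMark iteration, scores in the range of the leaderboard's top entry.
print(round(coremark_fitness(220.0, 0.7, 293_000), 1))  # → 525.6
```

Because the score is a product of two physical quantities, any microarchitectural change that raises frequency without proportionally hurting IPC (or vice versa) moves the number, with no saturation point built in.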

Empirically: the current best is 525.04 iter/s, +85.6% over the V0 baseline core, and clear of the VexRiscv human reference. Each successive batch of reps has produced at least one design that beats the prior record. The curve has not plateaued.

Trajectory

Fitness over rounds — best rep per model

[Line chart: running-max fitness over rounds R0–R15, one line per model's best rep. Final values at R15: gpt-5_5_xhigh 525 · gpt-5_4_xhigh 514 · gpt-5_5_high 462 · gpt-5_5_medium 432 · kimi-k2_6 396 · gemini-3_1-pro 355. Dashed references: VexRiscv 370, baseline V0 283.]
Running max of CoreMark fitness across the 15 hypothesis rounds (one hypothesis × three slots each) for each model's best-performing rep. Lines step up when a winning hypothesis lands and stay flat otherwise; the VexRiscv human reference (370) and the baseline V0 core (283) are shown as dashed reference lines.