RISC-V · RV32IM · single-issue · FPGA-grounded

HWE Bench

An unbounded benchmark for LLM hardware development. Models compete to design RISC-V CPU microarchitectures, measured by CoreMark fitness (Fmax × IPC) on a real Tang Nano 20K FPGA, gated by riscv-formal correctness and Python-ISS cosim.

Thesis: SWE-bench tops out at 100%. HWE Bench doesn't have a top.
The fitness number reflects an actual microarchitecture, and microarchitecture has headroom for as long as models keep finding improvements.
Fitness vs core size

Score vs LUT count

[Scatter plot: Fitness (CoreMark iter/s) vs LUT4 count. gpt-5_5_xhigh 525 · 5.5k LUT; gpt-5_4_xhigh 514 · 10.1k LUT; gpt-5_5_high 462 · 9.8k LUT; gpt-5_5_medium 432 · 7.8k LUT; kimi-k2_6 396 · 9.9k LUT; gemini-3_1-pro 355 · 10.2k LUT; VexRiscv human ref 370 · 4.0k LUT; baseline V0 283 · 9.6k LUT]
Fitness (CoreMark iter/s) on Y · LUT4 cost on X · one point per model's best rep. VexRiscv (3,957 LUT4 / fitness 370) is the human-engineered reference on the same FPGA. Up-and-left is the desirable direction: more fitness for less area.
Leaderboard

Peak fitness per model

Sorted by best single-rep peak fitness · 17 reps total · VexRiscv human reference and baseline V0 fixture included for comparison
#  Model                    Reps  Best (iter/s)  Δ% vs V0  Mean ± std (iter/s)  Best LUT4  Best Fmax (MHz)
1  gpt-5_5_xhigh            3/3   525.04         +85.6%    468.3 ± 52.8         5.5k       220
2  gpt-5_4_xhigh            2/2   513.84         +81.7%    505.0 ± 8.9          10.1k      203
3  gpt-5_5_high             3/3   461.87         +63.3%    430.2 ± 23.0         9.8k       187
4  gpt-5_5_medium           3/3   431.58         +52.6%    423.5 ± 11.2         7.8k       201
5  kimi-k2_6                2/3   396.13         +40.1%    339.5 ± 8.3          9.9k       166
6  VexRiscv (human ref)     —     370.00         +30.8%    —                    4.0k       129
7  gemini-3_1-pro           3/3   354.73         +25.4%    339.4 ± 12.6         10.2k      150
8  baseline V0 (fixture)    —     282.82         —         —                    9.6k       127

The VexRiscv row is the human-engineered reference: a well-known open-source RV32IM core, synthesized on the same Tang Nano 20K Gowin part used for the benchmark, with its fitness derived by scaling its CoreMark/MHz rating to the Fmax measured on this part. Five of the six LLM models produced designs that exceed it. Peak fitness includes reps that finalized with a failed status if their data was captured before the failure; the mean column excludes failed reps. Full details are on the methodology page.
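The scaling step for the reference row can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's actual harness: the function name and the ~2.87 CoreMark/MHz figure are assumptions chosen to be consistent with the 370 iter/s reading at the measured 129 MHz.

```python
def scaled_fitness(coremark_per_mhz: float, fmax_mhz: float) -> float:
    """Scale a CoreMark/MHz rating to CoreMark iterations per second
    at the Fmax measured on the target FPGA part."""
    return coremark_per_mhz * fmax_mhz

# ~2.87 CoreMark/MHz (assumed) at the 129 MHz measured on the
# Tang Nano 20K gives the reference fitness reported above.
print(round(scaled_fitness(2.87, 129.0), 1))  # → 370.2
```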

Why unbounded

SWE-bench saturates. HWE Bench doesn't.

Most LLM benchmarks have a fixed ceiling. SWE-bench tops out at 100% issue-resolution. Multiple-choice evals approach 99%. Once a model lands at the ceiling, every subsequent model gets the same score, and the benchmark stops being useful for tracking capability.

HWE Bench has no ceiling. The fitness score scales with Fmax × IPC (operating frequency times instructions per cycle) and is measured as CoreMark iterations per second on a real FPGA. There is no fixed maximum; better microarchitecture always scores higher. As long as models can find new tricks (deeper pipelines, smarter predictors, restructured ALUs), the leaderboard keeps moving.
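To make the arithmetic concrete, here is a minimal sketch of how Fmax and IPC translate into a CoreMark iter/s figure. The function name, the 0.7 IPC, and the ~293k dynamic instructions per CoreMark iteration are illustrative assumptions (the real count depends on compiler flags and the RV32IM code generated), not the benchmark's measured values.

```python
def coremark_fitness(fmax_mhz: float, ipc: float, insts_per_iter: float) -> float:
    """CoreMark iterations per second:
    (cycles/sec) × (instructions/cycle) ÷ (instructions/iteration)."""
    insts_per_sec = fmax_mhz * 1e6 * ipc
    return insts_per_sec / insts_per_iter

# A core at 220 MHz retiring 0.7 IPC, assuming ~293k instructions per
# CoreMark iteration, scores in the range of the leaderboard's top entry.
print(round(coremark_fitness(220.0, 0.7, 293_000), 1))  # → 525.6
```

Because the score is a product of two physical quantities, any microarchitectural change that raises frequency without proportionally hurting IPC (or vice versa) moves the number, with no saturation point built in.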

Empirically: the current best is 525.04 iter/s, +85.6% over the V0 baseline core, and clear of the VexRiscv human reference. Each successive batch of reps has produced at least one design that beats the prior record. The curve has not plateaued.

Trajectory

Fitness over rounds — best rep per model

[Line chart: running-max fitness over rounds R0–R15, one line per model's best rep. Final values at R15: gpt-5_5_xhigh 525 · gpt-5_4_xhigh 514 · gpt-5_5_high 462 · gpt-5_5_medium 432 · kimi-k2_6 396 · gemini-3_1-pro 355. Dashed references: VexRiscv 370, baseline V0 283.]
Running max of CoreMark fitness across the 15 hypothesis rounds (one hypothesis × three slots each) for each model's best-performing rep. Lines step up when a winning hypothesis lands and stay flat otherwise; the VexRiscv human reference (370) and the baseline V0 core (283) are shown as dashed reference lines.