Raw data

Every iteration, every transcript.

The full per-iteration journal and agent transcript for every rep are committed to the repository. No data is summarized away. Below is the index.

Downloads

Aggregate

bench/results.jsonl: one structured row per rep. Schema: model, rep, status, final_fitness, best_fitness, baseline_fitness, delta_pct, iterations, accepted, rejected, broken, broken_by_class, wall_clock_sec, total_cost_usd, total_tokens_in/out, best_lut4, best_ff, best_fmax_mhz, best_iterations, best_cycles, best_ipc_coremark.
bench/leaderboard.csv: per-model aggregates (mean fitness, best fitness, broken counts).
bench/LEADERBOARD.md: human-readable leaderboard with failure-mode breakdowns.
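
As a rough illustration of how the aggregate file might be consumed, the sketch below groups the rows of bench/results.jsonl by model and averages best_fitness. It assumes each line is a standalone JSON object whose keys match the schema names listed above (model, best_fitness); the helper name mean_best_fitness is hypothetical and not part of the repository.

```python
import json
from collections import defaultdict
from statistics import mean


def mean_best_fitness(lines):
    """Return {model: mean best_fitness} from an iterable of JSONL lines.

    Assumption: every non-empty line is a JSON object with "model" and
    "best_fitness" keys, as named in the schema above.
    """
    by_model = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        value = row.get("best_fitness")
        if value is None:
            continue  # skip reps that recorded no best fitness
        by_model[row["model"]].append(float(value))
    return {model: mean(values) for model, values in by_model.items()}


if __name__ == "__main__":
    # Path taken from the index above; run from the repository root.
    with open("bench/results.jsonl", encoding="utf-8") as fh:
        for model, avg in sorted(mean_best_fitness(fh).items()):
            print(f"{model}\t{avg:.2f}")
```

The function takes an iterable of lines rather than a path, so it can be checked against a few hand-written rows before being pointed at the real file.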

Per-rep

Index of all reps

17 reps
Model	Rep	Status	Iters	Best fit	Log	Transcript	Summary
gemini-3_1-pro	rep1	done	46	354.73	log.jsonl	agent.log	summary.json
gemini-3_1-pro	rep2	done	46	339.62	log.jsonl	agent.log	summary.json
gemini-3_1-pro	rep3	done	46	323.92	log.jsonl	agent.log	summary.json
gpt-5_4_xhigh	rep1	done	46	496.11	log.jsonl	agent.log	summary.json
gpt-5_4_xhigh	rep2	done	46	513.84	log.jsonl	agent.log	summary.json
gpt-5_5_high	rep1	done	46	461.87	log.jsonl	agent.log	summary.json
gpt-5_5_high	rep2	done	46	420.61	log.jsonl	agent.log	summary.json
gpt-5_5_high	rep3	done	46	408.01	log.jsonl	agent.log	summary.json
gpt-5_5_medium	rep1	done	46	431.58	log.jsonl	agent.log	summary.json
gpt-5_5_medium	rep2	done	46	407.55	log.jsonl	agent.log	summary.json
gpt-5_5_medium	rep3	done	46	431.24	log.jsonl	agent.log	summary.json
gpt-5_5_xhigh	rep1	done	46	397.83	log.jsonl	agent.log	summary.json
gpt-5_5_xhigh	rep2	done	46	525.04	log.jsonl	agent.log	summary.json
gpt-5_5_xhigh	rep3	done	46	482.03	log.jsonl	agent.log	summary.json
kimi-k2_6	rep1	done	46	347.76	log.jsonl	agent.log	summary.json
kimi-k2_6	rep2	done	46	331.22	log.jsonl	agent.log	summary.json
kimi-k2_6	rep3	failed	31	396.13	log.jsonl	agent.log	summary.json

Each log.jsonl has one row per iteration: hypothesis ID, title, outcome (improvement / regression / broken), fitness, delta vs baseline, LUT4, FF, Fmax, IPC, cycles, error class if broken, timestamp. Each agent.log is the verbatim model transcript: every bash command, every file read, every write.
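
A per-rep journal can be summarized in a few lines. The sketch below tallies the outcome of each iteration in a log.jsonl file. The exact key names inside log.jsonl are not listed on this page, so the key "outcome" is an assumption (it is passed as a parameter so it can be changed); the three outcome values come from the description above. The function name tally_outcomes is hypothetical.

```python
import json
import sys
from collections import Counter


def tally_outcomes(lines, key="outcome"):
    """Count iterations per outcome from an iterable of JSONL lines.

    Assumption: each row stores its outcome (improvement / regression /
    broken) under `key`. Rows without that key are counted as "unknown".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        counts[row.get(key, "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Usage: python tally.py path/to/log.jsonl
    with open(sys.argv[1], encoding="utf-8") as fh:
        for outcome, n in tally_outcomes(fh).most_common():
            print(f"{outcome}\t{n}")
```

If the real files use a different field name, passing it as key= should be the only change needed.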