Crafter Station benchmark
Which agent writes better modern CSS?
Blind A/B voting across 60 current CSS challenges. Compare vanilla Codex outputs against Codex with css-bash context, then inspect the examples and evidence behind the result.
62 votes / 2 voters / last vote 39d ago
Benchmark console
60 tasks, 120 outputs
Examples
Inspect both outputs
Browse rounds, prompts, judge verdicts and rendered HTML without signing in.
Community
See aggregate votes
Track css-bash, vanilla and tie preferences across every challenge.
Evidence
Read the model runs
Check the AI Gateway batches and long-run evidence from the experiment.