The METR evals for Gemini 3.0 and Opus 4.5 are taking incredibly long, while GPT 5.1 Codex Max and others were benchmarked almost instantly. Why is that?

Posted by Blankdot
