![[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang](https://cloud.blankdot.ai/og-images/d720859fb188eb0a.webp)
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents. It is trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale.