Task-Specific LLM Evals that Do & Don’t Work — Blankdot