RALPHBench: Evaluating Agents on Extremely Long-Horizon SWE Tasks
A rigorous benchmark measuring autonomous coding agents on real production-scale engineering challenges requiring sustained reasoning over 1,000–10,000+ steps.
Every capability leap needs a new benchmark.
Agent Performance
Coming soon: pass rates across agent–model configurations on RALPHBench. Results will be published after initial benchmark runs.
| # | Agent | Model | Pass rate (with skills) |
|---|---|---|---|
| 1 | Gemini CLI | Gemini 3 Flash | 48.7% |
| 2 | Claude Code | Opus 4.5 | 45.3% |
| 3 | Codex | GPT-5.2 | 44.7% |
| 4 | Claude Code | Opus 4.6 | 44.5% |
| 5 | Gemini CLI | Gemini 3 Pro | 41.2% |
| 6 | Claude Code | Sonnet 4.5 | 31.8% |
| 7 | Claude Code | Haiku 4.5 | 27.7% |
Task Registry
Tasks coming soon
Build a fully functional C compiler from scratch that can compile a subset of C to x86-64 assembly, including lexer, parser, semantic analysis, and code generation phases.
Build a production-ready Java Language Server Protocol implementation from scratch with support for code completion, go-to-definition, find references, diagnostics, and workspace symbol search.
Migrate a large-scale Vite + React SPA to Next.js App Router, converting client-side routing to file-based routing, implementing SSR/SSG where appropriate, and preserving all functionality.
Implement a serverless backend SDK similar to Modal, with support for function decorators, GPU scheduling, container management, and distributed task execution across cloud infrastructure.
Build a real-time team communication platform with channels, direct messages, threads, file sharing, search, and WebSocket-based live updates across multiple workspaces.
Create a high-performance issue tracking system with keyboard-first navigation, real-time sync, custom workflows, sprint planning, and a GraphQL API with optimistic updates.