v0.1 — Pre-launch

RALPHBench: Evaluating Agents on Extremely Long-Horizon SWE Tasks

A rigorous benchmark measuring autonomous coding agents on real production-scale engineering challenges requiring sustained reasoning over 1,000–10,000+ steps.

Every capability leap needs a new benchmark.

2021 · HumanEval · Function completion · ~1 min per task
2023 · SWE-bench · Real GitHub issues · ~5–15 min per task
2025 · Terminal Bench · Multi-step terminal tasks · ~5–15 min per task
2026 · RALPHBench · Days-long agentic work · Hours to days per task

Agent Performance

Coming Soon

Pass rates across agent–model configurations on RALPHBench.

#   Agent         Model             With Skills
1   Gemini CLI    Gemini 3 Flash    48.7%
2   Claude Code   Opus 4.5          45.3%
3   Codex         GPT-5.2           44.7%
4   Claude Code   Opus 4.6          44.5%
5   Gemini CLI    Gemini 3 Pro      41.2%
6   Claude Code   Sonnet 4.5        31.8%
7   Claude Code   Haiku 4.5         27.7%
84 tasks · 5 trials per task · 95% CIs
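
The page does not say how the 95% confidence intervals are computed. As a rough illustration only, the sketch below pools all trials (84 tasks × 5 trials = 420) and applies a Wilson score interval to the pooled pass rate. The Wilson method, the pooling of trials as if they were independent, and the pass count used here are all assumptions made for the example; they are not RALPHBench's documented procedure or data.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


TASKS = 84            # from the leaderboard footer
TRIALS_PER_TASK = 5   # from the leaderboard footer
total_trials = TASKS * TRIALS_PER_TASK

# Hypothetical pass count, chosen only to exercise the function.
hypothetical_passes = 210

low, high = wilson_interval(hypothetical_passes, total_trials)
rate = hypothetical_passes / total_trials
print(f"pass rate {rate:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

Trials of the same task are likely correlated, so an interval that treats all 420 trials as independent may be too narrow; resampling whole tasks (a task-level bootstrap) would be a more conservative alternative.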

Task Registry

Coming Soon

compilers · hard

Build a fully functional C compiler from scratch that can compile a subset of C to x86-64 assembly, including lexer, parser, semantic analysis, and code generation phases.

#c #llvm #compiler

lsp-tooling · hard

Build a production-ready Java Language Server Protocol implementation from scratch with support for code completion, go-to-definition, find references, diagnostics, and workspace symbol search.

#java #lsp #language-server

Migrate a large-scale Vite + React SPA to Next.js App Router, converting client-side routing to file-based routing, implementing SSR/SSG where appropriate, and preserving all functionality.

#nextjs #vite #react

backend-sdks · hard

Implement a serverless backend SDK similar to Modal, with support for function decorators, GPU scheduling, container management, and distributed task execution across cloud infrastructure.

#python #serverless #gpu

full-stack-apps · hard

Build a real-time team communication platform with channels, direct messages, threads, file sharing, search, and WebSocket-based live updates across multiple workspaces.

#typescript #websockets #real-time

full-stack-apps · hard

Create a high-performance issue tracking system with keyboard-first navigation, real-time sync, custom workflows, sprint planning, and a GraphQL API with optimistic updates.

#typescript #graphql #react