RALPHBench: Evaluating Agents on Extremely Long-Horizon SWE Tasks

A rigorous benchmark measuring autonomous coding agents on real production-scale engineering challenges requiring sustained reasoning over 1,000–10,000+ steps.

Every capability leap needs a new benchmark.

2021

HumanEval

Function completion

~1 min per task

unlocked code LLMs

2023

SWE-bench

Real GitHub issues

~5–15 min per task

unlocked coding agents

2025

Terminal-Bench

Multi-step terminal tasks

~5–15 min per task

unlocked terminal agents

2026

RALPHBench

Days-long agentic work

hours to days per task

unlocking autonomous agents

Agent Performance

Coming Soon

Pass rates across agent–model configurations on RALPHBench. Results will be published after initial benchmark runs.

#	Agent	Model	Without	With Skills	Δ (pp)
1	Gemini CLI	Gemini 3 Flash	31.3%	48.7%	+17.4
2	Claude Code	Opus 4.5	22.0%	45.3%	+23.3
3	Codex	GPT-5.2	30.6%	44.7%	+14.1
4	Claude Code	Opus 4.6	30.6%	44.5%	+13.9
5	Gemini CLI	Gemini 3 Pro	27.6%	41.2%	+13.6
6	Claude Code	Sonnet 4.5	17.3%	31.8%	+14.5
7	Claude Code	Haiku 4.5	11.0%	27.7%	+16.7

84 tasks · 5 trials per task · 95% CIs
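
The Δ column in the table is the difference between the two pass rates, in percentage points. As a minimal sketch, the snippet below recomputes Δ from the listed rates and adds a normalized gain, assumed here to be Δ divided by the remaining headroom (100 − Without). The page does not state how it defines normalized gain, so that formula is an assumption, and the `Row` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One agent-model configuration from the results table (hypothetical type)."""
    agent: str
    model: str
    without: float      # pass rate without skills, in percent
    with_skills: float  # pass rate with skills, in percent

    @property
    def delta(self) -> float:
        """Absolute gain in percentage points, rounded like the table."""
        return round(self.with_skills - self.without, 1)

    @property
    def normalized_gain(self) -> float:
        """Fraction of the available headroom that was closed.

        Assumed definition: (with - without) / (100 - without).
        """
        return (self.with_skills - self.without) / (100.0 - self.without)


# Pass rates copied from the table above.
ROWS = [
    Row("Gemini CLI", "Gemini 3 Flash", 31.3, 48.7),
    Row("Claude Code", "Opus 4.5", 22.0, 45.3),
    Row("Codex", "GPT-5.2", 30.6, 44.7),
    Row("Claude Code", "Opus 4.6", 30.6, 44.5),
    Row("Gemini CLI", "Gemini 3 Pro", 27.6, 41.2),
    Row("Claude Code", "Sonnet 4.5", 17.3, 31.8),
    Row("Claude Code", "Haiku 4.5", 11.0, 27.7),
]

for row in ROWS:
    print(f"{row.agent} / {row.model}: "
          f"delta={row.delta:+.1f} pp, normalized gain={row.normalized_gain:.3f}")
```

Under this assumed definition, a configuration that starts from a lower baseline needs a larger absolute Δ to reach the same normalized gain, which is why the two columns can rank agents differently.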
Task Registry

Coming Soon

c-compiler

compilers

hard

Build a working C compiler from scratch that compiles a subset of C to x86-64 assembly, with lexer, parser, semantic analysis, and code generation phases.

#llvm

#compiler

java-lsp-server

lsp-tooling

hard

Build a production-ready Java Language Server Protocol implementation from scratch with support for code completion, go-to-definition, find references, diagnostics, and workspace symbol search.

#java

#lsp

#language-server

vite-to-nextjs-migration

frameworks

hard

Migrate a large-scale Vite + React SPA to Next.js App Router, converting client-side routing to file-based routing, implementing SSR/SSG where appropriate, and preserving all functionality.

#nextjs

#vite

#react

modal-backend-sdk

backend-sdks

hard

Implement a serverless backend SDK similar to Modal, with support for function decorators, GPU scheduling, container management, and distributed task execution across cloud infrastructure.

#python

#serverless

#gpu

slack-clone

full-stack-apps

hard

Build a real-time team communication platform with channels, direct messages, threads, file sharing, search, and WebSocket-based live updates across multiple workspaces.

#typescript

#websockets

#real-time

linear-clone

full-stack-apps

hard

Create a high-performance issue-tracking system with keyboard-first navigation, real-time sync, custom workflows, sprint planning, and a GraphQL API with optimistic updates.

#typescript

#graphql

#react
