From Zero to 21%: How I Taught GPT‑3.5 Turbo to Solve Codeforces Problems
I investigated how far GPT‑3.5 Turbo can go when asked to solve competitive programming problems from Codeforces in Rust. I treated it as a research exercise in prompt engineering and in building an "LLM‑as‑coder" workflow, rather than simply firing prompts at a model. My key takeaways and results are below.
Setting the Stage
Codeforces problems are tough for LLMs because they require careful reading, algorithm design, and precise coding under small time and memory limits. To level the playing field, I built a Rust‑based evaluation harness and a Python orchestration layer called solve_code_contests_rust.py. The script manages everything from prompting to compilation, execution, and self‑debugging.
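To make the execution step concrete, here is a minimal sketch of the compile‑and‑run stage, assuming `rustc` is on the PATH; the function name, return shape, and two‑second limit are illustrative rather than the actual harness code.

```python
import subprocess
import tempfile
from pathlib import Path

def compile_and_run(rust_source: str, stdin_data: str, time_limit_s: float = 2.0):
    """Compile a candidate Rust solution and run it on one test input.

    Returns (ok, stdout, diagnostics); the diagnostics string is what gets
    fed back to the model during self-debugging.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "main.rs"
        binary = Path(tmp) / "main"
        src.write_text(rust_source)

        # Compile with optimizations and capture compiler errors.
        build = subprocess.run(
            ["rustc", "-O", str(src), "-o", str(binary)],
            capture_output=True, text=True,
        )
        if build.returncode != 0:
            return False, "", build.stderr

        # Run against one test case under a wall-clock time limit.
        try:
            run = subprocess.run(
                [str(binary)], input=stdin_data,
                capture_output=True, text=True, timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return False, "", "time limit exceeded"

        return run.returncode == 0, run.stdout, run.stderr
```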
The challenge: can GPT‑3.5 Turbo autonomously read a problem, design a solution plan, implement it in Rust, correct compilation or runtime errors, and pass the official tests? I took on this challenge as part of the RE-Bench evaluation suite, where the reference automated agent achieves at most about 11% on this task (as reported on the challenge page), versus 21% for my workflow (see below).
Building the Pipeline
Instead of a single call to the model, I created a multi‑stage pipeline inspired by best practices in the field.
Each problem becomes a Task object that travels through the following pipeline (a minimal sketch follows the list):
- Ideation – generate several possible high‑level solution ideas (think “dynamic programming”, “greedy solution”, etc.).
- Critique & Implementation – pick and refine an idea, then implement it in Rust.
- Self‑Debugging – compile and execute; if it fails, analyze compiler/runtime errors and re‑prompt for fixes.
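Wired together, the three stages can look roughly like this, reusing the `compile_and_run` helper sketched earlier; the `Task` fields, prompt wording, and three-round debug budget are illustrative, not the actual script.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    statement: str                      # full problem statement
    sample_input: str = ""
    sample_output: str = ""
    ideas: list[str] = field(default_factory=list)
    plan: str = ""
    rust_code: str = ""
    passed: bool = False

def run_pipeline(task: Task, llm: Callable[[str], str], max_debug_rounds: int = 3) -> Task:
    # 1. Ideation: several high-level approaches, one per line.
    task.ideas = llm(
        f"List three distinct high-level approaches to this problem:\n{task.statement}"
    ).splitlines()

    # 2. Critique & implementation: pick and refine one idea, then emit Rust.
    ideas_text = "\n".join(task.ideas)
    task.plan = llm(
        "Critique these ideas, pick the most promising one, and refine it into "
        f"a concrete plan:\n{ideas_text}"
    )
    task.rust_code = llm(
        f"Implement this plan as a complete Rust program reading from stdin:\n{task.plan}"
    )

    # 3. Self-debugging: compile and run; on failure, feed the diagnostics back.
    for _ in range(max_debug_rounds):
        ok, stdout, diagnostics = compile_and_run(task.rust_code, task.sample_input)
        if ok and stdout.strip() == task.sample_output.strip():
            task.passed = True
            break
        task.rust_code = llm(
            f"This Rust program failed.\nCode:\n{task.rust_code}\n"
            f"Diagnostics:\n{diagnostics}\nReturn a corrected full program."
        )
    return task
```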
A separate log file is generated for every task so that every LLM request–response pair can be inspected later. Quick feedback cycles are crucial when developing a workflow like this, and per-task logs are what made them possible.
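A per-task log can be as simple as wrapping the LLM call so every request/response pair is appended to a JSONL file; this is a sketch of the idea, not the exact format I used.

```python
import json
import time
from pathlib import Path

def make_logged_llm(llm, log_path: Path):
    """Wrap an LLM callable so each request/response pair is appended as JSONL."""
    def logged(prompt: str) -> str:
        response = llm(prompt)
        with log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"ts": time.time(), "prompt": prompt,
                                "response": response}) + "\n")
        return response
    return logged

# One file per task, e.g. logs/task_0042.jsonl (path and naming are illustrative).
```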
To give the model extra context, I experimented with fetching official editorials using Firecrawl’s web‑scraping API. When available, those explanations became “golden hints” that GPT could analyze and translate directly into code.
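The fetch itself is a single scrape call. This sketch assumes the `firecrawl-py` client; the method name and response shape vary between SDK versions, so treat it as an approximation.

```python
from firecrawl import FirecrawlApp  # firecrawl-py; API surface differs across versions

def fetch_editorial(editorial_url: str, api_key: str) -> str | None:
    """Scrape an editorial page and return its markdown, or None on failure."""
    app = FirecrawlApp(api_key=api_key)
    try:
        result = app.scrape_url(editorial_url)
    except Exception:
        # 403s, credit limits, and plain network errors were all common in practice.
        return None
    # Depending on the SDK version, markdown is an attribute or a dict key.
    if isinstance(result, dict):
        return result.get("markdown")
    return getattr(result, "markdown", None)
```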
Iterating the Prompting Strategy
Early baseline runs achieved almost nothing: success hovered around 4% (7 out of 165 problems). Adding chain‑of‑thought scratchpads, few‑shot examples, and limited self‑consistency sampling barely moved the needle.
The breakthrough came from structured ideation and self‑debugging. When GPT had to first describe and then critique its plan before coding, it produced cleaner algorithms and more syntactically valid Rust. For this stage I leaned on an excellent survey of prompting techniques. Adding Firecrawl editorials pushed accuracy sharply upward, crossing 20% of problems solved. That is still far from human performance, but impressive for an automated setup running on pure text prompts.
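When an editorial was available, it was spliced into the implementation prompt as a hint. The snippet below is a rough sketch of that prompt assembly; the wording and function name are illustrative, not my exact template.

```python
def build_implementation_prompt(statement: str, plan: str, editorial_md: str | None) -> str:
    """Assemble the coding prompt, optionally including a scraped editorial as a hint."""
    parts = [
        "Write a complete Rust program that reads from stdin and writes to stdout.",
        f"Problem statement:\n{statement}",
        f"Refined plan:\n{plan}",
    ]
    if editorial_md:
        parts.append(
            "Official editorial (use it as a hint and translate the intended "
            f"algorithm into idiomatic Rust):\n{editorial_md}"
        )
    parts.append("Return only the final Rust code in a single fenced code block.")
    return "\n\n".join(parts)
```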
Timeline
- Day 1 – Brought the base runner online, recorded the initial 7/165 baseline, and identified the lack of telemetry as the main blocker to progress: I needed short feedback loops.
- Day 2 – Iterated on few-shot CoT prompting and a simple self-consistency loop, nudging performance up to eight solves while realizing how fragile the gains were.
- Day 3 – Refactored everything into a Task/Event pipeline, wrestled with score harness timeouts that contradicted my local logs, and began chasing Codeforces editorials despite running into HTTP 403s.
- Day 4 – Patched in Firecrawl for editorial fetching, hit credit and network limits that tanked the score back to three solves, and dove deeper into prompt and retry logic.
- Day 5 – Cleaned up the pipeline, tightened prompts, and added cache layers; persistent API connection errors and evaluator timeouts still made results unpredictable.
- Day 6 – Stabilized the editorial path, taught the agent to grant extra compile-error retries, and finally broke through to 36/165 solves despite recurring API failures.
Practical Lessons
- Pipeline, not prompts. One long instruction is brittle; stages with memory and feedback loops work better.
- Clear structure. Separating prompt templates, code-block extraction, and answer inducers into distinct pieces keeps each stage easy to change and test (see the sketch after this list).
- Access to editorials. With reliable editorials or code snippets, GPT‑3.5 can almost “compile by translation.”
- Operational friction. Latency and API fragility quickly dominate; timeouts, connection errors, and rate limits slow every iteration.
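To illustrate the "clear structure" point: keeping the template, the answer inducer, and the code-block extraction as separate pieces made each one easy to tweak independently. A minimal, illustrative sketch (names and wording are mine, not the script's):

```python
import re

# Template and answer inducer live apart from the code that fills them in,
# e.g. IDEATION_TEMPLATE.format(statement=task.statement).
IDEATION_TEMPLATE = (
    "Problem:\n{statement}\n\n"
    "List three distinct algorithmic approaches, one per line."  # answer inducer
)

RUST_BLOCK_RE = re.compile(r"```rust\s*(.*?)```", re.DOTALL)

def extract_rust(response: str) -> str:
    """Pull the first fenced Rust block out of a model response, else return it whole."""
    match = RUST_BLOCK_RE.search(response)
    return match.group(1).strip() if match else response.strip()
```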
Final Outcome
After dozens of iterations over a few days, the system peaked at 36 successes out of 165 (≈21%), a five‑fold improvement over the naive baseline and roughly double what the reference automated agent could achieve. Every successful case logged the full reasoning chain, providing valuable material for future prompt tuning and automatic-grading research.
See the code in this archive; the password is “messy-red-bottle-ship”.