Blog · arXiv Analysis · Last reviewed July 2, 2026

The Python Score Becomes the Multilingual Trap

Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, and Dmitrii Babaev's Multi-LCB paper is useful because it tests whether a code model's Python score survives contact with the rest of software.

For this essay, a multilingual benchmark receipt is the record that binds a task, language, compiler, prompt, model, cutoff date, sampling setting, hidden tests, pass result, contamination window, and failure mode into one auditable coding-evaluation event.

The Claim

The paper, arXiv:2606.20517 [cs.AI, cs.PL], was submitted on June 18, 2026 and listed with the comment "ICLR 2026." It introduces Multi-LCB, a multilingual extension of LiveCodeBench for code-generation evaluation.

LiveCodeBench matters because it continuously adds competitive-programming tasks and filters by release date, making it harder for old benchmark contamination to masquerade as coding ability. The limitation is that LCB is Python-only. Multi-LCB asks whether a Python result is a reliable proxy for coding in C++, Java, Go, Rust, JavaScript, TypeScript, C#, Ruby, PHP, Kotlin, and Scala, with Python kept in the set as the baseline.

The answer is no. The paper reports evidence of Python overfitting, language-specific contamination, and large cross-language performance gaps across 24 publicly available instruction and reasoning models.

Benchmark Construction

Multi-LCB keeps the LiveCodeBench task pool, release-date metadata, and hidden-test evaluation style. It loads LCB code-generation problems from Hugging Face and preserves tasks from LeetCode, AtCoder, and Codeforces.

The central engineering move is to unify evaluation around STDIN/STDOUT. AtCoder and Codeforces tasks already use this form. LeetCode's functional tasks are converted so examples and hidden tests can be evaluated through the same input/output harness across languages.
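To make the conversion concrete, here is a minimal sketch of wrapping a function-style task so it runs as a standard-input program. It is not the paper's converter. The one-JSON-value-per-line input format, the `two_sum` example task, and the `run_as_stdin_program` helper are all assumptions made for illustration.

```python
import io
import json


def two_sum(nums, target):
    """Example function-style solution for a hypothetical task."""
    seen = {}
    for i, value in enumerate(nums):
        if target - value in seen:
            return [seen[target - value], i]
        seen[value] = i
    return []


def run_as_stdin_program(func, stdin, stdout):
    """Read one JSON-encoded argument per line, call func, write a JSON result.

    The line-per-argument JSON format is an assumption for illustration,
    not the format Multi-LCB actually uses.
    """
    args = [json.loads(line) for line in stdin if line.strip()]
    result = func(*args)
    stdout.write(json.dumps(result) + "\n")


if __name__ == "__main__":
    fake_stdin = io.StringIO("[2, 7, 11, 15]\n9\n")
    fake_stdout = io.StringIO()
    run_as_stdin_program(two_sum, fake_stdin, fake_stdout)
    print(fake_stdout.getvalue(), end="")  # prints [0, 1]
```

Even in this toy form, the wrapper shows where conversion risk lives: every target language has to parse the same serialized arguments, and parsing a list is much less work in Python than in a language with explicit, typed input handling.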

The authors report manual inspection of about 500 tasks and say they found no cases where language-dependent features introduced inconsistencies. That does not make conversion risk disappear, but it makes the benchmark more disciplined than a pile of hand-translated function signatures.

Each model receives a zero-shot prompt that names the target language and asks for a complete program reading from standard input and writing to standard output. Correctness is Pass@1: the first generated solution must compile or interpret successfully and pass all hidden official tests without runtime errors or timeouts.
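The pass rule described above can be sketched as a small checker. This is a hedged illustration, not the Multi-LCB runner: the `passes_all_tests` name, the five-second default timeout, and the whitespace-insensitive output comparison are assumptions, and the compile step for compiled languages is left out.

```python
import subprocess
import sys


def passes_all_tests(command, tests, timeout_s=5.0):
    """Return True only if every (stdin, expected_stdout) pair passes.

    command is the argv list that runs the candidate program.
    A timeout, a nonzero exit code, or mismatched output fails the task.
    Comparing whitespace-split tokens is an assumption for illustration.
    """
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                command,
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0:
            return False
        if proc.stdout.split() != expected.split():
            return False
    return True


if __name__ == "__main__":
    # Hypothetical candidate: sums the integers on the first input line.
    candidate = [sys.executable, "-c", "print(sum(map(int, input().split())))"]
    print(passes_all_tests(candidate, [("1 2\n", "3\n"), ("10 20\n", "30\n")]))
```

The all-or-nothing return value is the point. One failed hidden test, one crash, or one timeout turns the whole task into a failure.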

Language Coverage

The twelve languages are C++, C#, Python, Java, Rust, Go, TypeScript, JavaScript, Ruby, PHP, Kotlin, and Scala. The paper frames this set as a compromise among popularity, stable infrastructure, and programming-paradigm diversity.

That diversity matters. The benchmark spans compiled or JIT languages, interpreted languages, a transpiled language, static and dynamic typing, native runtimes, JVM and .NET targets, JavaScript engines, garbage-collected systems, C++'s manual or RAII discipline, and Rust's ownership model.

This is exactly where a Python-only score can mislead. A model may solve algorithmic problems in Python while failing on imports, type signatures, ownership, input parsing, compiler errors, or resource limits elsewhere.

The Results

The main table evaluates 24 recent public models on tasks released from February 2025 through May 2025, using temperature 0.2, top-p 0.95, and Pass@1 averaged over 10 runs. The paper says restricting the pool to this recent release window is intended to reduce training-data leakage risk.
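One plausible reading of "Pass@1 averaged over 10 runs" is the per-run pass rate averaged across independent sampling runs. The sketch below assumes that reading. The `pass_at_1` helper and the toy results are illustrative, not taken from the paper.

```python
from statistics import mean


def pass_at_1(run_results):
    """Average the per-run pass rate across runs.

    run_results is a list of runs; each run is a list of booleans,
    one per task, where True means the single sample passed all tests.
    """
    per_run = [sum(run) / len(run) for run in run_results]
    return mean(per_run)


if __name__ == "__main__":
    # Two toy runs over four hypothetical tasks.
    runs = [
        [True, False, True, True],   # 0.75
        [True, False, False, True],  # 0.50
    ]
    print(pass_at_1(runs))  # prints 0.625
```

Averaging over runs matters at temperature 0.2 because sampling is not deterministic, so a single run can land above or below the model's typical pass rate.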

The top average scores on the 12-language set are still far from saturation: GPT-OSS-120B* (Medium) at 67.8% average Pass@1, Qwen3-235B-A22B-Thk-2507* at 64.0%, and DeepSeek-R1-0528* at 63.1%. The next tiers fall quickly, and the paper notes that most evaluated models remain below 40% average Pass@1.

The language gap is the finding. Python has the highest mean Pass@1 at 0.482 (48.2%). Java and C++ sit near 0.44. C#, Ruby, PHP, Go, Rust, Kotlin, JavaScript, and TypeScript form a middle tier between roughly 0.33 and 0.39. Scala trails below 0.29.

The paper also compares reproduced Python results against official LCB results and reports a mean absolute deviation of about 3 percentage points, with Qwen3-235B-A22B-Thinking-2507 at 74.0% versus 74.1% on the original v6 leaderboard. That check matters because it argues that Multi-LCB's Python path is not artificially harder than ordinary LCB.
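The reproduction check is simple arithmetic, sketched here under assumptions. Only the 74.0 versus 74.1 pair comes from the text above. The other pairs are made-up placeholders, and `mean_absolute_deviation` is an illustrative helper, not code from the paper.

```python
def mean_absolute_deviation(pairs):
    """Mean of |reproduced - official| over score pairs, in percentage points."""
    if not pairs:
        raise ValueError("need at least one score pair")
    return sum(abs(reproduced - official) for reproduced, official in pairs) / len(pairs)


if __name__ == "__main__":
    pairs = [
        (74.0, 74.1),  # reported pair for Qwen3-235B-A22B-Thinking-2507
        (50.0, 54.0),  # hypothetical placeholder, not from the paper
        (40.0, 35.0),  # hypothetical placeholder, not from the paper
    ]
    print(round(mean_absolute_deviation(pairs), 2))
```

Using the absolute difference keeps positive and negative deltas from cancelling, which is why a mean near 3 points can coexist with one model reproducing almost exactly.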

Contamination

Multi-LCB inherits LCB's release-date discipline, but the paper does not pretend date filtering solves contamination. Its time-wise analysis finds higher scores on older pre-cutoff problems and step-like drops when evaluation crosses model cutoffs.
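A time-wise check of this kind can be sketched by splitting results at a model's assumed cutoff and comparing pass rates on each side. The `split_by_cutoff` helper, the cutoff date, and the toy results below are assumptions for illustration, not the paper's analysis code or data.

```python
from datetime import date


def split_by_cutoff(results, cutoff):
    """Return (pre_cutoff_rate, post_cutoff_rate) for one model.

    results is a list of (task_release_date, passed) pairs.
    A side with no tasks yields None instead of a rate.
    """
    pre = [passed for released, passed in results if released <= cutoff]
    post = [passed for released, passed in results if released > cutoff]

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else None

    return rate(pre), rate(post)


if __name__ == "__main__":
    # Hypothetical results for a model with an assumed cutoff of 2025-01-31.
    results = [
        (date(2024, 11, 5), True),
        (date(2024, 12, 9), True),
        (date(2025, 1, 12), True),
        (date(2025, 1, 30), False),
        (date(2025, 2, 14), True),
        (date(2025, 3, 2), False),
        (date(2025, 4, 21), False),
        (date(2025, 5, 8), False),
    ]
    print(split_by_cutoff(results, date(2025, 1, 31)))  # prints (0.75, 0.25)
```

A large gap between the two rates does not prove contamination on its own, since task difficulty can also drift over time, but it is the kind of step the paper's analysis looks for.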

This is the right posture. A live benchmark is not magic. It is a moving window with a contamination hypothesis attached. The benchmark becomes more useful when it preserves task dates, model cutoff assumptions, and versioned result windows instead of compressing everything into one timeless leaderboard number.

For code agents, that distinction is essential. A model that solved old contest tasks because similar solutions were in training data has not proven it can maintain a new production codebase, debug a live incident, or port a library across a language ecosystem.

What the Score Hides

Pass@1 is a hard and useful test, but it hides the failure anatomy unless the run preserves error types. The appendix reports that wrong-answer errors dominate across languages, while compiled languages show more compiler and type-related failures. Java, C#, and Go see more runtime exceptions tied to explicit input parsing. Timeout and resource failures appear more often in slower languages and reasoning-tuned models.

Those categories are not trivia. They tell different deployment stories. A wrong-answer failure points to algorithmic misunderstanding. A compiler error points to syntax, typing, library, or boilerplate weakness. A timeout points to inefficiency. An input-parsing failure points to interface discipline.
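Preserving those categories requires only a small amount of bookkeeping per run. The sketch below is one possible taxonomy, not the paper's. The `FailureClass` names, the `classify` helper, and the order in which signals are checked are assumptions.

```python
from enum import Enum


class FailureClass(Enum):
    PASSED = "passed"
    COMPILE_ERROR = "compile_error"
    TIMEOUT = "timeout"
    RUNTIME_ERROR = "runtime_error"
    WRONG_ANSWER = "wrong_answer"


def classify(compiled, timed_out, exit_code, output_matches):
    """Map raw run signals to a single failure class.

    Checks run in pipeline order: build, time limit, exit status, output.
    """
    if not compiled:
        return FailureClass.COMPILE_ERROR
    if timed_out:
        return FailureClass.TIMEOUT
    if exit_code != 0:
        return FailureClass.RUNTIME_ERROR
    if not output_matches:
        return FailureClass.WRONG_ANSWER
    return FailureClass.PASSED


if __name__ == "__main__":
    print(classify(True, False, 1, False).value)  # prints runtime_error
    print(classify(True, False, 0, False).value)  # prints wrong_answer
```

Input-parsing failures do not get their own class here. They would surface as runtime errors or wrong answers, and separating them would need stderr inspection, which this sketch leaves out.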

A serious coding-agent evaluation should not stop at "failed." It should say whether the model reasoned wrong, wrote invalid code, mishandled the runtime, misunderstood I/O, or used too much time.

Governance Reading

The Spiralist reading is that a Python leaderboard can become an institutional hallucination. It looks like a coding score, but it may really be a Python score, a contest-task score, a cutoff-window score, or a prompt-harness score.

Multi-LCB makes that ambiguity inspectable. It does not prove that a model is a competent software engineer. It proves that coding evaluation must name the language, runtime, task family, release window, hidden tests, execution harness, and contamination policy before anyone turns a benchmark score into procurement, deployment, or staffing evidence.

For real engineering work, the benchmark should sit beside repository maintenance tasks, dependency upgrades, test repair, code review, security fixes, documentation updates, and incident-driven debugging. A contest program is one useful slice. It is not the whole job.

Benchmark Receipts

A multilingual benchmark receipt should include the task ID, source platform, release date, original LCB version, converted prompt, language target, compiler or interpreter version, sandbox limits, model checkpoint, cutoff estimate, sampling settings, number of runs, Pass@1 result, stderr/stdout trace, timeout status, and failure class.
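As a data structure, that receipt might look like the record below. This is a minimal sketch: the `RunReceipt` class, its field names, and every example value are hypothetical, chosen only to mirror the fields listed above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RunReceipt:
    task_id: str
    source_platform: str
    release_date: date
    lcb_version: str
    converted_prompt: str
    language: str
    toolchain_version: str          # compiler or interpreter version
    sandbox_limits: dict            # e.g. time and memory limits
    model_checkpoint: str
    cutoff_estimate: Optional[date]
    sampling: dict                  # e.g. temperature and top-p
    num_runs: int
    pass_at_1: float
    stdout: str
    stderr: str
    timed_out: bool
    failure_class: str

    def to_json(self):
        """Serialize to JSON; dates are written as ISO strings."""
        return json.dumps(asdict(self), default=str, sort_keys=True)


if __name__ == "__main__":
    # All values below are placeholders, not results from the paper.
    receipt = RunReceipt(
        task_id="example-task-001",
        source_platform="AtCoder",
        release_date=date(2025, 3, 1),
        lcb_version="v6",
        converted_prompt="Write a complete Rust program that reads from standard input.",
        language="Rust",
        toolchain_version="unknown",
        sandbox_limits={"time_s": 5, "memory_mb": 512},
        model_checkpoint="example-model",
        cutoff_estimate=None,
        sampling={"temperature": 0.2, "top_p": 0.95},
        num_runs=10,
        pass_at_1=0.0,
        stdout="",
        stderr="",
        timed_out=True,
        failure_class="timeout",
    )
    print(receipt.to_json())
```

Freezing the dataclass is a deliberate choice for an audit record: a receipt that can be edited after the run is not much of a receipt.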

The leaderboard receipt should publish the aggregation window, language weights, per-language scores, excluded tasks, failed conversions, reproduction deltas against LCB, and any known contamination risk. Averages should never travel without language-level detail.
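The rule that averages should travel with their detail can be enforced in the aggregation step itself. The sketch below assumes a hypothetical `leaderboard_entry` helper and made-up scores. It returns the weighted average together with the weights, the per-language scores, and the best-to-worst spread.

```python
def leaderboard_entry(per_language, weights=None):
    """Build one leaderboard row that keeps language-level detail.

    per_language maps language name to Pass@1 in [0, 1].
    weights maps language name to a nonnegative weight (default: equal).
    """
    if not per_language:
        raise ValueError("need at least one language score")
    if weights is None:
        weights = {lang: 1.0 for lang in per_language}
    total_weight = sum(weights[lang] for lang in per_language)
    average = sum(
        score * weights[lang] for lang, score in per_language.items()
    ) / total_weight
    return {
        "average": average,
        "spread": max(per_language.values()) - min(per_language.values()),
        "weights": {lang: weights[lang] for lang in per_language},
        "per_language": dict(per_language),
    }


if __name__ == "__main__":
    # Hypothetical scores, not figures from the paper.
    scores = {"Python": 0.6, "Java": 0.5, "Scala": 0.1}
    entry = leaderboard_entry(scores)
    print(round(entry["average"], 3), round(entry["spread"], 3))
```

With these toy numbers the average sits well above the weakest language, which is exactly the information a bare average would hide from someone deploying into Scala.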

The deployment receipt should map benchmark slices to the actual target work. If the tool will edit Java services, Rust crates, TypeScript frontends, PHP backends, or C# enterprise code, the evaluation should prove competence in those environments, not only in Python.

Limits

The paper's own limitations are important. Multi-LCB covers 12 languages but not Swift, Haskell, R, Julia, and other specialized or emerging languages. Its selection follows 2025 popularity rankings and infrastructure feasibility, not every domain's needs.

The tasks are still competitive-programming tasks. They do not directly test API integration, legacy debugging, collaborative development, dependency management, package ecosystems, security remediation, pull-request review, or long-lived repository maintenance.

The strict STDIN/STDOUT protocol may introduce failures from format compliance, input parsing, or output specification rather than pure algorithmic reasoning. Automatic functional-to-STDIN/STDOUT conversion may also change task complexity unevenly across languages.

The safe reading is: Multi-LCB is a better coding benchmark receipt than Python-only LCB for multilingual competence, but it is still a benchmark of contest-style programs under a particular execution harness.

Source Discipline

This page treats the arXiv abstract, arXiv HTML, PDF, and the linked GitHub repository as the source set. The PDF was used for language coverage, model count, experimental protocol, headline results, contamination discussion, limitations, artifact licensing, runtime versions, and execution-cost details.

The GitHub repository was used to verify that the artifact is public, lists the 12 supported languages, provides installation and run instructions, links the leaderboard, and exposes code, tests, requirements, and runner scripts. I did not independently rerun Multi-LCB.