SciCode Bench

About SciCode

SciCode is a scientist-curated benchmark of scientific programming tasks in Python. It provides 65 main problems with gold-standard solutions spanning Chemistry, Materials Science, Biology, Math, and Physics. Each main problem is further divided into sub-problems, all of which must be solved correctly for the main problem to count as solved. A dataset is included for verifying calculations.
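
The all-sub-problems-must-pass rule can be illustrated with a short sketch. This is not the official SciCode grading harness, just a minimal illustration of the scoring criterion; the function and data names are hypothetical.

```python
# Minimal sketch of the scoring rule described above (not the official harness):
# a main problem counts as solved only if *all* of its sub-problems pass.
from typing import Dict, List

def main_problem_solved(subproblem_passed: List[bool]) -> bool:
    """A main problem is correct only when every sub-problem is correct."""
    return all(subproblem_passed)

def percent_correct(results: Dict[str, List[bool]]) -> float:
    """Fraction of main problems solved, given per-sub-problem pass/fail lists."""
    solved = sum(main_problem_solved(subs) for subs in results.values())
    return solved / len(results)

# Two hypothetical main problems: one fails a single sub-problem, one passes all.
example = {
    "problem_01": [True, True, False],  # one failed sub-problem -> not solved
    "problem_02": [True, True],         # all sub-problems pass  -> solved
}
print(f"{percent_correct(example):.1%}")  # 50.0%
```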

We run this benchmark with “background” enabled, meaning the model is given a detailed description of the mathematics and theory behind the problem to be solved. Enabling background is not the “standard setup” (per the cited paper); however, this choice gives smaller models a better chance to compete by shifting the emphasis from general knowledge of scientific concepts toward scientific instruction following.
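
As a rough sketch of what the background setting changes, the prompt for each sub-problem either includes or omits the curated theory text. The field and function names below are illustrative assumptions, not the SciCode harness API.

```python
# Illustrative sketch only: with "background" enabled, the curated math/theory
# text is prepended to the coding instructions for each sub-problem.
from dataclasses import dataclass

@dataclass
class SubProblem:
    instructions: str  # what the model is asked to implement
    background: str    # detailed mathematics/theory supplied with the problem

def build_prompt(sub: SubProblem, with_background: bool) -> str:
    parts = []
    if with_background:
        parts.append("## Background\n" + sub.background)
    parts.append("## Task\n" + sub.instructions)
    return "\n\n".join(parts)

sub = SubProblem(
    instructions="Implement the requested function as specified.",
    background="Derivation of the governing equation and discretization scheme ...",
)
print(build_prompt(sub, with_background=True))
```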

Results

Note: These findings are based on internal research and do not constitute an endorsement of any model for any specific purpose. This list will be updated frequently - please check back here for updates.

SciCode Benchmark

| Model | % Correct | Cost | Note | Config | Date |
| --- | --- | --- | --- | --- | --- |
| gemini-3-flash-high | 24.6% | $9.77 | | Gemini 3 Flash reasoning_effort=“high” | 2026-01-23 |
| claude-opus-4-5-high | 24.6% | $26.37 | | Opus 4.5 reasoning_effort=“high” | 2025-12-02 |
| gemini-3.0-pro-preview | 21.5% | $27.63 | | Gemini 3.0 Pro Preview | 2025-12-02 |
| gemini-3-flash | 18.5% | $0.28 | | Gemini 3 Flash | 2026-01-23 |
| gpt-5-1-codex-max | 18.5% | $7.72 | | OpenAI gpt-5-1-codex-max | 2026-01-23 |
| claude-opus-4-5 | 18.5% | $10.78 | | Opus 4.5 | 2025-12-02 |
| gemini-2.5-pro-high | 16.9% | $12.60 | | Gemini 2.5 Pro reasoning_effort=“high” | 2025-08-22 |
| gpt-5-high | 16.9% | $16.80 | | OpenAI gpt-5 reasoning_effort=“high” | 2025-09-26 |
| gpt-5.2 | 15.4% | $3.87 | | OpenAI GPT 5.2 | 2026-01-23 |
| glm-4.7 | 15.4% | $7.97 | ⚠️ | GLM 4.7 on GCP Vertex AI | 2026-01-23 |
| claude-sonnet-4-5-high | 15.4% | $11.25 | | Sonnet 4.5 reasoning_budget=8192 | 2025-09-30 |
| claude-sonnet-4-5 | 13.8% | $5.47 | | Sonnet 4.5 | 2025-09-30 |
| gpt-5-codex | 13.8% | $14.35 | | OpenAI gpt-5-codex | 2025-09-26 |
| claude-sonnet-4-high | 13.8% | $10.49 | | Sonnet 4.0 reasoning_budget=8192 | 2025-08-22 |
| gemini-2.5-pro | 13.8% | $25.33 | | Gemini 2.5 Pro | 2025-08-22 |
| gpt-5-1-codex | 13.8% | >$40 | 😢 | OpenAI gpt-5-1-codex (unable to complete) | 2025-12-02 |
| grok-4-0709 | 13.8% | $7.13 | 😢 | xAI Grok 4 0709 (unable to complete) | 2025-08-24 |
| gpt-5-nano-high | 12.3% | $1.09 | | OpenAI gpt-5-nano reasoning_effort=“high” | 2025-09-26 |
| qwen-3-coder | 12.3% | $1.50 | ⚠️ | qwen3-coder-480b-a35b-instruct | 2025-08-22 |
| kimi-k2-thinking | 12.3% | $4.85 | ⚠️ | Kimi K2 Thinking | 2025-08-22 |
| claude-sonnet-4 | 12.3% | $5.32 | | Sonnet 4.0 reasoning=false | 2025-08-22 |
| claude-opus-4-1 | 12.3% | $28.85 | | Opus 4.1 reasoning=false | 2025-08-22 |
| claude-opus-4-1-high | 12.3% | $45.00 | | Opus 4.1 reasoning_budget=8192 | 2025-08-22 |
| deepseek-3.2 | 10.8% | $0.76 | ⚠️ | Deepseek 3.2 on GCP Vertex AI | 2026-01-23 |
| gpt-oss-120b-high | 10.8% | $0.36 | | gpt-oss-120b reasoning_effort=“high” | 2025-08-24 |
| grok-3-mini | 10.8% | $0.94 | | xAI Grok 3 Mini | 2025-08-29 |
| grok-code-fast-1 | 10.8% | $2.09 | | xAI Grok Code Fast 1 | 2025-08-29 |
| gpt-5-mini-high | 10.8% | $3.49 | | OpenAI gpt-5-mini reasoning_effort=“high” | 2025-09-26 |
| o3-high | 10.8% | $6.14 | | OpenAI o3 reasoning_effort=“high” | 2025-08-22 |
| gpt-oss-20b-high | 9.3% | $0.19 | | gpt-oss-20b reasoning_effort=“high” | 2025-08-24 |
| o4-mini | 9.3% | $3.21 | | OpenAI o4-mini reasoning_effort=“high” | 2025-08-22 |
| haiku-4-5-high | 9.3% | $4.25 | | Anthropic Haiku 4.5 reasoning_effort=“high” | 2025-10-17 |
| llama-4-maverick | 7.6% | $0.24 | | Vertex AI Llama 4 Maverick | 2025-08-24 |
| gpt-oss-120b | 7.6% | $0.35 | | gpt-oss-120b | 2025-08-22 |
| gpt-5-mini | 7.6% | $1.20 | | OpenAI gpt-5-mini | 2025-08-22 |
| haiku-4-5 | 7.6% | $1.99 | | Anthropic Haiku 4.5 | 2025-10-17 |
| gemini-2.5-flash-high | 7.6% | $3.12 | | Gemini 2.5 Flash reasoning_effort=“high” | 2025-08-22 |
| grok-3 | 7.6% | $3.96 | | xAI Grok 3 | 2025-08-23 |
| gpt-5 | 7.6% | $9.60 | | OpenAI gpt-5 non-reasoning | 2025-08-22 |
| gpt-oss-20b | 6.1% | $0.14 | | gpt-oss-20b | 2025-08-22 |
| gpt-5-nano | 6.1% | $0.47 | | OpenAI gpt-5-nano | 2025-08-22 |
| deepseek-r1-0528 | 6.1% | $10.00 | ⛔ ⚠️ | DeepSeek R1-0528 on GCP Vertex AI | 2025-08-23 |
| haiku-3-5 | 3.0% | $1.72 | | Claude 3.5 Haiku | 2025-08-22 |
| codestral | 1.5% | $0.21 | | Mistral AI Codestral | 2025-08-22 |
| llama-4-scout | 0.0% | $0.24 | | Vertex AI Llama 4 Scout | 2025-08-22 |

Discontinued Models

| Model | % Correct | Cost | Note | Config | Date |
| --- | --- | --- | --- | --- | --- |
| azure/deepseek-r1 | 10.8% | $10.00 | ⛔ ⚠️ | Azure Foundry MSAI-DS-R1 | 2025-08-23 |

Explanation of Notes

✅: Pareto-frontier optimal model (best cost-performance ratio in its performance tier - green dashed line in plot).

⛔: Unusually expensive for performance tier. Consider alternatives.

😢: Model was unable to complete benchmark (e.g. infinite loop in reasoning stage). Results are estimated.

⚠️: May contain censorship and/or have weak guardrails compared to alternatives. Use with caution.

Current CBorg Model Mappings

lbl/cborg-coder: gpt-oss-120b-high

lbl/cborg-deepthought: gpt-oss-120b-high

lbl/cborg-mini: gpt-oss-20b-high

lbl/cborg-chat: llama-4-scout
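
Since these aliases resolve to the models above, a quick way to exercise them is through CBorg's OpenAI-compatible API, as in the sketch below. The base URL and environment variable name are assumptions; consult the CBorg documentation for the current endpoint and key handling.

```python
# Hedged sketch: CBorg exposes an OpenAI-compatible endpoint, so the aliases
# above (e.g. lbl/cborg-coder) can be passed as the model name.
# The base URL and env var name are assumptions; check the CBorg docs.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CBORG_API_KEY"],   # assumed environment variable
    base_url="https://api.cborg.lbl.gov",  # assumed endpoint
)

response = client.chat.completions.create(
    model="lbl/cborg-coder",  # currently mapped to gpt-oss-120b-high (see list above)
    messages=[{"role": "user", "content": "Write a Python function for Simpson's rule."}],
)
print(response.choices[0].message.content)
```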

Changelog

Dec 2nd, 2025

  • Added results for Claude Opus 4.5 and Gemini 3.0 Pro Preview

Sept 30th, 2025

  • Added results for Claude Sonnet 4.5 standard and reasoning_effort=high
  • Updated green checkmarks to indicate Pareto-frontier models only

Sept 26th, 2025

  • Added results for gpt-5 variants with reasoning_effort=“high” configuration
  • gpt-5-high tied with gemini-2.5-pro-high for highest score
  • Added gpt-5-codex result

Aug 29th, 2025

  • Re-calculated Grok 3 Mini costs based on xAI API costs (approx 50% cheaper than Azure Foundry costs)
  • Added results for new Grok Code Fast 1 model

Aug 24, 2025

  • Corrected error in gpt-oss benchmarks with -high setting and updated scores
  • Added xAI Grok 4 0709 - Model was unable to complete 4 problem sets after 2 hours of retries.
  • Added Meta Llama 4 models.

Aug 23, 2025

  • Added Azure MSAI-DS-R1 (DeepSeek R1 post-trained by Microsoft AI)
  • Re-ran Qwen 3 Coder benchmark to verify cost (no change)
  • Added xAI Grok 3. Grok 3 Mini was unable to complete the benchmark; will try again
  • Added DeepSeek R1 0528 from Vertex AI