SciCode Bench

About SciCode

SciCode is a scientist-curated benchmark of scientific programming tasks in Python. It provides 65 main problems with gold-standard solutions covering Chemistry, Materials Science, Biology, Math, and Physics. Each main problem is divided into sub-problems, all of which must be solved correctly for the main problem to count as solved. A dataset of reference values is included for verifying calculations.
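
As a rough illustration of the grading scheme, the sketch below checks one hypothetical sub-problem solution against stored gold values. The function names, data, and tolerance here are illustrative assumptions, not the benchmark's actual harness.

```python
# Minimal sketch of SciCode-style grading (hypothetical harness, not the
# official one): a candidate sub-problem solution passes only if its output
# matches the stored gold-standard result within tolerance.
import numpy as np

def candidate_solution(x: np.ndarray) -> np.ndarray:
    """Stand-in for a model-generated sub-problem function."""
    return np.sin(x) ** 2  # toy computation for illustration

def check_subproblem(inputs: np.ndarray, gold_output: np.ndarray) -> bool:
    return np.allclose(candidate_solution(inputs), gold_output, rtol=1e-5)

inputs = np.linspace(0.0, np.pi, 100)
gold = np.sin(inputs) ** 2             # would come from the verification dataset
print(check_subproblem(inputs, gold))  # a main problem scores only if all of
                                       # its sub-problems pass
```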

We run this benchmark with “background” enabled, meaning the model is given a detailed description of the mathematics and theory behind each problem. Enabling background is not the “standard setup” described in the cited paper; however, it gives small models a better chance to compete by shifting the emphasis from general knowledge of scientific concepts toward scientific instruction following.
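
Conceptually, enabling background amounts to prepending the curated theory text to each sub-problem prompt. A minimal sketch follows; the field names and prompt layout are our assumptions, not SciCode's exact format.

```python
# Hypothetical sketch of how a "background"-enabled prompt might be assembled.
# Field names are illustrative; the real SciCode harness defines its own schema.
def build_prompt(subproblem: dict, with_background: bool = True) -> str:
    parts = []
    if with_background and subproblem.get("background"):
        # Detailed math/theory for this step, so the model needs less prior
        # knowledge of the scientific concepts involved.
        parts.append("Background:\n" + subproblem["background"])
    parts.append("Task:\n" + subproblem["description"])
    parts.append("Function signature:\n" + subproblem["signature"])
    return "\n\n".join(parts)

example = {
    "background": "The damped harmonic oscillator obeys x'' + 2*g*x' + w0**2 * x = 0 ...",
    "description": "Implement a function returning x(t) for given g, w0, x0, v0.",
    "signature": "def damped_oscillator(t, g, w0, x0, v0):",
}
print(build_prompt(example, with_background=True))
```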

Results

Note: These findings are based on internal research and do not constitute an endorsement of any model for any specific purpose. This list is updated frequently; please check back for updates.

SciCode Benchmark

| Model | % Correct | Cost | Note | Config | Date |
|-------|-----------|------|------|--------|------|
| google/gemini-pro-high | 16.9% | $12.60 | ✅ 💰 | Gemini 2.5 Pro reasoning_effort="high" | 2025-08-22 |
| openai/gpt-5-high | 16.9% | $16.80 | ✅ 💰 | OpenAI gpt-5 reasoning_effort="high" | 2025-09-26 |
| openai/gpt-5-codex | 13.8% | $14.35 | ✅ 💰 | OpenAI gpt-5-codex | 2025-09-26 |
| anthropic/claude-sonnet-high | 13.8% | $10.49 | ✅ 💰 | Sonnet 4.0 reasoning_budget=8192 | 2025-08-22 |
| google/gemini-pro | 13.8% | $25.33 | | Gemini 2.5 Pro | 2025-08-22 |
| xai/grok-4-0709 | 13.8% | $7.13 | 😢 | xAI Grok 4 0709 (unable to complete) | 2025-08-24 |
| openai/gpt-5-nano-high | 12.3% | $1.09 | | OpenAI gpt-5-nano reasoning_effort="high" | 2025-09-26 |
| anthropic/claude-sonnet | 12.3% | $5.32 | | Sonnet 4.0 reasoning=false | 2025-08-22 |
| anthropic/claude-opus-high | 12.3% | $45.00 | | Opus 4.1 reasoning_budget=8192 | 2025-08-22 |
| anthropic/claude-opus | 12.3% | $28.85 | | Opus 4.1 reasoning=false | 2025-08-22 |
| gcp/qwen-3-coder | 12.3% | $1.50 | ⚠️ | qwen3-coder-480b-a35b-instruct | 2025-08-22 |
| openai/o3-high | 10.8% | $6.14 | | OpenAI o3 reasoning_effort="high" | 2025-08-22 |
| openai/gpt-5-mini-high | 10.8% | $3.49 | | OpenAI gpt-5-mini reasoning_effort="high" | 2025-09-26 |
| azure/deepseek-r1 | 10.8% | $10.00 | ⚠️ | Azure Foundry MSAI-DS-R1 | 2025-08-23 |
| xai/grok-code-fast-1 | 10.8% | $2.09 | | xAI Code Fast 1 | 2025-08-29 |
| xai/grok-3-mini | 10.8% | $0.94 | | xAI Grok 3 Mini | 2025-08-29 |
| gcp/gpt-oss-120b-high | 10.8% | $0.36 | | gpt-oss-120b reasoning_effort="high" | 2025-08-24 |
| gcp/gpt-oss-20b-high | 9.3% | $0.19 | | gpt-oss-20b reasoning_effort="high" | 2025-08-24 |
| openai/o4-mini | 9.3% | $3.21 | | OpenAI o4-mini reasoning_effort="high" | 2025-08-22 |
| openai/gpt-5 | 7.6% | $9.60 | | OpenAI gpt-5 | 2025-08-22 |
| openai/gpt-5-mini | 7.6% | $1.20 | | OpenAI gpt-5-mini | 2025-08-22 |
| google/gemini-flash-high | 7.6% | $3.12 | | Gemini 2.5 Flash reasoning_effort="high" | 2025-08-22 |
| xai/grok-3 | 7.6% | $3.96 | | xAI Grok 3 | 2025-08-23 |
| meta/llama-4-maverick | 7.6% | $0.24 | | Vertex AI Llama 4 Maverick | 2025-08-24 |
| gcp/gpt-oss-120b | 7.6% | $0.35 | | gpt-oss-120b | 2025-08-22 |
| gcp/gpt-oss-20b | 6.1% | $0.14 | | gpt-oss-20b | 2025-08-22 |
| openai/gpt-5-nano | 6.1% | $0.47 | | OpenAI gpt-5-nano | 2025-08-22 |
| gcp/deepseek-r1 | 6.1% | $10.00 | ⛔ ⚠️ | Vertex AI DeepSeek R1-0528 | 2025-08-23 |
| anthropic/claude-haiku | 3.0% | $1.72 | | Claude 3.5 Haiku | 2025-08-22 |
| gcp/codestral | 1.5% | $0.21 | | Mistral AI Codestral | 2025-08-22 |
| meta/llama-4-scout | 0.0% | $0.24 | | Vertex AI Llama 4 Scout | 2025-08-22 |

Explanation of Notes

✅: Good cost-performance efficiency in performance tier - recommended model.

⛔: Unusually expensive for performance tier. Consider alternatives.

😢: Model was unable to complete the benchmark (e.g., an infinite loop in the reasoning stage). Results are estimated.

💰: Due to high cost, the model is best reserved for high-complexity tasks.

⚠️: May contain censorship and/or has weak guardrails compared to alternatives. Use with caution.
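
The efficiency notes boil down to score per dollar within a performance tier. The snippet below shows one plausible way to rank rows from the table on that basis; the ranking rule is our illustration, not the exact criterion used for the labels.

```python
# Illustrative ranking of benchmark rows by score per dollar.
# The ✅/⛔ labels above are judgment calls; this is just one way to quantify them.
rows = [
    ("google/gemini-pro-high", 16.9, 12.60),
    ("openai/gpt-5-high", 16.9, 16.80),
    ("gcp/gpt-oss-120b-high", 10.8, 0.36),
    ("anthropic/claude-opus-high", 12.3, 45.00),
]
for model, pct, cost in sorted(rows, key=lambda r: r[1] / r[2], reverse=True):
    print(f"{model:32s} {pct:5.1f}% at ${cost:6.2f} -> {pct / cost:6.2f} %-pts/$")
```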

Current CBorg Model Mappings

lbl/cborg-coder: gpt-oss-120b-high

lbl/cborg-deepthought: gpt-oss-120b-high

lbl/cborg-mini: gpt-oss-20b-high

lbl/cborg-chat: llama-4-scout
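
For scripts that target the CBorg API, the mappings above are effectively an alias table. A minimal sketch of resolving them follows; the helper function is hypothetical, not part of the CBorg client.

```python
# Hypothetical alias-resolution helper mirroring the current CBorg mappings.
CBORG_ALIASES = {
    "lbl/cborg-coder": "gpt-oss-120b-high",
    "lbl/cborg-deepthought": "gpt-oss-120b-high",
    "lbl/cborg-mini": "gpt-oss-20b-high",
    "lbl/cborg-chat": "llama-4-scout",
}

def resolve_model(name: str) -> str:
    """Map a CBorg alias to its underlying model; pass through non-aliases."""
    return CBORG_ALIASES.get(name, name)

print(resolve_model("lbl/cborg-coder"))  # -> gpt-oss-120b-high
```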

Changelog

Sept 26th, 2025

  • Added results for gpt-5 variants with reasoning_effort=“high” configuration
  • gpt-5-high tied with gemini-2.5-pro-high for the highest score
  • Added gpt-5-codex result

Aug 29th, 2025

  • Recalculated Grok 3 Mini costs using xAI API pricing (approx. 50% cheaper than Azure Foundry)
  • Added results for new Grok Code Fast 1 model

Aug 24, 2025

  • Corrected error in gpt-oss benchmarks with -high setting and updated scores
  • Added xAI Grok 4 0709; the model was unable to complete 4 problem sets after 2 hours of retries.
  • Added Meta Llama 4 models.

Aug 23, 2025

  • Added Azure MSAI-DS-R1 (DeepSeek R1 post-trained by Microsoft AI)
  • Re-ran Qwen 3 Coder benchmark to verify cost (no change)
  • Added xAI Grok 3. Grok 3 Mini was unable to complete the benchmark; will try again
  • Added DeepSeek R1 0528 from Vertex AI