Agent Benchmark

Agent evaluation across multiple QA datasets, with granular control over models, prompts, tools and memory as well as traces and metrics capturing with an agentic system built on top of smolagents.

Simple QA

A subset of 50 short factual questions from the smolagents/benchmark-v1 dataset. Questions ask for specific facts (names, dates, numbers) that typically require web search or Wikipedia lookup. Answers are short — usually a 4-digit year or a proper noun. Each entry includes reference URLs in true_reasoning for verification.

Example: "What year was the municipality of Ramiriquí, Boyacá, Colombia, founded?" → 1541

Simple QA answer sheet

Results

Model Agent Score
Gemini-2.5-flash Code Agent 46/50
Qwen3.5-4B Code Agent 45/50
Nemotron-3-Super-120B-A12B-NVFP4 Code Agent 38/50

Simple Math

A subset of 50 competition-style math problems from the same smolagents benchmark, originally sourced from the MATH dataset (Hendrycks et al.). Problems cover algebra, number theory, geometry, and combinatorics, often with LaTeX/Asymptote diagrams. Answers are numeric. Each entry includes a full worked solution in true_reasoning.

Example: "Rick is thinking of a positive factor of 14 and Steve is thinking of a positive factor of 42. If they are thinking of the same number, how many possible numbers could they be thinking of?" → 4

Simple Math answer sheet

Results

Model Agent Score
Gemini-2.5-flash Code Agent 50/50
Qwen3.5-4B Code Agent 48/50
Nemotron-3-Super-120B-A12B-NVFP4 Code Agent 50/50

GAIA-1

Questions from the GAIA benchmark GAIA evaluates AI assistants on real-world tasks requiring multi-step reasoning, tool use, and file handling.

Three difficulty levels:

  • Level 1: Generally requires 1–2 tools and few reasoning steps
  • Level 2: Requires combining multiple tools with more complex reasoning
  • Level 3: Requires long chains of actions, advanced reasoning, and diverse tool use
GAIA-1 answer sheet

Results

Model Agent Score
o4-mini / gpt-4o / Gemini-2.5-flash Multi-Agent 38/53