Agent evaluation across multiple QA datasets, with granular control over models, prompts, tools, and memory, plus trace and metrics capture, using an agentic system built on top of smolagents.
A subset of 50 short factual questions from the smolagents/benchmark-v1 dataset. Questions ask for specific facts (names, dates, numbers) that typically require web search or Wikipedia lookup. Answers are short — usually a 4-digit year or a proper noun. Each entry includes reference URLs in true_reasoning for verification.
Example: "What year was the municipality of Ramiriquí, Boyacá, Colombia, founded?" → 1541
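Because the answers are short, a normalized exact match is one plausible way to compare a prediction with the reference. The sketch below is an assumption for illustration only: the README does not say how answers are actually scored, and `normalize` and `is_correct` are hypothetical helper names.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", str(text))
    # Drop combining marks so "Ramiriquí" and "Ramiriqui" compare equal.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def is_correct(prediction: str, reference: str) -> bool:
    """Hypothetical scorer: exact match after normalization."""
    return normalize(prediction) == normalize(reference)


print(is_correct(" 1541. ", "1541"))  # True
print(is_correct("1542", "1541"))     # False
```

A stricter or looser comparison (for example, an LLM judge) may suit answers that are not a single year or name.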
| Model | Agent | Score |
|---|---|---|
| Gemini-2.5-flash | Code Agent | 46/50 |
| Qwen3.5-4B | Code Agent | 45/50 |
| Nemotron-3-Super-120B-A12B-NVFP4 | Code Agent | 38/50 |
A subset of 50 competition-style math problems from the same smolagents benchmark, originally sourced from the MATH dataset (Hendrycks et al.). Problems cover algebra, number theory, geometry, and combinatorics, often with LaTeX/Asymptote diagrams. Answers are numeric. Each entry includes a full worked solution in true_reasoning.
Example: "Rick is thinking of a positive factor of 14 and Steve is thinking of a positive factor of 42. If they are thinking of the same number, how many possible numbers could they be thinking of?" → 4
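The example answer can be checked by brute force: list the positive factors of each number and count the ones they share. This is only a verification sketch; `positive_factors` is a hypothetical helper, not part of the benchmark or the agent.

```python
def positive_factors(n: int) -> set[int]:
    """Return every positive divisor of n."""
    return {d for d in range(1, n + 1) if n % d == 0}


# Numbers both Rick (factor of 14) and Steve (factor of 42) could be thinking of.
common = positive_factors(14) & positive_factors(42)
print(sorted(common), len(common))  # [1, 2, 7, 14] 4
```

Every factor of 14 also divides 42, since 42 = 3 × 14, so the count equals the number of factors of 14.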
| Model | Agent | Score |
|---|---|---|
| Gemini-2.5-flash | Code Agent | 50/50 |
| Qwen3.5-4B | Code Agent | 48/50 |
| Nemotron-3-Super-120B-A12B-NVFP4 | Code Agent | 50/50 |
Questions from the GAIA benchmark. GAIA evaluates AI assistants on real-world tasks requiring multi-step reasoning, tool use, and file handling.
GAIA questions are grouped into three difficulty levels (Level 1 to Level 3).
| Model | Agent | Score |
|---|---|---|
| o4-mini / gpt-4o / Gemini-2.5-flash | Multi-Agent | 38/53 |
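The GAIA score is out of 53 questions while the other tables are out of 50, so converting the raw scores to percentages makes them easier to compare. A minimal sketch, assuming scores are kept as `correct/total` strings as in the tables; `accuracy` is a hypothetical helper.

```python
def accuracy(score: str) -> float:
    """Convert a 'correct/total' score string into a fraction in [0, 1]."""
    correct, total = (int(part) for part in score.split("/"))
    return correct / total


# Two scores taken from the tables in this README.
for score in ["46/50", "38/53"]:
    print(f"{score} -> {accuracy(score):.1%}")
# 46/50 -> 92.0%
# 38/53 -> 71.7%
```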