I don’t know about “excellence” with my score of 70%, but hey, I did it. It was my first AI-focused project, as opposed to merely using AI in software development. The final assignment was indeed an actual project: build out a basic template into an agent that can answer questions from the GAIA benchmark.
The task was simple enough: your agent gets 20 questions in natural language and needs to answer each with an exact (string-matched) factual answer. The first question, for instance, is:
How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.
A side remark: while I think that’s a really interesting and relevant challenge, one can debate how agentic it is. Answering questions is what LLMs do. This agent doesn’t effect any changes in the real world. I’m fine calling it an agent, though, because we need to use tools to be successful in this challenge, and the AI decides which ones to use. The agent can also interpret the LLM’s output and decide to ask follow-up questions, or try a different approach. That’s the autonomy it has. Anyway, “agent” is the hot term of the season, so let’s accept it and move on.
The Hugging Face template is a Python program that can retrieve the questions, give them to the agent, and submit the results, so the boilerplate is mostly taken care of. I added an option to answer only the first N questions, so there’s no need to wait for all 20 while the baby agent can’t answer any of them yet.
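For illustration, that limiting logic can be as small as an environment-variable-guarded slice; MAX_QUESTIONS below is a hypothetical name, not necessarily what the template or my code uses.

import os

def limit_questions(questions: list[dict], env_var: str = "MAX_QUESTIONS") -> list[dict]:
    """Return only the first N questions if the (hypothetical) env var is set."""
    max_questions = int(os.getenv(env_var, "0"))
    return questions[:max_questions] if max_questions > 0 else questions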
For the agent framework, I went with smolagents. LangGraph seems to be the industry standard, but I liked the idea of sticking with something small and simple. Both are covered in the course. It went pretty well, except at the beginning, where I got confused about which agent class to use; CodeAgent was the answer. Other than that, the docs are good, especially the tutorials such as Building good agents.
For the model, I experimented with different ones and built a way to select one via an environment variable:
provider_configs = {
    "anthropic": ("Anthropic", "ANTHROPIC_MODEL", "claude-sonnet-4-5-20250929", "ANTHROPIC_API_KEY", None),
    "gemini": ("Google Gemini", "GEMINI_MODEL", "gemini/gemini-2.5-pro", "GEMINI_API_KEY", None),
    "ollama": ("Ollama", "OLLAMA_MODEL", "ollama_chat/deepseek-r1:70b", None, "OLLAMA_API_BASE"),
}
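Each entry maps a provider key to (display name, model env var, default model, API key env var, API base env var). Here is a minimal sketch of how such a mapping can be turned into a model instance, assuming smolagents’ LiteLLMModel and a hypothetical MODEL_PROVIDER variable; my actual selection code may differ.

import os
from smolagents import LiteLLMModel

def build_model(provider: str) -> LiteLLMModel:
    # Unpack: display name, model env var, default model id, API key env var, API base env var.
    _name, model_var, default_model, key_var, base_var = provider_configs[provider]
    return LiteLLMModel(
        model_id=os.getenv(model_var, default_model),
        api_key=os.getenv(key_var) if key_var else None,
        api_base=os.getenv(base_var) if base_var else None,
    )

model = build_model(os.getenv("MODEL_PROVIDER", "gemini"))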
The deepseek-r1:70b model, running locally via Ollama, did really well, but in the end I used Gemini because it can natively watch and understand videos given a YouTube URL, which at least one question required.
The meat of the agent is then essentially this call, plus the tools it references:
self.agent = CodeAgent(
    tools=[
        DuckDuckGoSearchTool(),
        BrowserLikeVisitTool(),
        fetch_wikipedia_page,
        fetch_wikipedia_sections,
        fetch_wikipedia_section,
        youtube_visual_analysis_tool,
    ],
    model=model,
    description="smolagents CodeAgent with DuckDuckGo, "
                "Wikipedia, YouTube visual analysis, and base tools",
    instructions=instructions,
    additional_authorized_imports=[
        "ddgs",
        "markdownify",
        "requests",
        "pandas",
        "io",        # for BytesIO to handle file bytes
        "openpyxl",  # for Excel file support
        "PIL",       # for image processing
    ],
    planning_interval=1,
)
Tell the agent which tools it can use and give it additional instructions. In my case, the instructions are simply a hard-coded text of about 400 words. They explain two things: how to use each tool and how to formulate the response. Most tools needed a detailed step-by-step guide; a sketch of the flavor follows.
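To be clear, this is not my actual prompt, just an illustration of the kind of plain text it contains:

instructions = """
You can search the web with DuckDuckGoSearchTool and open pages with
BrowserLikeVisitTool.

For Wikipedia questions: call fetch_wikipedia_sections first to see the
table of contents, then fetch_wikipedia_section for just the parts you need;
fetch_wikipedia_page fetches a whole page as a fallback.

For video questions: pass the full YouTube URL to youtube_visual_analysis_tool.

Answer with the exact value asked for: no units unless requested, no
explanations, no trailing punctuation.
"""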
The tools themselves I wrote with heavy Claude Code support; it is an AI challenge, after all. Given that I’m not too fluent in Python or Wikipedia internals, that was invaluable.
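As an example of what a smolagents tool looks like, here is a sketch in the spirit of fetch_wikipedia_page; it assumes the public MediaWiki API with plain-text extracts is good enough, and it is not my exact implementation.

import requests
from smolagents import tool

@tool
def fetch_wikipedia_page(title: str) -> str:
    """Fetch the plain-text content of an English Wikipedia page.

    Args:
        title: The exact title of the Wikipedia page, e.g. "Mercedes Sosa".
    """
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "titles": title,
            "format": "json",
        },
        timeout=30,
    )
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    # The API keys results by page id; take the first (and only) entry.
    return next(iter(pages.values())).get("extract", "")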
And that’s it. 1,500 lines of Python to answer 70% of the questions. Some takeaways: