Why LLMs Need Agents—And Economists Need Both (AI Integration 2/4)
What Are Agents? The Move Beyond Pure LLMs
In the previous post, I argued that there's value in economists becoming more agentic in incorporating AI into their workflows—moving beyond ad hoc ChatGPT queries to systematic integration that transforms how we actually work. I view the 46.2 percentage point gap between standalone LLMs and integrated systems as a call to action, though it's certainly up to each researcher whether to heed it. For those interested in exploring AI's potential in economics research, I believe there's value in shifting focus from AI models alone to AI integrations.
This focus on integration aligns with impressive results in AI that don't always get mainstream attention. While headlines focus on increasingly powerful universal models like ChatGPT and Claude, there has been a parallel "shift from models to compound AI systems," where state-of-the-art results increasingly come from carefully engineered systems with multiple interacting components, not just monolithic models.
The core insight is that non-LLM components systematically mitigate the main limitations of LLMs: hallucinations and the lack of guaranteed correctness. Consider AlphaCode 2, which generates up to 1 million possible coding solutions, then uses filtering and testing systems to identify the few that actually work. Or AlphaGeometry, which combines an LLM that suggests geometric constructions with a traditional symbolic solver that verifies mathematical correctness. In both cases, the LLM provides creative generation while other components handle verification—a division of labor that achieves results neither component could manage alone.
AI researcher Subbarao Kambhampati formalizes this insight in his LLM-Modulo framework: LLMs tightly interacting with external verifiers can possess abilities that neither component has in isolation. The LLM's broad knowledge helps identify promising candidates for verification while symbolic components ensure correctness. This is particularly relevant for economics because we already have sophisticated mathematical solvers and evaluation methods; the bottleneck is often finding the right candidates to verify. LLMs excel at generating plausible hypotheses, model specifications, or empirical strategies that our existing tools can then rigorously test, making our analytical capabilities far more accessible and useful.
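To make this generate-and-verify pattern concrete, here is a minimal sketch in Python. The function names (llm_propose, verify) are placeholders for whatever model API and domain-specific checker you would actually use; this illustrates the pattern, not the internals of AlphaCode 2 or AlphaGeometry.

```python
# Minimal sketch of the generate-and-verify pattern: the LLM proposes many
# candidates, a trusted verifier keeps only those that pass.
# llm_propose and verify are illustrative placeholders, not a real API.

def llm_propose(problem: str, n: int) -> list[str]:
    """Ask an LLM for n candidate solutions (model API call omitted)."""
    raise NotImplementedError("wire up your preferred model API here")

def verify(problem: str, candidate: str) -> bool:
    """Domain-specific check: run a test suite, a symbolic solver, etc."""
    raise NotImplementedError("plug in your existing solver or tests here")

def generate_and_verify(problem: str, n_candidates: int = 100) -> list[str]:
    candidates = llm_propose(problem, n_candidates)        # creative generation
    return [c for c in candidates if verify(problem, c)]   # keep only verified answers
```

The division of labor is explicit: the model's job is breadth of proposals, the verifier's job is correctness.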
The coding systems that have most successfully integrated into developer workflows—initially GitHub Copilot, and now Cursor and similar tools—follow these same principles, combining models with execution environments and verification loops. These tools and their associated workflows provide a natural starting point for economists thinking about AI-based research workflows. We'll explore what economists can learn from these coding integrations in the post on workflows.
What exactly are these compound AI systems? They tackle AI tasks through multiple interacting components: multiple calls to models, retrievers, external tools, and verification loops. Perhaps the most vivid definition comes from Solomon Hykes, highlighted by Simon Willison: "an LLM wrecking its environment in a loop."
While provocative, this phrasing captures something essential about what economists need from AI integration. "Wrecking its environment" means the LLM doesn't just generate text—it actually changes things: editing files, running queries, updating databases, creating new documents. The "loop" means it can observe the results of its actions and adjust accordingly, rather than just producing a single output and stopping.
Agents deliver this capability through three core abilities, which we can see clearly in coding systems (a minimal code sketch follows this list):
Tool use: Command line execution that allows running code and editing files directly, rather than just suggesting what to type.
Reflection: Seeing the outcomes of running code—whether it worked, what errors occurred, what the output looks like—and adjusting accordingly.
Planning: Arranging sequences of tool use to achieve complex objectives—perhaps running tests first, then editing code based on failures, then running tests again.
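To see how these three abilities interlock, here is a hypothetical sketch of an agent loop in Python. The call_llm and run_command functions are stand-ins, not any particular product's API; the point is the structure: plan the next action, act through a tool, reflect on the observed result, and repeat.

```python
# Illustrative agent loop combining planning, tool use, and reflection.
# call_llm is a placeholder for a real model API; run_command is the tool.
import subprocess

def call_llm(prompt: str) -> dict:
    """Return the model's next decision, e.g. {"action": "run", "arg": "pytest"}."""
    raise NotImplementedError("model API call goes here")

def run_command(cmd: str) -> str:
    """Tool use: execute a shell command and capture its output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def agent(task: str, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):                       # the plan unfolds step by step
        decision = call_llm("\n".join(history))      # choose the next action given history
        if decision["action"] == "finish":
            return decision["arg"]                   # final answer for the user
        observation = run_command(decision["arg"])   # tool use: act on the environment
        history.append(f"Ran: {decision['arg']}\nSaw: {observation}")  # reflection
    return "Stopped after max_steps without finishing."
```

Real coding agents add guardrails (sandboxing, permission prompts, step budgets), but the plan, act, reflect skeleton is the same.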
These capabilities transform how we think about AI integration in economics. Instead of treating ChatGPT as sophisticated autocomplete, we can design systems that actually participate in our research workflows—reading papers, managing citations, cleaning data, running analyses, and drafting sections that require connecting multiple sources.
Integration Friction: A Problem Economists Should Recognize
To understand why agents matter for research, we need to think about frictions—the costs that prevent efficient transactions or, in our case, efficient workflow integration. Economists study frictions everywhere: transaction costs in markets, search costs in labor economics, coordination costs in organizations. Now we're experiencing a particular friction firsthand in our own work.
Consider your research environment: internal notes, literature PDFs, datasets, code files, analysis results, mathematical derivations, draft manuscripts. Now consider an LLM—ChatGPT in your browser or Claude through an API. Between your work environment and the LLM, there are arrows representing different types of information flow:
Copy-paste arrows: You manually extract context from your files to give the LLM, then manually incorporate its outputs back into your environment.
Verification arrows: You check whether LLM outputs are consistent with your notes, whether claimed facts appear in the literature, whether generated code actually runs and produces sensible results.
Navigation arrows: You search through your notes and literature to find relevant information, then piece together context across multiple sources.
If you use AI in the browser, you manually handle all these arrows. Every context switch, every copy-paste operation, every verification step requires your manual intervention. This human ownership of information flow creates friction—meaningful costs that make LLM integration less valuable than it could be.
This friction deserves economists' attention for both methodological and practical reasons. We study how technological capabilities translate into productivity gains, often focusing on adoption frictions and complementary organizational changes. Now we can observe these dynamics directly in our own workflows. The LLM capabilities exist, but realizing their potential requires systematic reduction of integration frictions.
This is also why generic "use AI for research" advice often falls flat. Telling someone to "ask ChatGPT for help with literature reviews" doesn't address the underlying friction: manually feeding context in and manually integrating outputs out. Of course, you can simply ask ChatGPT questions without providing context or establishing feedback loops—but this approach suffers from an even worse version of the invisible failures I discuss with Deep Research systems in the next post. Without proper context or verification mechanisms, you get plausible-sounding but potentially incorrect answers that are nearly impossible to validate.
The value proposition improves dramatically when we can reduce integration friction by letting agents handle more of the information flow directly.
A Stone Soup Example: The Six Steps of an AI Search Agent
The stone soup folk tale tells of a traveler who starts with just a stone and water, then gradually convinces villagers to contribute vegetables, meat, and seasonings until together they've created a rich soup that no one person could have made alone. AI researcher Subbarao Kambhampati (whom we met earlier with the LLM-Modulo framework) uses this stone soup metaphor to explain how AI reasoning works—starting with an LLM component and gradually building sophisticated capabilities through systematic combination of tools and training methods. Modern AI agents work similarly—they start with your simple question and gradually add layers of capability (search, verification, synthesis) to deliver answers that neither you nor the LLM could produce in isolation.
To illustrate this process, let's walk through a simple example: using an AI assistant to answer the question, "Who won the Formula 1 championship in 2023?"
While the user only provides this single prompt, the agent performs a series of internal steps to deliver a reliable answer. Here is how the process unfolds:
1. Initial LLM Query
The agent receives the user's question: "Who won the Formula 1 championship in 2023?" The LLM generates an initial response, such as: "I'm not sure. My training data might be outdated. You may need to check an updated source." The user doesn't see this response.
2. Decide Whether a Search is Needed
A hidden prompt evaluates the LLM's answer for uncertainty. For example: "Does the answer indicate uncertainty? Answer with 'YES' or 'NO' only." If the answer is "YES," the agent proceeds to the next step.
3. Generate a Search Query
Through another hidden LLM call, the agent reformulates the user's question into a concise, search-engine-friendly query: "2023 Formula 1 World Champion winner."
4. Perform the Search and Retrieve Results
The agent executes the search programmatically via tool use, passing the generated query to a search engine and retrieving results from sources like ESPN, BBC Sport, and Formula1.com. For example: "Max Verstappen clinches the 2023 F1 championship title."
5. Summarize Search Results
A hidden prompt instructs the agent to synthesize the retrieved information: "Summarize the key takeaway in one sentence." The agent produces: "Max Verstappen won the 2023 Formula 1 championship after securing the title at the Qatar Grand Prix."
6. Generate the Final Augmented Response
Finally, the agent crafts a user-facing reply that incorporates the verified information: "Max Verstappen won the 2023 Formula 1 championship after securing the title at the Qatar Grand Prix. Let me know if you'd like more details about the season!"
This multi-step process, largely invisible to the user, demonstrates how agents can combine LLM reasoning, prompt engineering, and external tool use to deliver accurate, up-to-date answers.
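For readers who want to see the mechanics, here is what those six hidden steps might look like as code. This is a stylized sketch: call_llm and web_search are placeholder functions standing in for a real model API and search tool, and the prompts simply mirror the hidden prompts described above.

```python
# Stylized version of the six hidden steps. call_llm and web_search are
# placeholders for a real model API and a real search tool.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("model API call goes here")

def web_search(query: str) -> str:
    raise NotImplementedError("search tool goes here")

def answer(question: str) -> str:
    draft = call_llm(question)                                   # 1. initial LLM query
    verdict = call_llm(                                          # 2. decide whether a search is needed
        f"Does the answer indicate uncertainty? Answer with 'YES' or 'NO' only.\n{draft}"
    )
    if verdict.strip().upper() != "YES":
        return draft
    query = call_llm(f"Rewrite as a concise search query:\n{question}")   # 3. generate a search query
    results = web_search(query)                                  # 4. perform the search via tool use
    summary = call_llm(f"Summarize the key takeaway in one sentence:\n{results}")  # 5. summarize results
    return call_llm(                                             # 6. final augmented response
        "Answer the user's question using this verified information.\n"
        f"Question: {question}\nInformation: {summary}"
    )
```

Nothing in this sketch is specific to sports trivia; swap the search tool for a literature database or a data catalog and the same skeleton applies to research questions.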
Like the stone soup story, each step adds a crucial ingredient: the LLM contributes the orchestration (the stone), uncertainty detection adds quality control (the vegetables), search tools provide current information (the meat), and synthesis creates the final coherent answer (the seasoning). What appears to be a simple question-and-answer interaction is actually a coordinated process where multiple capabilities combine to create value that exceeds what any single component could achieve alone.
Recall the earlier environment–LLM–arrows mental picture: in an LLM-only workflow, the human researcher handles all the information flows—manually searching for information, verifying facts, and integrating results into their work. What the six hidden steps above reveal is a systematic reduction of integration friction through automated information flows. Instead of the human having to break context, run searches, check sources, and synthesize answers, the agent now performs these tasks internally. The user experiences a seamless, high-quality answer, but beneath the surface, the agent has automated the information flows that once required manual effort. This shift reduces friction and expands what is possible, in both speed and scale, for research workflows.
Preview: In the next post, we'll move from these foundational concepts and examples to explore advanced agentic systems, and in the final post we will think about how economists can design their own research workflows for maximum transparency and efficiency.