The Right Tool for the Mission: Big + Small AI in Government

Presented by REI Systems

A powerful new tool has landed in the government’s toolbox—sleek, versatile, and seemingly capable of handling any task. Generative AI (GenAI) is making its way not only into federal agencies but also into regulated sectors such as finance, healthcare, and defense. Large language models (LLMs) like GPT-4 have captured attention because they can draft text, summarize material, and reason across domains—all through a single interface. It was easy to believe one tool could replace every other in the box.

Yet government work is rarely about reaching for the same instrument every time. Missions demand accuracy, accountability, and efficiency. Agencies are now learning that while large models have their place, smaller and more specialized systems often serve the mission more effectively. The shift required is one of nuance: not “big or bust,” but “big and small, each where appropriate.”

Agencies have followed their commercial-industry peers in piloting LLMs for knowledge search, policy drafting, summarization, and conversational assistants. The Department of Defense, for example, has tested them for intelligence analysis, logistics planning, and even use at the tactical edge. Civilian agencies such as the FDA, GSA, and USPTO have also experimented with LLM-powered tools for knowledge management, intelligent search, casework assistance, and drafting official communications. In parallel, the commercial sector has adopted similar approaches—financial firms grounding GPT-4 in proprietary research to support advisors, and healthcare, pharmaceutical, and legal organizations using large models to streamline documentation, discovery, and analysis. Together these pilots show promise but also reveal a trend: federal agencies, like their private-sector counterparts, are beginning to default to LLMs even for simple tasks—document classification, eligibility checks, FAQ responses—where smaller, targeted models would often be more accurate and, even more critically, faster and cheaper. This “LLM-first” reflex needs careful reconsideration if governments want sustainable, cost-effective, and trustworthy AI adoption.

The Challenge with Relying Only on Large Models

Large models face serious limitations when treated as one-size-fits-all solutions. Accuracy is the first concern: trained on vast internet data, many LLMs occasionally “hallucinate,” producing answers that sound convincing but are wrong. In mission-critical services, even a one-percent error rate is unacceptable. Cost and latency add pressure: each call to a frontier model can be an order of magnitude more expensive—and slower—than a smaller alternative, a problem that compounds at scale. Privacy and compliance remain hurdles. Most frontier models are proprietary black boxes, leaving agencies unable to verify how data is handled or to explain outputs. Finally, reliance on a handful of vendors poses resilience risks. A change in terms or pricing, or a simple service outage, could disrupt every dependent system.

Where Small Models Shine

Smaller, domain-specific models offer distinct advantages. Trained or fine-tuned on trusted, purpose-built datasets, they often surpass generic LLMs in task-specific accuracy. They deliver sub-second responses, require modest compute, and can be deployed on-premises or even at the edge—ideal for field operations or citizen-facing portals. Their costs are far lower, sometimes 10–30 times cheaper per query. They also offer stronger governance: deployed inside the agency’s own environment, they keep agencies in control of security, compliance, and auditability. In short, small models give governments speed, precision, and sovereignty—especially for well-scoped tasks where correctness outweighs general reasoning breadth, such as running computer vision models directly on TSA or traffic-management devices, or using compact language models like Mistral 7B to parse large document sets securely and efficiently, as some agencies already have.
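To make that concrete, the fragment below is a minimal sketch of an on-premises deployment: a compact, open-weights model summarizing an agency document without any data leaving the agency boundary. It assumes the Hugging Face transformers library, and the model named here is an illustrative public stand-in, not a recommended agency choice.

```python
# Minimal sketch: an on-premises summarizer built on a compact model.
# Assumes the Hugging Face `transformers` package is installed; the model
# below is an illustrative public stand-in for a domain-tuned model.
from transformers import pipeline

# Weights download once and can be mirrored to an internal registry,
# so inference never sends documents outside the agency's environment.
summarizer = pipeline(
    "summarization",
    model="sshleifer/distilbart-cnn-12-6",  # ~300M parameters; runs on CPU
)

document = (
    "Applicants must file the required forms no later than 30 days "
    "before the close of the fiscal quarter, and late submissions "
    "are returned without review unless a waiver is granted."
)
summary = summarizer(document, max_length=40, min_length=10)
print(summary[0]["summary_text"])
```

A model of this size handles a single well-scoped task; the breadth of a frontier model is traded away for speed, cost, and full control of the data path.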

Getting the Mix Right

The best strategies treat large and small models as collaborators in hybrid enterprise systems. The collaboration model can be set up in different ways: 

Agentic AI: Orchestration layers break complex tasks into subtasks and route them to the right tool—a classifier here, a retrieval system there, and a large model only when open-ended reasoning is needed.

Model cascades: Focused, domain-specific queries go first to smaller models and escalate to larger ones only when confidence is low. Studies show cascades can match GPT-4’s accuracy while cutting costs by half or more.

Hybrid domain-tuned RAG: Agencies can reduce cost, latency, and hallucination risk by pairing small, domain-tuned models with retrieval-augmented generation (RAG), grounding outputs in vetted, agency-owned data rather than relying on frontier LLMs. In this setup the small model handles most queries effectively, escalating to a larger model only for rare, cross-domain, or novel questions, making it a more resilient and efficient default (a minimal sketch of this pattern follows this list).

Edge–cloud hybrids: Lightweight models run locally at the edge or in the field for latency- or privacy-sensitive tasks, while occasional escalations to cloud-hosted LLMs handle the most complex cases. This aligns with the DoD’s recognition that many missions require offline operability on standard hardware.
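The control flow behind the hybrid RAG-plus-cascade pattern fits in a few lines. The sketch below, in Python, uses toy stand-ins for the retriever and both models—everything named here is illustrative, not a production implementation. The point is the routing logic: ground every query in vetted data, answer with the small model by default, and escalate only when its confidence falls below a tuned threshold.

```python
# Minimal sketch of a hybrid domain-tuned RAG cascade. The retriever and
# both model calls are toy stand-ins; only the routing logic matters.
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # 0.0-1.0, e.g. derived from log-probs or a calibrator

def retrieve_passages(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Keyword-overlap retriever standing in for real vector search
    over vetted, agency-owned documents."""
    words = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(words & set(p.lower().split())))[:k]

def small_model_answer(query: str, context: list[str]) -> Answer:
    """Stand-in for an on-prem, domain-tuned small model."""
    if context and set(query.lower().split()) & set(context[0].lower().split()):
        return Answer(f"Per agency guidance: {context[0]}", confidence=0.9)
    return Answer("No grounded answer found.", confidence=0.2)

def large_model_answer(query: str, context: list[str]) -> Answer:
    """Stand-in for the rare, audited escalation to a frontier LLM."""
    return Answer(f"[large-model synthesis for: {query}]", confidence=0.7)

CONFIDENCE_FLOOR = 0.8  # tune per task; monitor how often escalation fires

def answer_query(query: str, corpus: list[str]) -> Answer:
    context = retrieve_passages(query, corpus)   # ground in vetted data
    answer = small_model_answer(query, context)  # cheap, fast default path
    if answer.confidence >= CONFIDENCE_FLOOR:
        return answer                            # most traffic stops here
    return large_model_answer(query, context)    # exceptional escalation

corpus = [
    "Grant reports are due 90 days after the award period ends.",
    "Eligibility requires registration in SAM.gov before applying.",
]
print(answer_query("When are grant reports due?", corpus).text)
```

The confidence threshold is the governance lever: raise it and more traffic escalates to the expensive path; lower it and more stays on the fast local one. Either way, every routing decision can be logged and audited.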

A Decision Framework for U.S. Agencies

| Criterion | Choose Small Model(s) | Escalate to Large Model(s) |
| --- | --- | --- |
| Mission criticality & accuracy | Benefits, compliance, regulated guidance—domain-tuned models with citations and review. | Ambiguous or complex reasoning beyond domain scope. |
| Data sensitivity | Classified or regulated data—use open-source or on-prem small models. | Only if secure channels exist and outputs are auditable. |
| Latency & cost | High-volume, real-time citizen services—fast, low-cost responses. | Occasional, nuanced, or infrequent requests. |
| Task scope | Stable, well-scoped tasks—extraction, classification, summarization. | Open-ended, cross-domain synthesis needing broad context. |
| Resilience & operability | Multi-model orchestration with monitoring, human checkpoints, vendor flexibility. | Keep as backup for exceptional cases, not as default. |

Managing the Tradeoffs

A multi-model strategy carries risks, but they’re manageable. Treat models as microservices to reduce integration errors, use cascades to handle edge cases, and ground outputs with vetted data to protect privacy and compliance. Avoid vendor lock-in by keeping options interchangeable and expertise in-house. Above all, enforce responsible AI practices with audits, documentation, governance, and human oversight. Done well, hybrid approaches are safer than reliance on a single black-box model.
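One concrete way to keep options interchangeable is to hide every model, large or small, behind a single in-house interface, so that swapping a vendor is a deployment decision rather than a rewrite. The sketch below shows the idea in Python; the class and function names are illustrative assumptions, not any particular vendor’s SDK.

```python
# Minimal sketch: every model sits behind one in-house contract, so
# vendors stay swappable and integrations stay uniform.
from typing import Protocol

class TextModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class OnPremSmallModel:
    """Illustrative adapter for an internally hosted, domain-tuned model."""
    def generate(self, prompt: str) -> str:
        return f"[small-model reply to: {prompt}]"

class VendorLargeModel:
    """Illustrative adapter wrapping a commercial API behind the same contract."""
    def generate(self, prompt: str) -> str:
        return f"[vendor reply to: {prompt}]"

def draft_response(model: TextModel, prompt: str) -> str:
    # Callers depend only on the Protocol, never on a specific vendor SDK,
    # so replacing a model does not ripple through application code.
    return model.generate(prompt)

print(draft_response(OnPremSmallModel(), "Summarize the reporting requirements."))
```

The same seam is also where monitoring, audit logging, and human-review checkpoints naturally attach, since every model call passes through it.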

The Right Tool for the Right Job

The U.S. public sector doesn’t need the biggest tool for every job—it needs the right one for the mission at hand. The modern playbook is clear: orchestrate with agents for scale, cascade for efficiency, ground with retrieval for truth, and balance edge and cloud for resilience. Industry leaders show how guardrails build trust; defense highlights why compact, edge-ready models are indispensable. According to Gartner, organizations are projected to use small, task-specific AI models three times more than general-purpose LLMs by 2027—a clear signal of the shift toward domain-specific systems. For government, the challenge is to stop reflexively reaching for the “power drill” when a scalpel, a wrench, or a hammer will serve better. In this next phase, Big AI and Small AI together can deliver the cost-effective accuracy, speed, security, and accountability that citizens rightly expect.


By Anand Trivedi, REI Systems’ AI Offering Lead, and Avi Mehta, Principal AI Engineer, REI Systems

This content is made possible by our sponsor REI Systems; it is not written by and does not necessarily reflect the views of GovExec's editorial staff. 
