We know that reviewing and managing contracts with AI has made a meaningful difference for legal professionals. And as more and more companies adopt AI to review and extract intelligence from contracts, we’re flowing more and more data through frontier AI models. Large enterprises manage, on average, 350 contracts a week. That’s a lot of information to query, and doing so can cost a great deal in both money and time.
As AI has grown and matured, however, new, smaller, and more focused models have emerged that can handle different types of queries. So my research right now centers on being able to dynamically decide which types of queries should go to which types of models. It’s like using a scalpel vs a bazooka. Some use cases are best suited for small, focused models, and others are best suited for generalist models.
You want to use smaller, simpler models for simpler questions, and reserve the big frontier model for when it's actually needed. That way you preserve quality while gaining speed and cost efficiency. Smaller models are dramatically faster. The idea I'm thinking a lot about is: what is a simple question, and what is not a simple question? Once the tool can tell the difference, it can route accordingly.
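A minimal sketch of what such a router might look like. Everything here is illustrative: the keyword heuristics stand in for a learned complexity classifier, and the model names are placeholders, not real endpoints.

```python
# Sketch of a complexity-based query router.
# The patterns and model names are illustrative placeholders only.

SIMPLE_PATTERNS = ("expiration date", "governing law", "party name", "renewal term")

def is_simple(query: str) -> bool:
    """Crude stand-in for a learned complexity classifier:
    short, extraction-style questions are treated as simple."""
    q = query.lower()
    return len(q.split()) <= 12 and any(p in q for p in SIMPLE_PATTERNS)

def route(query: str) -> str:
    """Return which model tier should handle the query."""
    return "small-extraction-model" if is_simple(query) else "frontier-model"

print(route("What is the governing law of this contract?"))
# -> small-extraction-model
print(route("Summarize the indemnification risks across these 40 agreements"))
# -> frontier-model
```

In practice the `is_simple` step would itself be a small, fast classifier, so the routing decision adds negligible latency compared to the frontier-model call it can avoid.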
The models that people think of as "AI", like ChatGPT or Claude, were designed from the start to be generalists. They were trained on essentially the entire web, and kept growing in size until researchers said, okay, these are too big, let's try to shrink them while preserving capability.
Then a different line of research emerged: what if we made models much smaller and trained them for specific tasks? It turns out that for narrow, well-defined tasks, smaller models can actually beat frontier models, especially after fine-tuning. Recent examples include Chroma’s Context-1 model, which handles information retrieval while being 10x faster and 25x cheaper; Cursor’s Compose-2, a fine-tuned version of Kimi K2.5 that provides frontier-level coding capability at a fraction of the cost; and Intercom’s Fin Apex, a model fine-tuned for customer service that beats frontier models, with 65% fewer hallucinations and a better resolution rate.
Using frontier models for everything is fine until it isn't. It's expensive, it's slow, and it's entirely outside our control. These models might live on Anthropic's infrastructure, or Azure, or whatever cloud platform. If there's an outage, we're dependent on someone else to fix it. Any centralized service has a risk of disruption due to geopolitical instability, or environmental issues, or any number of factors.
The nice thing about smaller models is that you have much more control over them. You can host them yourself. You could run them locally, or on servers in whatever region your clients require, perhaps to meet UK data residency requirements, for example. There's engineering work involved in hosting and scaling that infrastructure, but you gain control of the model and independence from pricing changes, and the workload is far less resource-intensive, at a fraction of the cost.
One of the things I find most fascinating about the AI legal space is that the biggest blocker to my research is the lack of legal benchmarks. Legal has almost no benchmarks. There are a few contract-specific ones out there — limited, mostly English-only, not perfect. But the deeper issue is that legal work has never been systematically evaluated. Medicine has outcomes. Code either works or it doesn't. In the legal industry, if a contract term is wrong, sometimes nobody finds out until something goes to litigation.
As we start using AI for more aspects of legal work, we need solid, curated benchmarks so that when we release a new model or update an existing one we can compare them: here are the questions, here are the answers we got yesterday, here's what we got today. What changed? Is it better or worse? We're trying to apply quantification to a field that's never had it before. There are really only two options: either trust the frontier model completely, which is risky, because frontier models can make mistakes on surprisingly simple things, or get humans to actually label a set of contracts and build a reference benchmark from that.
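The comparison step above can be sketched in a few lines. The questions, answers, and labels below are made up for illustration, and exact-match scoring is a simplification; a real legal benchmark would need more forgiving answer matching.

```python
# Compare two model runs against a human-labeled reference benchmark.
# All questions and answers here are invented examples.

reference = {
    "What is the renewal term?": "12 months",
    "What is the governing-law jurisdiction?": "England and Wales",
}

yesterday = {
    "What is the renewal term?": "12 months",
    "What is the governing-law jurisdiction?": "Delaware",
}
today = {
    "What is the renewal term?": "12 months",
    "What is the governing-law jurisdiction?": "England and Wales",
}

def accuracy(run: dict, ref: dict) -> float:
    """Fraction of benchmark questions the run answered exactly right."""
    correct = sum(run.get(q) == a for q, a in ref.items())
    return correct / len(ref)

old_score, new_score = accuracy(yesterday, reference), accuracy(today, reference)
print(f"yesterday: {old_score:.0%}, today: {new_score:.0%}")
print("regression" if new_score < old_score else "no regression")
```

The point is not the scoring function, which is trivial, but the human-labeled `reference` dictionary: without that curated ground truth, there is nothing to diff a new model release against.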
A number of larger companies are already moving toward dynamic model selection, and my prediction is that legal AI vendors will follow. I believe that thinking through which use cases need a scalpel and which need a bazooka will have a strong influence on how legal AI products are built.
