How to choose the best LLM for your project
If you are trying to get structured outputs from unstructured text using an LLM, you now have a lot of choices.
For example, as of 22 Sep 2025, OpenRouter lists more than 500 models, and new models are added every week.
Are you using OpenRouter?
Here is a measurable way to increase the ROI of your OpenRouter investment.
I used Claude to generate some charts showing the growth of LLMs on OpenRouter:
- Number of models added by month
- The median input price per million tokens has plummeted over time
- So has the median price per million output tokens
- The median context length has steadily increased, although it has plateaued a bit recently
Structured Outputs
Prolific LLM blogger Simon Willison said this recently:
I’ve suspected for a while that the single most commercially valuable application of LLMs is turning unstructured content into structured data. That’s the trick where you feed an LLM an article, or a PDF, or a screenshot and use it to turn that into JSON or CSV or some other structured format.
You can use the following step-by-step process to increase the ROI of your OpenRouter investment.
1 Test the LLM using a complex Pydantic schema
A good way to assess the quality of an LLM is to send it some input text along with a complex Pydantic schema and ask it to extract information from the input into structured output that satisfies the schema.
A quick look at the output will tell you right away that some LLMs simply do not have the quality needed to accomplish the task.
How to use Pydantic with the OpenRouter API
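As a starting point, here is a minimal sketch of this first test, sending a Pydantic schema to an OpenRouter model through its OpenAI-compatible endpoint. The model id, the API key placeholder, the invoice schema, and the sample text are all illustrative, and structured-output support varies by model, so treat this as a template rather than a recipe.

```python
from openai import OpenAI
from pydantic import BaseModel, ValidationError


# Illustrative schema -- in practice you would test with a much more complex one.
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float


class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    total: float
    line_items: list[LineItem]


# OpenRouter exposes an OpenAI-compatible API, so the standard SDK works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder
)

raw_text = "Acme Corp, invoice 1042: 3 widgets at 19.99 each, total 59.97."

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # any OpenRouter model id you want to test
    messages=[
        {"role": "system", "content": "Extract the invoice described by the schema."},
        {"role": "user", "content": raw_text},
    ],
    # Ask for JSON that conforms to the Pydantic model's JSON schema.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "schema": Invoice.model_json_schema(),
        },
    },
)

# Validate the response against the schema (steps 2 and 3 below do this at scale).
try:
    invoice = Invoice.model_validate_json(response.choices[0].message.content)
    print(invoice)
except ValidationError as exc:
    print("Response did not satisfy the schema:", exc)
```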
2 Measure syntactic validity
Once you have shortlisted some LLMs, send 100 requests to each one using the same complex Pydantic schema.
You will notice two things:
- Not all responses will include valid JSON
- Sometimes you will even get API errors because the LLM will “choke” as it is trying to generate structured data if you send it a really complex schema 🙂
But if you retry the prompts that failed, you will eventually get syntactically valid JSON for all 100 inputs.
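A rough sketch of this measurement is below. It assumes a hypothetical `call_model(model_id, text)` helper, like the request in step 1, that returns the raw response string or raises on an API error; it records how often a model produces parseable JSON on the first try, while retries ensure every input eventually succeeds.

```python
import json
import time


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON at all."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def measure_syntactic_validity(model_id, inputs, call_model, max_retries=3):
    """Send every input to the model and report the first-try JSON success rate.

    `call_model(model_id, text)` is an assumed helper that returns the raw
    response string (or raises on an API error), as in the step 1 sketch.
    """
    first_try_ok = 0
    for text in inputs:
        for attempt in range(max_retries):
            try:
                raw = call_model(model_id, text)
            except Exception:
                raw = None  # API error: the model "choked" on the schema
            if raw is not None and is_valid_json(raw):
                if attempt == 0:
                    first_try_ok += 1
                break
            time.sleep(1)  # simple backoff before retrying
    return first_try_ok / len(inputs)


# e.g. rate = measure_syntactic_validity("openai/gpt-4o-mini", docs, call_model)
```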
3 Measure semantic validity
Once you have syntactically valid JSON, the next step is to verify semantic validity – how well the JSON follows the Pydantic schema.
I was a bit surprised to find how often even the expensive LLMs fail schema validation, especially when you are extracting a complex schema (like this one).
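Pydantic does this check for you. A small sketch, reusing the illustrative `Invoice` model from step 1: it reports what fraction of the syntactically valid responses also satisfy the schema, and keeps the validation errors so you can see which fields fail most often.

```python
from pydantic import ValidationError


def measure_semantic_validity(responses, schema_cls):
    """Fraction of JSON responses that also validate against the Pydantic model.

    `responses` is the list of syntactically valid JSON strings collected in
    step 2; `schema_cls` is the Pydantic model, e.g. Invoice.
    """
    valid = 0
    errors = []
    for raw in responses:
        try:
            schema_cls.model_validate_json(raw)
            valid += 1
        except ValidationError as exc:
            errors.append(exc)  # the errors show which fields fail most often
    return valid / len(responses), errors


# e.g. rate, errors = measure_semantic_validity(json_responses, Invoice)
```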
4 Measure accuracy for a single field
How can you measure the accuracy of the LLM response? In other words, how do you know if the answer is correct without manually verifying every result?
One option is to send the same request to multiple LLMs and take a majority vote. If there is no majority agreement, manually inspect the responses; this leaves you with a much smaller manual workload. I explain this process in my Measuring LLM Accuracy for Structured Outputs course.
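One way to sketch the vote for a single field (say, `invoice_number`), assuming you already have validated extractions from several models; the model ids and values here are purely illustrative.

```python
from collections import Counter


def majority_vote(values):
    """Return (winning_value, is_majority) for one field across several models."""
    counts = Counter(values)
    value, votes = counts.most_common(1)[0]
    return value, votes > len(values) / 2


# Per-model extractions of one field for one document (illustrative values).
extractions = {
    "model-a": "INV-1042",
    "model-b": "INV-1042",
    "model-c": "1042",
}

winner, agreed = majority_vote(list(extractions.values()))
if agreed:
    print("Majority answer:", winner)
else:
    print("No majority: flag this document for manual inspection")
```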
5 Engineer Prompts for better results
Once you have all the building blocks in place, you can start engineering your prompts to improve each step of this process.
In my Prompt Engineering for Structured Outputs course, I give some ideas and tips on how to do this (the course will be completed by the end of Sep 2025).
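One simple example of this kind of tuning (my own illustration, not necessarily what the course covers): Pydantic `Field` descriptions are embedded in the JSON schema the model receives, so tightening them is effectively a prompt change that travels with the schema.

```python
from pydantic import BaseModel, Field


# Illustrative: sharper field descriptions act as per-field instructions.
class Invoice(BaseModel):
    vendor: str = Field(description="Legal name of the issuing company, without any address")
    invoice_number: str = Field(description="Identifier exactly as printed on the document")
    total: float = Field(description="Grand total including tax, as a plain number")


# The descriptions are part of the schema, so the model sees them too.
print(Invoice.model_json_schema()["properties"]["total"]["description"])
```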
6 Consider the tradeoffs
Once you have run a few experiments with multiple LLMs on your dataset, you can pick the best LLM: the one that offers the best mix of cost and accuracy (depending on your use case, you may also want to factor in latency).
Often, you will be able to lower the cost by writing some Python code on your end to minimize retries.
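A toy sketch of that comparison, with made-up numbers purely for illustration: keep the models that clear your accuracy bar, then pick the cheapest of those (or fold latency into the score if it matters for your use case).

```python
# Results from your own experiments -- every number here is a placeholder.
results = [
    {"model": "model-a", "accuracy": 0.97, "cost_per_1k_docs": 4.20, "p50_latency_s": 3.1},
    {"model": "model-b", "accuracy": 0.95, "cost_per_1k_docs": 0.80, "p50_latency_s": 1.4},
    {"model": "model-c", "accuracy": 0.88, "cost_per_1k_docs": 0.15, "p50_latency_s": 0.9},
]

ACCURACY_FLOOR = 0.94  # whatever your use case can tolerate

# Keep only the models that meet the accuracy floor, then take the cheapest.
candidates = [r for r in results if r["accuracy"] >= ACCURACY_FLOOR]
best = min(candidates, key=lambda r: r["cost_per_1k_docs"])
print("Best tradeoff:", best["model"])
```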
Text Centric Data Science for the LLM Era
Coming soon (Oct 2025)




