GPT-5 vs Gemini Pro 2.5 for Pydantic structured output using OpenRouter
I tested both GPT-5 and Gemini Pro 2.5 by sending 100 requests each, extracting information from clinical narratives into a very complex Pydantic schema.
To measure how well they adhered to the Pydantic schema, I used the following metrics:
- is_pure_json
- contains_valid_json
- is_valid_schema
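For concreteness, here is a minimal sketch of how these three checks could be implemented. The ClinicalExtraction model below is a hypothetical, heavily simplified stand-in for the actual (much more complex) schema, and the helper names are mine, not taken from the course workflow:

```python
import json
from pydantic import BaseModel, ValidationError

# Hypothetical, heavily simplified stand-in for the real clinical schema.
class ClinicalExtraction(BaseModel):
    diagnosis: str
    medications: list[str]
    explanation: str

def is_pure_json(text: str) -> bool:
    """The whole response parses as JSON, with no surrounding prose or code fences."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def extract_json_substring(text: str) -> str | None:
    """Return the outermost {...} span if it parses as JSON (contains_valid_json), else None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]
    try:
        json.loads(candidate)
        return candidate
    except json.JSONDecodeError:
        return None

def is_valid_schema(text: str) -> bool:
    """The extracted JSON validates against the Pydantic model."""
    candidate = extract_json_substring(text)
    if candidate is None:
        return False
    try:
        ClinicalExtraction.model_validate_json(candidate)
        return True
    except ValidationError:
        return False
```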
Sometimes an LLM will not even produce valid JSON as a substring of the response; when that happened I retried that particular clinical narrative until I eventually got contains_valid_json = True for all 100 responses.
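The retry logic itself can be as simple as looping until a parseable JSON substring shows up. The sketch below reuses extract_json_substring from above; the OpenRouter base URL is real, but the API key placeholder and the model slugs (e.g. "openai/gpt-5", "google/gemini-2.5-pro") are assumptions you should verify against the current OpenRouter catalogue:

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

def extract_with_retries(narrative: str, model: str, max_retries: int = 5) -> tuple[str, int]:
    """Call the model until the response contains a valid JSON substring; return (response_text, retries)."""
    retries = 0
    while True:
        response = client.chat.completions.create(
            model=model,  # e.g. "openai/gpt-5" or "google/gemini-2.5-pro" (assumed slugs)
            messages=[
                {"role": "system", "content": "Return ONLY a JSON object matching the ClinicalExtraction schema."},
                {"role": "user", "content": narrative},
            ],
        )
        text = response.choices[0].message.content or ""
        if extract_json_substring(text) is not None:
            return text, retries
        retries += 1
        if retries >= max_retries:
            raise RuntimeError(f"No valid JSON from {model} after {max_retries} retries")
```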
These are the results (metric counts are out of 100 responses):
| Metric | GPT-5 | Gemini Pro 2.5 |
|---|---|---|
| Number of retries | 24 | 4 |
| is_pure_json | 100 | 0 |
| contains_valid_json | 100 | 100 |
| is_valid_schema | 99 | 28 |
I discuss the workflow I used (for measuring all these values) in my Prompt Engineering for Structured Outputs course.
Which one should you use?
There are actually a few more tradeoffs to consider.
GPT-5 needs more retries
GPT-5 needed a total of 24 retries, compared to just 4 for Gemini Pro 2.5 (on OpenRouter), to get valid JSON for all 100 narratives. I think this happens because GPT-5 tries very hard behind the scenes to produce perfectly schema-compliant output, which becomes quite a hard task for a sufficiently complex schema.
And you can of course make your Pydantic schema as complex as you want, bounded only by what you (and your future self) can still understand.
This is why I think there is still a long way to go for these LLMs to become truly impressive (in my book).
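To give a sense of how quickly that complexity compounds, here is a hypothetical, richer version of the stand-in model from earlier; every nested model and optional field is one more thing the LLM has to get exactly right. This is illustration only, not the schema used in the experiment:

```python
from pydantic import BaseModel, Field

# Hypothetical illustration; the schema used in the experiment is considerably larger.
class Medication(BaseModel):
    name: str
    dose_mg: float | None = None
    frequency: str | None = None

class Diagnosis(BaseModel):
    condition: str
    icd10_code: str | None = None
    explanation: str = Field(description="Why this diagnosis was extracted from the narrative")

class ClinicalExtraction(BaseModel):
    diagnoses: list[Diagnosis]
    medications: list[Medication]
    follow_up_required: bool
```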
Gemini Pro 2.5 is faster
The first thing to note is that Gemini Pro 2.5 is quite a bit faster than GPT-5 when it comes to extracting structured data.
Some stats for the 100 requests:
| Metric | GPT-5 | Gemini Pro 2.5 |
|---|---|---|
| Median elapsed time (s) | 288 | 79 |
| Mean elapsed time (s) | 284 | 83 |
| Minimum elapsed time (s) | 165 | 55 |
| Maximum elapsed time (s) | 419 | 141 |
This means if you are very concerned about the response latency of structured outputs, Gemini Pro 2.5 is a better choice.
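If you want to collect the same latency stats for your own schema, a minimal sketch (reusing the hypothetical extract_with_retries helper from earlier) might look like this:

```python
import statistics
import time

def latency_stats(narratives: list[str], model: str) -> dict[str, float]:
    """Time each extraction request and summarise the elapsed-time distribution in seconds."""
    elapsed = []
    for narrative in narratives:
        start = time.perf_counter()
        extract_with_retries(narrative, model)
        elapsed.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(elapsed),
        "mean_s": statistics.mean(elapsed),
        "min_s": min(elapsed),
        "max_s": max(elapsed),
    }
```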
Gemini Pro 2.5 is less verbose
Although you do want your LLM response to be clear (especially the Explanation field), you do not want the LLM to be too wasteful in terms of output tokens.
| Metric | GPT-5 | Gemini Pro 2.5 |
|---|---|---|
| Completion tokens (median) | 13928 | 9229 |
| Completion tokens (mean) | 13626 | 9800 |
| Completion tokens (min) | 7382 | 5882 |
| Completion tokens (max) | 18078 | 17017 |
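Completion token counts come straight from the usage block that OpenRouter returns with each response (the same shape as the OpenAI SDK's usage object), so collecting these stats is just a matter of recording them per request. A small hypothetical helper, reusing the client from earlier:

```python
def completion_tokens(narrative: str, model: str) -> int:
    """Return the completion token count reported for a single extraction request."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": narrative}],
    )
    return response.usage.completion_tokens
```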