GPT-5 vs Gemini Pro 2.5 for Pydantic structured output using OpenRouter
I tested both GPT-5 and Gemini Pro 2.5 by sending 100 requests each, extracting information from clinical narratives into a very complex Pydantic schema.
To measure how well they adhered to the Pydantic schema, I used the following metrics:
- is_pure_json
- contains_valid_json
- is_valid_schema
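For concreteness, here is a minimal sketch of how these three checks could be implemented. The ClinicalExtraction model below is a hypothetical, heavily simplified stand-in for the actual (much more complex) schema, and the helper names are mine, not taken from the course workflow:

```python
import json
from pydantic import BaseModel, ValidationError

# Hypothetical, heavily simplified stand-in for the real clinical schema.
class ClinicalExtraction(BaseModel):
    diagnosis: str
    medications: list[str]
    explanation: str

def is_pure_json(text: str) -> bool:
    """The whole response parses as JSON, with no surrounding prose or code fences."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def extract_json_substring(text: str) -> str | None:
    """Return the outermost {...} span if it parses as JSON (contains_valid_json), else None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]
    try:
        json.loads(candidate)
        return candidate
    except json.JSONDecodeError:
        return None

def is_valid_schema(text: str) -> bool:
    """The extracted JSON validates against the Pydantic model."""
    candidate = extract_json_substring(text)
    if candidate is None:
        return False
    try:
        ClinicalExtraction.model_validate_json(candidate)
        return True
    except ValidationError:
        return False
```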
Sometimes an LLM will not even produce valid JSON as a substring of the response; when that happened I retried that particular clinical narrative until I eventually got contains_valid_json = True for all 100 responses.
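The retry logic itself can be as simple as looping until a parseable JSON substring shows up. The sketch below reuses extract_json_substring from above; the OpenRouter base URL is real, but the API key placeholder and the model slugs (e.g. "openai/gpt-5", "google/gemini-2.5-pro") are assumptions you should verify against the current OpenRouter catalogue:

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

def extract_with_retries(narrative: str, model: str, max_retries: int = 5) -> tuple[str, int]:
    """Call the model until the response contains a valid JSON substring; return (response_text, retries)."""
    retries = 0
    while True:
        response = client.chat.completions.create(
            model=model,  # e.g. "openai/gpt-5" or "google/gemini-2.5-pro" (assumed slugs)
            messages=[
                {"role": "system", "content": "Return ONLY a JSON object matching the ClinicalExtraction schema."},
                {"role": "user", "content": narrative},
            ],
        )
        text = response.choices[0].message.content or ""
        if extract_json_substring(text) is not None:
            return text, retries
        retries += 1
        if retries >= max_retries:
            raise RuntimeError(f"No valid JSON from {model} after {max_retries} retries")
```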
These are the results (metric counts are out of 100 responses):
| Metric | GPT-5 | Gemini Pro 2.5 |
|---|---|---|
| Number of retries | 24 | 4 |
| is_pure_json | 100 | 0 |
| contains_valid_json | 100 | 100 |
| is_valid_schema | 99 | 28 |
I discuss the workflow I used (for measuring all these values) in my Prompt Engineering for Structured Outputs course.
Which one should you use?
There are actually a few more tradeoffs to consider.
GPT-5 needs more retries
GPT-5 needed a total of 24 retries, compared to just 4 for Gemini Pro 2.5 (on OpenRouter), to get valid JSON for all 100 narratives. I think this happens because GPT-5 tries very hard behind the scenes to produce perfectly schema-compliant output, which becomes quite a hard task for a sufficiently complex schema.
And you can of course make your Pydantic schema as complex as you want, bounded only by what you (and your future self) can still understand.
This is why I think there is still a long way to go for these LLMs to become truly impressive (in my book).
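To give a sense of how quickly that complexity compounds, here is a hypothetical, richer version of the stand-in model from earlier; every nested model and optional field is one more thing the LLM has to get exactly right. This is illustration only, not the schema used in the experiment:

```python
from pydantic import BaseModel, Field

# Hypothetical illustration; the schema used in the experiment is considerably larger.
class Medication(BaseModel):
    name: str
    dose_mg: float | None = None
    frequency: str | None = None

class Diagnosis(BaseModel):
    condition: str
    icd10_code: str | None = None
    explanation: str = Field(description="Why this diagnosis was extracted from the narrative")

class ClinicalExtraction(BaseModel):
    diagnoses: list[Diagnosis]
    medications: list[Medication]
    follow_up_required: bool
```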
Gemini Pro 2.5 is faster
The first thing to note is that Gemini Pro 2.5 is quite a bit faster than GPT-5 when it comes to extracting structured data.
Some stats for the 100 requests:
| Metric | GPT-5 | Gemini Pro 2.5 |
|---|---|---|
| Median elapsed time (s) | 288 | 79 |
| Mean elapsed time (s) | 284 | 83 |
| Minimum elapsed time (s) | 165 | 55 |
| Maximum elapsed time (s) | 419 | 141 |
This means if you are very concerned about the response latency of structured outputs, Gemini Pro 2.5 is a better choice.
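If you want to collect the same latency stats for your own schema, a minimal sketch (reusing the hypothetical extract_with_retries helper from earlier) might look like this:

```python
import statistics
import time

def latency_stats(narratives: list[str], model: str) -> dict[str, float]:
    """Time each extraction request and summarise the elapsed-time distribution in seconds."""
    elapsed = []
    for narrative in narratives:
        start = time.perf_counter()
        extract_with_retries(narrative, model)
        elapsed.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(elapsed),
        "mean_s": statistics.mean(elapsed),
        "min_s": min(elapsed),
        "max_s": max(elapsed),
    }
```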
Gemini Pro 2.5 is less verbose
Although you do want your LLM response to be clear (especially the Explanation field), you do not want the LLM to be too wasteful in terms of output tokens.
| Metric | GPT-5 | Gemini Pro 2.5 |
|---|---|---|
| Completion tokens (median) | 13928 | 9229 |
| Completion tokens (mean) | 13626 | 9800 |
| Completion tokens (min) | 7382 | 5882 |
| Completion tokens (max) | 18078 | 17017 |
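Completion token counts come straight from the usage block that OpenRouter returns with each response (the same shape as the OpenAI SDK's usage object), so collecting these stats is just a matter of recording them per request. A small hypothetical helper, reusing the client from earlier:

```python
def completion_tokens(narrative: str, model: str) -> int:
    """Return the completion token count reported for a single extraction request."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": narrative}],
    )
    return response.usage.completion_tokens
```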