How I compare the Pydantic schema adherence of multiple LLMs on OpenRouter

This is the process I use to compare the Pydantic schema adherence of multiple LLMs on OpenRouter.

I first send 100 requests to the LLM. These are async requests, sent 5 at a time.
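A minimal sketch of this first step, assuming `asyncio` with a semaphore as the concurrency limiter. `fake_llm_call` and `run_batch` are hypothetical names, and the fake call only stands in for a real OpenRouter request so the example runs without network access.

```python
import asyncio

NUM_REQUESTS = 100     # from the text: 100 requests per model
MAX_CONCURRENCY = 5    # from the text: 5 requests in flight at a time


async def fake_llm_call(request_id: int) -> str:
    """Hypothetical stand-in for a real OpenRouter request."""
    await asyncio.sleep(0.001)  # simulate network latency
    return f'{{"request_id": {request_id}}}'


async def run_batch(request_ids, call=fake_llm_call, max_concurrency=MAX_CONCURRENCY):
    """Run all requests, never allowing more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0  # highest number of simultaneous calls observed

    async def worker(request_id):
        nonlocal in_flight, peak
        async with semaphore:  # blocks while max_concurrency calls are active
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await call(request_id)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(i) for i in request_ids))
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_batch(range(NUM_REQUESTS)))
    print(len(results), "responses, peak concurrency", peak)
```

The `peak` counter is only there to confirm the limit holds. In a real run the fake call would be replaced by whatever HTTP client is used to reach OpenRouter.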

Most of the responses contain valid JSON, but some of them don’t.

Some of the requests also throw an api_error, in which case contains_valid_json is automatically set to False (in other words, the response is recorded as not containing valid JSON).
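One way to record these outcomes is sketched below. The field names `api_error` and `contains_valid_json` come from the text. The `ResponseRecord` class and the `make_record` helper are assumptions for illustration, and the check uses the standard library's `json.loads`, so it only tests JSON syntax. If the check should also cover the schema, a Pydantic v2 call such as `Model.model_validate_json(text)` could be swapped in at the same spot.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseRecord:
    """Hypothetical per-request record; field names follow the text."""
    request_id: int
    raw_text: Optional[str] = None
    api_error: Optional[str] = None
    contains_valid_json: bool = False


def make_record(request_id, raw_text=None, api_error=None):
    # An API error means there is no usable body, so the flag is False.
    if api_error is not None or raw_text is None:
        return ResponseRecord(request_id, raw_text, api_error, False)
    try:
        json.loads(raw_text)  # syntax check only, not schema validation
    except json.JSONDecodeError:
        return ResponseRecord(request_id, raw_text, None, False)
    return ResponseRecord(request_id, raw_text, None, True)


if __name__ == "__main__":
    print(make_record(1, raw_text='{"a": 1}'))
    print(make_record(2, raw_text="not json"))
    print(make_record(3, api_error="timeout"))
```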

I retry every request that doesn’t have valid JSON, which automatically includes those that threw an API error during the previous batch.

In every retry batch, most of the responses end up with valid JSON, so each batch is smaller than the one before it.

I keep retrying until all responses have valid JSON.
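The retry loop described above can be sketched as follows. `flaky_llm_call` is a hypothetical stand-in that fails deterministically on early attempts, so the shrinking batch sizes are easy to see. The `max_rounds` cap is my own safety assumption and not part of the process as described, which simply retries until everything is valid.

```python
import asyncio
import json


async def flaky_llm_call(request_id: int, attempt: int) -> str:
    """Hypothetical stand-in for an LLM request that sometimes fails."""
    await asyncio.sleep(0)
    if attempt == 0 and request_id % 4 == 0:
        return "{not valid json"                   # malformed body
    if attempt == 1 and request_id % 20 == 0:
        raise RuntimeError("simulated API error")  # request-level failure
    return json.dumps({"request_id": request_id})


def is_valid_json(text) -> bool:
    try:
        json.loads(text)
        return True
    except (TypeError, json.JSONDecodeError):
        return False


async def attempt_one(request_id, attempt, semaphore):
    async with semaphore:
        try:
            text = await flaky_llm_call(request_id, attempt)
        except Exception:
            return request_id, None  # an API error counts as no valid JSON
    return request_id, text if is_valid_json(text) else None


async def run_until_valid(request_ids, max_concurrency=5, max_rounds=10):
    semaphore = asyncio.Semaphore(max_concurrency)
    valid = {}                   # request_id -> valid JSON text
    pending = list(request_ids)  # requests still lacking valid JSON
    batch_sizes = []
    attempt = 0
    while pending and attempt < max_rounds:  # max_rounds: assumed safety cap
        batch_sizes.append(len(pending))
        outcomes = await asyncio.gather(
            *(attempt_one(i, attempt, semaphore) for i in pending)
        )
        for request_id, text in outcomes:
            if text is not None:
                valid[request_id] = text
        # Only requests without valid JSON go into the next batch.
        pending = [i for i in pending if i not in valid]
        attempt += 1
    return valid, batch_sizes, pending


if __name__ == "__main__":
    valid, batch_sizes, pending = asyncio.run(run_until_valid(range(100)))
    print(len(valid), "valid responses; batch sizes:", batch_sizes)
```

With this simulated failure pattern the batches shrink from 100 to 25 to 5. Real failure rates will differ per model, which is part of what the comparison is meant to show.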