Generating the comparison dataset

For each LLM, you need 100 valid API responses

Every response must contain valid JSON

Retry the request until each of the 100 reports contains valid JSON

The number of retries needed to reach 100 valid responses is itself a measure of the LLM’s quality
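
The retry-and-count procedure above could be sketched roughly as follows. This is a minimal illustration, not the author's actual harness: `collect_valid_responses`, `call_llm`, and `max_attempts` are hypothetical names, and `call_llm` stands in for whatever function sends one request to the LLM API and returns the raw response text.

```python
import json
from typing import Any, Callable, List, Tuple


def collect_valid_responses(
    call_llm: Callable[[], str],   # hypothetical: sends one request, returns raw text
    target: int = 100,             # number of valid responses wanted per LLM
    max_attempts: int = 1000,      # assumed safety cap so the loop cannot run forever
) -> Tuple[List[Any], int]:
    """Call the LLM until `target` responses parse as JSON.

    Returns the parsed responses and the number of retries, i.e. the
    number of responses discarded because they were not valid JSON.
    """
    responses: List[Any] = []
    retries = 0
    attempts = 0
    while len(responses) < target:
        if attempts >= max_attempts:
            raise RuntimeError(
                f"only {len(responses)} valid responses after {attempts} attempts"
            )
        attempts += 1
        raw = call_llm()
        try:
            responses.append(json.loads(raw))
        except json.JSONDecodeError:
            # Invalid JSON: discard this response and try again.
            retries += 1
    return responses, retries
```

Running this once per LLM gives both the 100 parsed responses and a retry count that can be compared across models. The cap on attempts is an added assumption, there to stop a model that never produces valid JSON from looping indefinitely.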