Generating the comparison dataset
For each LLM, you need 100 valid API responses
Every response must contain valid JSON
- The response does not have to be pure JSON (it may include surrounding text), but it must contain at least one valid JSON value
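
One way to check the "contains valid JSON" condition is sketched below. It is a minimal illustration, not a prescribed method: the function name `extract_json` is hypothetical, and it assumes that the JSON of interest is an object or an array (so a bare number or string inside the text does not count). It scans for each `{` or `[` and tries to decode a JSON value starting there, using the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Any, Optional

_DECODER = json.JSONDecoder()


def extract_json(text: str) -> Optional[Any]:
    """Return the first JSON object or array embedded in `text`, or None.

    Assumption: only objects and arrays count as "valid JSON" here.
    """
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it in the string.
            value, _end = _DECODER.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid from this position, try the next candidate
        return value
    return None


if __name__ == "__main__":
    reply = 'Here is the result:\n{"score": 7, "tags": ["a", "b"]}\nHope it helps.'
    print(extract_json(reply))
```

A response passes the check when `extract_json(response) is not None`. Comparing against `None` rather than testing truthiness matters, because an empty object `{}` is valid JSON but falsy in Python.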
Retry until each of the 100 responses contains valid JSON
- The number of retries needed to reach 100 valid responses is itself a measure of the LLM’s quality
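
The retry procedure above could be sketched roughly as follows. Everything named here is an assumption made for illustration: `call_llm` stands in for whatever function actually queries the model and returns its raw text, `contains_valid_json` is one possible validity check (objects and arrays only), and `max_attempts` is a safety cap that the notes do not mention.

```python
import json
from typing import Callable, List, Tuple


def contains_valid_json(text: str) -> bool:
    """True if `text` contains at least one JSON object or array."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch in "{[":
            try:
                decoder.raw_decode(text, start)
                return True
            except json.JSONDecodeError:
                pass  # keep scanning for another candidate start
    return False


def collect_valid_responses(
    call_llm: Callable[[], str],   # hypothetical: returns one raw response
    target: int = 100,
    max_attempts: int = 1000,      # assumed safety cap, not from the notes
) -> Tuple[List[str], int]:
    """Call the model until `target` responses contain valid JSON.

    Returns the valid responses and the number of retries, i.e. the
    number of responses discarded because they had no valid JSON.
    """
    responses: List[str] = []
    retries = 0
    attempts = 0
    while len(responses) < target:
        if attempts >= max_attempts:
            raise RuntimeError(
                f"only {len(responses)} valid responses after {attempts} attempts"
            )
        attempts += 1
        text = call_llm()
        if contains_valid_json(text):
            responses.append(text)
        else:
            retries += 1
    return responses, retries
```

Running `collect_valid_responses` once per LLM yields both the 100-response dataset and a retry count that can be compared across models.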