Generating the comparison dataset

For each LLM, you need 100 valid API responses

Every response must contain valid JSON

  • they do not have to be pure JSON, but they should at least contain valid JSON
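
A minimal sketch of one way to check that a response "contains valid JSON" even when it is wrapped in prose or code fences. The function name `extract_json` is hypothetical, and restricting the search to JSON objects and arrays (rather than any bare number or string) is an assumption, not something stated above.

```python
import json


def extract_json(text):
    """Return the first JSON object or array embedded in text, or None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        # Assumption: only objects and arrays count as "containing JSON".
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at index i and
                # ignores whatever trailing text follows it.
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; try the next candidate
    return None
```

A response such as `Here is the report: {"score": 3}` would pass this check, while a response with no parseable object or array would return `None`.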

Retry the request until each of the 100 reports contains valid JSON

  • The number of retries required to ensure this is itself a measure of the LLM’s quality
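
The retry procedure could be sketched roughly as follows. Everything named here is an assumption for illustration: `call_llm` is a hypothetical zero-argument callable that performs one API call and returns the raw response text, `contains_json` is a simple stand-in validity check, and the `max_attempts` cap is added only so the loop cannot run forever.

```python
import json


def contains_json(text):
    """True if text holds at least one parseable JSON object or array."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                decoder.raw_decode(text, i)
                return True
            except json.JSONDecodeError:
                pass
    return False


def collect_reports(call_llm, n_reports=100, max_attempts=20):
    """Collect n_reports responses containing JSON and count the retries.

    call_llm is a hypothetical callable returning one raw response string.
    max_attempts is an assumed safety cap per report.
    """
    reports = []
    total_retries = 0
    for _ in range(n_reports):
        for attempt in range(1, max_attempts + 1):
            text = call_llm()
            if contains_json(text):
                reports.append(text)
                total_retries += attempt - 1  # attempts beyond the first
                break
        else:
            raise RuntimeError(f"no valid JSON after {max_attempts} attempts")
    return reports, total_retries
```

Calling `collect_reports` once per LLM yields both the 100 reports and a retry count that can be compared across models.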