LLM Accuracy Comparisons for Structured Outputs

All lesson notes associated with the LLM Accuracy Comparisons for Structured Outputs course

Gemini Pro 2.5 vs GPT 5 (Full) JSON metrics

A comparison Demo

Read More Gemini Pro 2.5 vs GPT 5 (Full) JSON metrics
Compare accuracy for each field

Use the system defined in the previous chapter for a single field Evaluate and compare the accuracy for each field (which is comparable) Use Citations and Explanation for the adjudication

Read More Compare accuracy for each field
JSON Metrics

We use four metrics related to the JSON responses coming back from LLMs api_error If response_full.json() throws an error, that is considered an api_error is_pure_json If json.loads(inner_response_text) does not throw any error, then is_pure_json is true contains_valid_json Sometimes you have inner_response_text which looks like this: This would be perfectly valid JSON once you remove the…

Read More JSON Metrics
Send 100 requests to the LLM on OpenRouter

This is the method used in the code (which is called 100 times) Later in the code, I will be saving all this information into a JSON file. Make note of the different variables used in this code snippet because we will be revisiting them over the next few lessons.

Read More Send 100 requests to the LLM on OpenRouter
Use the gold dataset to calculate accuracy for all LLMs

Read More Use the gold dataset to calculate accuracy for all LLMs
Use DataBlist to generate the gold dataset

Read More Use DataBlist to generate the gold dataset
Consolidate multiple results into a single CSV file

Read More Consolidate multiple results into a single CSV file
Run the same experiment using four LLMs

Read More Run the same experiment using four LLMs
Measuring schema compliance using Structured Output Percentage Stats

What the stats look like inside Marimo

Read More Measuring schema compliance using Structured Output Percentage Stats
Send 100 requests to an LLM using OpenRouter

Read More Send 100 requests to an LLM using OpenRouter
Why the empty NUMDAYS is a good test case

Read More Why the empty NUMDAYS is a good test case
Calculating an upper limit for numdays

Read More Calculating an upper limit for numdays