LLM eval: A case study using multiple Gemini models
In my Udemy course, I build a system that uses four different Gemini models to demonstrate LLM eval, in which an LLM acts as the judge of whether an answer is correct.

Basic idea:
- The Gemini family now includes more than four reasonably capable LLMs
- Ask four of them to perform the same data extraction task
- If there is a majority vote (3 or 4 models agree), use that answer as the correct one (see the sketch after this list)
- For the rest, manually select the correct answer using the LLM Eval tool
- Save everything into a gold dataset
- Use the gold dataset to calculate the accuracy of each LLM
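
A minimal sketch of the voting step, assuming the extraction answers have already been collected into a dict keyed by record id and model. The record layout, model names, and 3-of-4 threshold shown here are illustrative assumptions, not the course's actual code:

```python
from collections import Counter

# Assumed input structure (illustrative): one extracted answer per model, per record.
extracted = {
    "rec-001": {"model-a": "42", "model-b": "42", "model-c": "42", "model-d": "41"},
    "rec-002": {"model-a": "7",  "model-b": "9",  "model-c": "7",  "model-d": "9"},
}

def majority_answer(answers_by_model, threshold=3):
    """Return the answer that at least `threshold` models agree on, else None."""
    answer, votes = Counter(answers_by_model.values()).most_common(1)[0]
    return answer if votes >= threshold else None

gold = []          # records with an agreed (or later manually chosen) correct answer
needs_review = []  # records to adjudicate by hand in the LLM Eval tool

for record_id, answers in extracted.items():
    agreed = majority_answer(answers)
    if agreed is not None:
        gold.append({"id": record_id, "answer": agreed, "by_model": answers})
    else:
        needs_review.append({"id": record_id, "by_model": answers})

print(len(gold), "auto-labelled;", len(needs_review), "need manual review")
```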
This system can be extended to any dataset and to any LLM that you want to benchmark.
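
Once the gold dataset exists (auto-labelled records plus the manually adjudicated ones merged back in), per-model accuracy is a direct comparison against the gold answer. A sketch under the same assumed record layout as above:

```python
def accuracy_per_model(gold_records):
    """Fraction of gold records where each model's answer matches the gold answer."""
    correct, total = {}, {}
    for rec in gold_records:
        for model, answer in rec["by_model"].items():
            total[model] = total.get(model, 0) + 1
            correct[model] = correct.get(model, 0) + (answer == rec["answer"])
    return {model: correct[model] / total[model] for model in total}

# Using the `gold` list built in the previous sketch:
for model, acc in sorted(accuracy_per_model(gold).items()):
    print(f"{model}: {acc:.1%}")
```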