LLM eval: A case study using multiple Gemini models
In my Udemy course, I build a system that uses four different Gemini models to demonstrate LLM eval, in which an LLM acts as the judge of whether an answer is correct.

Basic idea:
- The Gemini family now includes more than four reasonably capable LLMs
- Ask four of them to perform the same data extraction task
- If there is a majority vote (3 or 4 models agree), use that answer as the correct one (see the sketch after this list)
- For the rest, manually select the correct answer using the LLM Eval tool
- Save everything into a gold dataset
- Use the gold dataset to calculate the accuracy of each LLM
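
A minimal sketch of the voting step, assuming the extraction answers have already been collected into a dict keyed by record id and model. The record layout, model names, and 3-of-4 threshold shown here are illustrative assumptions, not the course's actual code:

```python
from collections import Counter

# Assumed input structure (illustrative): one extracted answer per model, per record.
extracted = {
    "rec-001": {"model-a": "42", "model-b": "42", "model-c": "42", "model-d": "41"},
    "rec-002": {"model-a": "7",  "model-b": "9",  "model-c": "7",  "model-d": "9"},
}

def majority_answer(answers_by_model, threshold=3):
    """Return the answer that at least `threshold` models agree on, else None."""
    answer, votes = Counter(answers_by_model.values()).most_common(1)[0]
    return answer if votes >= threshold else None

gold = []          # records with an agreed (or later manually chosen) correct answer
needs_review = []  # records to adjudicate by hand in the LLM Eval tool

for record_id, answers in extracted.items():
    agreed = majority_answer(answers)
    if agreed is not None:
        gold.append({"id": record_id, "answer": agreed, "by_model": answers})
    else:
        needs_review.append({"id": record_id, "by_model": answers})

print(len(gold), "auto-labelled;", len(needs_review), "need manual review")
```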
This system can be extended to any dataset and to any LLM that you want to benchmark.
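
Once the gold dataset exists (auto-labelled records plus the manually adjudicated ones merged back in), per-model accuracy is a direct comparison against the gold answer. A sketch under the same assumed record layout as above:

```python
def accuracy_per_model(gold_records):
    """Fraction of gold records where each model's answer matches the gold answer."""
    correct, total = {}, {}
    for rec in gold_records:
        for model, answer in rec["by_model"].items():
            total[model] = total.get(model, 0) + 1
            correct[model] = correct.get(model, 0) + (answer == rec["answer"])
    return {model: correct[model] / total[model] for model in total}

# Using the `gold` list built in the previous sketch:
for model, acc in sorted(accuracy_per_model(gold).items()):
    print(f"{model}: {acc:.1%}")
```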