LLM eval: A case study using multiple Gemini models

In my Udemy course, I build a system that uses four different Gemini models to demonstrate LLM eval, where LLMs help judge whether or not an answer is correct.

Basic idea:

  • Google's Gemini family now has more than four fairly capable models
  • Ask four LLMs to do the same data extraction task
  • If there is a majority vote (3 or 4 models agree), use that answer as the correct one (see the voting sketch after this list)
  • For the rest, manually select the correct answer using the LLM Eval tool
  • Save everything into a gold dataset
  • Use the gold dataset to calculate the accuracy of each LLM (a scoring sketch also follows below)
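
To make the voting step concrete, here is a minimal sketch of the majority-vote logic. The `ask_gemini` helper, the model names, and the record structure are hypothetical placeholders rather than the course code; the point is the `Counter`-based vote and the split between auto-labelled items and items queued for manual review.

```python
from collections import Counter

# Hypothetical model names -- substitute whichever four Gemini models you use.
MODELS = ["gemini-model-a", "gemini-model-b", "gemini-model-c", "gemini-model-d"]


def ask_gemini(model_name: str, prompt: str) -> str:
    """Placeholder: call the Gemini API with `model_name` and return the extracted value."""
    raise NotImplementedError


def label_with_majority_vote(prompts: list[str]) -> tuple[list[dict], list[dict]]:
    """Run every prompt through all four models and split results by agreement."""
    gold, needs_review = [], []
    for prompt in prompts:
        answers = {m: ask_gemini(m, prompt) for m in MODELS}
        top_answer, votes = Counter(answers.values()).most_common(1)[0]
        record = {"prompt": prompt, "answers": answers}
        if votes >= 3:          # majority: 3 or 4 models agree
            record["gold"] = top_answer
            gold.append(record)
        else:                   # no majority: send to manual review
            needs_review.append(record)
    return gold, needs_review
```

The items in `needs_review` are the ones you adjudicate by hand before adding them to the gold dataset.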

This system can be extended to any dataset and to any LLM that you want to benchmark.
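
Once the gold dataset exists, scoring each model is just a per-model comparison against the gold answers. Again a sketch, assuming the same record structure and `MODELS` list as above and a simple exact-match comparison:

```python
from collections import Counter


def accuracy_per_model(gold_records: list[dict]) -> dict[str, float]:
    """Fraction of gold items each model answered correctly (exact match assumed)."""
    correct = Counter()
    for rec in gold_records:
        for model, answer in rec["answers"].items():
            if answer == rec["gold"]:
                correct[model] += 1
    total = len(gold_records)
    return {model: correct[model] / total for model in MODELS} if total else {}
```

Because nothing in the scoring depends on Gemini specifically, swapping in a different dataset or a different set of models only changes the extraction helper and the `MODELS` list.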
