Why Google Gemini is the best choice for LLM evals (LLM as a judge)
In a recent article, Simon Willison wrote this
In my opinion, Google Gemini models make great choices for LLM evals:
- there are about 7 different Gemini models that are fairly decent in terms of quality
  - you only need 4 different LLMs for the system I recommend
- most of them are very cost efficient
  - on Tier 1 (which you reach by providing your credit card information), the rate limits are very generous
  - you will usually be able to generate the entire evaluation set for free
- they are now fairly accurate
  - there has been a major improvement in recent Gemini models
- Gemini supports structured data extraction from a pydantic BaseModel (see the sketch after this list)
  - the schema validation is done server side
  - this reduces code complexity
    - otherwise you might need to use the instructor library
  - this reduces cost
    - no need for retries (which instructor might perform)
  - this reduces latency
    - for the same reason: no need for retries
  - Note: Gemini can still occasionally fail the extraction and a retry may be needed, but that happens on the server side
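
To make the structured-extraction point concrete, here is a minimal sketch using the google-genai SDK. The `JudgeVerdict` schema, the prompt, and the model name are illustrative assumptions rather than part of my actual eval setup; the point is that the pydantic model is passed as `response_schema` and enforced on Gemini's side, so `response.parsed` comes back already validated.

```python
from google import genai
from pydantic import BaseModel


class JudgeVerdict(BaseModel):
    # hypothetical judge output: a 1-5 score plus a short justification
    score: int
    reasoning: str


# the client picks up GEMINI_API_KEY / GOOGLE_API_KEY from the environment
client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",  # illustrative; swap in whichever Gemini model you use
    contents="You are a judge. Rate the following answer from 1 to 5 and explain why: ...",
    config={
        "response_mime_type": "application/json",
        "response_schema": JudgeVerdict,  # pydantic BaseModel enforced server side
    },
)

verdict: JudgeVerdict = response.parsed  # already a validated JudgeVerdict instance
print(verdict.score, verdict.reasoning)
```

Because the schema is enforced on Gemini's side, there is no client-side validate-and-retry loop of the kind instructor performs, which is where the code, cost, and latency savings come from.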