Why Google Gemini is the best choice for LLM evals (LLM as a judge)
In a recent article, Simon Willison wrote this
In my opinion, Google Gemini models make great choices for LLM evals:
- there are about 7 different Gemini models that are fairly decent in terms of quality
  - you only need 4 different LLMs for the system I recommend
- most of them are very cost efficient
  - on Tier 1 (which you reach by providing your credit card information), the rate limits are very generous
  - you will usually be able to generate the entire evaluation set for free
- they are now fairly accurate
  - there has been a major improvement in recent Gemini models
- Gemini supports structured data extraction from a pydantic BaseModel (see the sketch after this list)
  - the schema validation is done server side
  - this reduces code complexity
    - otherwise you might need to use the instructor library
  - this reduces cost
    - no need for retries (which instructor might perform)
  - this reduces latency
    - for the same reason: no need for retries
  - Note: Gemini can still occasionally fail the extraction and a retry may be needed, but that happens on the server side
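
To make the structured-extraction point concrete, here is a minimal sketch using the google-genai SDK. The `JudgeVerdict` schema, the prompt, and the model name are illustrative assumptions rather than part of my actual eval setup; the point is that the pydantic model is passed as `response_schema` and enforced on Gemini's side, so `response.parsed` comes back already validated.

```python
from google import genai
from pydantic import BaseModel


class JudgeVerdict(BaseModel):
    # hypothetical judge output: a 1-5 score plus a short justification
    score: int
    reasoning: str


# the client picks up GEMINI_API_KEY / GOOGLE_API_KEY from the environment
client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",  # illustrative; swap in whichever Gemini model you use
    contents="You are a judge. Rate the following answer from 1 to 5 and explain why: ...",
    config={
        "response_mime_type": "application/json",
        "response_schema": JudgeVerdict,  # pydantic BaseModel enforced server side
    },
)

verdict: JudgeVerdict = response.parsed  # already a validated JudgeVerdict instance
print(verdict.score, verdict.reasoning)
```

Because the schema is enforced on Gemini's side, there is no client-side validate-and-retry loop of the kind instructor performs, which is where the code, cost, and latency savings come from.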