I created a custom LLM eval tool for my example, but you can build a similar one that best fits the dataset you are interested in.
This is what my custom eval tool looks like for VAERS reports:
The eval tool helps you quickly provide the correct value for cases where there is no majority agreement.
Often, this substantially reduces the effort required to create the gold dataset.
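
To make the idea concrete, here is a minimal sketch of how the cases that need manual review might be picked out. It assumes each report has one label per model and that "majority" means more than half of the votes. The function names, report IDs, and label values are hypothetical and are not taken from my actual tool.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by more than half of the votes, or None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def split_for_review(rows):
    """Separate rows that have a majority label from rows needing manual review."""
    resolved, needs_review = {}, []
    for row_id, labels in rows.items():
        winner = majority_label(labels)
        if winner is None:
            needs_review.append(row_id)  # no majority: a human decides
        else:
            resolved[row_id] = winner    # majority label is accepted as-is
    return resolved, needs_review


# Hypothetical labels from three models for three reports.
rows = {
    "report_1": ["yes", "yes", "no"],
    "report_2": ["yes", "no", "unknown"],
    "report_3": ["no", "no", "no"],
}
resolved, needs_review = split_for_review(rows)
```

Only the reports in `needs_review` have to go through the eval tool; everything in `resolved` already has a value, which is where the reduction in effort comes from.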