Use the gold dataset to calculate accuracy for all LLMs

  • Use Marimo to calculate the “diff” between the correct answer and each LLM’s answer (demo)
  • Calculate accuracy for each LLM
  • Save the diff to a CSV file as a permanent record for future reference
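
The three steps above can be sketched in plain Python, which could be pasted into a Marimo notebook cell. This is only a minimal illustration, not the demo's actual code. It assumes the gold dataset maps question IDs to correct answers, that each LLM's answers use the same question IDs, and that "correct" means a case-insensitive exact match. The names build_diff, accuracy_by_model, save_diff, and the sample data are all hypothetical.

```python
import csv

# Column names for the diff record (hypothetical layout).
FIELDS = ["model", "question_id", "expected", "answer", "correct"]


def build_diff(gold, answers):
    """Return one row per (model, question) comparing the LLM answer to gold."""
    rows = []
    for model, model_answers in answers.items():
        for qid, expected in gold.items():
            got = model_answers.get(qid, "")  # a missing answer counts as wrong
            rows.append({
                "model": model,
                "question_id": qid,
                "expected": expected,
                "answer": got,
                # Assumption: case-insensitive exact match defines correctness.
                "correct": expected.strip().lower() == got.strip().lower(),
            })
    return rows


def accuracy_by_model(rows):
    """Return the fraction of correct answers for each model."""
    totals, hits = {}, {}
    for row in rows:
        model = row["model"]
        totals[model] = totals.get(model, 0) + 1
        hits[model] = hits.get(model, 0) + int(row["correct"])
    return {model: hits[model] / totals[model] for model in totals}


def save_diff(rows, path):
    """Write the diff rows to a CSV file as a permanent record."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical sample data, for illustration only.
gold = {"q1": "Paris", "q2": "4"}
answers = {
    "model_a": {"q1": "Paris", "q2": "5"},
    "model_b": {"q1": "paris", "q2": "4"},
}

rows = build_diff(gold, answers)
scores = accuracy_by_model(rows)
```

Calling save_diff(rows, "diff.csv") would then write the record to disk (the file name is a placeholder). Keeping the per-question rows, not just the final scores, makes it possible to re-check later which questions each LLM missed.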