How to extract ONSET_DATE from VAERS write-ups using Google Gemini
The ONSET_DATE field records the date on which the earliest symptom appeared after vaccination.
Here is an example write-up:
Feeling cold; Chills and shivering; high sweating; This is a spontaneous report from a contactable physician, received from the Regulatory Authority, regulatory authority report number of v20100815 and v20100816, and also received information via System. A 47-year-old female received first dose of bnt162b2 (COMIRNATY, lot number: EP2163, expiration date: 31May2021) via intramuscular at a single dose on 19Feb2021 at 14:05 at left arm for Covid-19 immunisation. Medical history included the histopathological diagnosis of neurofibromatosis, type 1 (von Recklinghausen's disease) from an unknown date, and the past history of excision of neurofibromatosis, no history of allergy, no relevant family history. The patient's concomitant medications were not reported. Body temperature before the vaccination was 35.9 degrees Centigrade on 19Feb2021. On 19Feb2021 at 14:30 (the day of vaccinations), the patient experienced feeling cold, chills and shivering. Patient also had high sweating on unspecified date in Feb2021. The clinical course was provided as follows: On 19Feb2021, the patient was asymptomatic 15 minutes after the inoculation, but suddenly feeling cold, chills and shivering appeared 30 minutes after just before going back to work. There was no decrease in blood pressure, impaired consciousness, nor bradycardia. Due to the high degree of coldness, the saline infusion was started, and heating with an electric blanket was started, and then the symptoms improved shortly. The patient was hospitalized for observation purposes. On 20Feb2021 (1 day after the vaccination), she was discharged because her symptoms disappeared without any significant changes in vital signs. The outcome of all events was recovered on 20Feb2021. The patient was not pregnant at the time of vaccination. Prior to vaccination, the patient was not diagnosed with COVID-19. Since the vaccination, the patient had not been tested for COVID-19. 
The physician classified seriousness criteria for the events as serious (seriousness criterion: hospitalization) and assessed the causality was unable to determine for bnt162b2, unknown for Recklinghausen's disease. The patient was hospitalized for the events from 19Feb2021 to 20Feb2021. Reporter comment: Although it was not a typical anaphylactic symptom, it was reported because of generalized cold sensation and high sweating. Upon the blood collection, there were no abnormal findings such as hypoglycemia. The causal relationship with von Recklinghausen's disease was also unknown.; Reporter's Comments: Although it was not a typical anaphylactic symptom, it was reported because of generalized cold sensation and high sweating. Upon the blood collection, there were no abnormal findings such as hypoglycemia. The causal relationship with von Recklinghausen's disease was also unknown.
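So the LLM has to do a bit of reasoning: it first needs to identify all the relevant dates in the write-up, and then decide which is the earliest one that falls on or after the date of vaccination. In the example above, the vaccination was given on 19Feb2021 at 14:05 and the first symptoms appeared the same day at 14:30, so the expected onset date is 2021-02-19.
The following Python script uses Gemini to extract the onset date from each write-up: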
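To make the two reasoning steps concrete, here is a minimal rule-based sketch of them. It is not the method used in this post, only an illustration: it assumes dates are written in the `19Feb2021` style seen in the example, that the vaccination date is already known, and that month abbreviations are parsed under an English locale. The names `find_dates` and `earliest_onset` are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumption: dates look like "19Feb2021", as in the example write-up.
DATE_PATTERN = re.compile(r"\b(\d{1,2})([A-Za-z]{3})(\d{4})\b")


def find_dates(text: str) -> list[date]:
    """Step 1: collect every fully specified DDMonYYYY date in the text."""
    found = []
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            found.append(datetime.strptime(f"{day}{month}{year}", "%d%b%Y").date())
        except ValueError:
            # Skip strings that look like dates but are not valid (e.g. 31Feb2021).
            continue
    return found


def earliest_onset(text: str, vaccination_date: date) -> Optional[date]:
    """Step 2: pick the earliest date on or after the vaccination date."""
    candidates = [d for d in find_dates(text) if d >= vaccination_date]
    return min(candidates) if candidates else None


sample = (
    "lot number: EP2163, expiration date: 31May2021. "
    "On 19Feb2021 at 14:30, the patient experienced chills. "
    "On 20Feb2021, she was discharged."
)
print(earliest_onset(sample, date(2021, 2, 19)))
```

This sketch cannot tell whether a date refers to a symptom, a discharge, or a lot expiry, and it ignores partial dates such as "Feb2021". Those are the judgment calls that the LLM is being asked to make.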
import json
import os
import time

import pandas as pd
from dotenv import load_dotenv
from google import genai
from google.genai import types
from pydantic import BaseModel, Field

# Load GEMINI_API_KEY from a local .env file
load_dotenv()
gemini_api_key = os.getenv('GEMINI_API_KEY')

class OnsetDate(BaseModel):
    """Model for extracting onset date information from VAERS data."""
    onset_date: str = Field(description="Date of earliest symptom onset in YYYY-MM-DD format. If the date is unknown, use YYYY-MM-UNK. If the month is also unknown, use YYYY-UNK. If no information is provided just use UNK")
    onset_date_explanation: str = Field(
        description="Verbatim sentence from the symptom_text which provides a citation for the onset date")

client = genai.Client(api_key=gemini_api_key)
df: pd.DataFrame = pd.read_csv('../csv/llm/japan_100.csv')
# Normalize column names so SYMPTOM_TEXT and VAERS_ID can be looked up reliably
df.columns = [x.upper() for x in df.columns]
model_name = 'gemini-2.0-flash-lite'
experiment = 'onset_date'
full_json = {}  # results keyed by VAERS_ID
num_rows = 100

for _, row in df.head(num_rows).iterrows():
    symptom_text = row['SYMPTOM_TEXT']
    vaers_id = row['VAERS_ID']
    print(f'Processing {vaers_id}')
    start_time = time.time()
    # The write-up itself is the prompt; the schema drives the structured output
    prompt = symptom_text
    response = client.models.generate_content(
        model=model_name,
        contents=prompt,
        config=types.GenerateContentConfig(
            system_instruction="You are a VAERS expert, and your goal is to read the symptom_text and provide the output in the specified schema",
            response_mime_type='application/json',
            response_schema=OnsetDate,
        )
    )
    elapsed = time.time() - start_time
    # Keep the full response alongside the parsed fields, timing, and prompt
    response_json = json.loads(response.model_dump_json())
    full_json[vaers_id] = {
        "response": response_json,
        "parsed": response_json['parsed'],
        "duration": elapsed,
        "prompt": prompt
    }

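file_name = f'../json/{experiment}/{experiment}_{model_name}.json'
# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(file_name), exist_ok=True)
with open(file_name, 'w') as f:
    json.dump(full_json, f, indent=2)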
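
Once the results are saved, it can be useful to check that each answer follows the format requested in the schema description and that the citation really appears in the write-up. The following is a minimal sketch of such a check, assuming the results have the structure written above (a `parsed` dict and the original `prompt` per VAERS_ID). The helper names `check_result` and `check_all` are hypothetical, and the pattern only checks the shape of the date, not whether the month and day are valid.

```python
import re
from typing import Optional

# Shapes allowed by the schema description: YYYY-MM-DD, YYYY-MM-UNK, YYYY-UNK, UNK
ONSET_DATE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}|\d{4}-\d{2}-UNK|\d{4}-UNK|UNK)$")


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not defeat the substring check."""
    return " ".join(text.split())


def check_result(parsed: Optional[dict], symptom_text: str) -> list[str]:
    """Return a list of problems found in one parsed answer (empty if none)."""
    if parsed is None:
        return ["no parsed output"]
    problems = []
    onset = parsed.get("onset_date", "")
    if not ONSET_DATE_PATTERN.match(onset):
        problems.append(f"unexpected onset_date format: {onset!r}")
    citation = normalize(parsed.get("onset_date_explanation", ""))
    if not citation or citation not in normalize(symptom_text):
        problems.append("citation is empty or not found verbatim in the write-up")
    return problems


def check_all(results: dict) -> dict:
    """Map each VAERS_ID to its problems, keeping only entries that have any."""
    flagged = {}
    for vaers_id, entry in results.items():
        problems = check_result(entry["parsed"], entry["prompt"])
        if problems:
            flagged[vaers_id] = problems
    return flagged
```

To use it, load the saved file with `json.load` and pass the resulting dict to `check_all`. Entries with an `onset_date` of UNK are still expected to carry a citation here, which may be stricter than needed.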