draft-reasoned_extraction_udemy

Crash Course

Welcome

Welcome

In about one hour you will evaluate LLMs on real V-safe neuropathy narratives using browser review tools—not generic chat demos.

You will compare DeepSeek v4 to closed-weight models (Claude, GPT, Gemini) with citation-backed tables, a three-model gold panel, and field-level summaries.

Prerequisites: a browser and familiarity with the three basic JSON fields (value, explanation, citation). No programming is required for this track; the tools run in the browser. Building the full pipeline from scratch does involve code, but this course focuses on the visual workflow and on how comparison works.

Demo — Gold Review at a Glance

Gold review overview

Open gold_review.html first—this is the main place you see model outputs side by side.

What you are looking at

Three gold columns — reference runs (Claude Opus, GPT-5.5, Gemini Pro in the default bundle).
One test column — your candidate (for example DeepSeek v4 Flash), chosen from the toolbar.
Each row — one registrant; the active field tab shows how every model answered that PACVS field for that person.

Screen regions

Region	Role
Toolbar	Test model, row filters, live stats
Field sidebar	PACVS fields; orange • marks fields with test mismatches vs agreeing gold
Comparison table	Gold + test cells (value, explanation, citation pills)
Markdown pane	Source narrative; pills jump to numbered cite rows

The UI is read-only: it loads assessments in the browser and does not edit files on disk.

Later lessons define EXACT and RIGOUR and go deeper on filters and citations. For now, scan the layout and notice how much is visible on one screen.

Key takeaway: Gold review is the visual home base—three gold models vs one test, one field at a time.

The Four Review Pages

Page	Role
`gold_review.html`	3 gold + 1 test — per registrant (demo above)
`comparison.html`	Pick model pairs → Summary or Detailed
`summary.html`	Field-level counts vs gold (two models side by side)
`llm_compare/llm_compare.html`	Left vs right EXACT cells, registrant by registrant

All four read the same assessments and source markdown. The Appendix — Background summarizes V-safe and PACVS if those terms are new.

For code and bundle generation, see the full Reasoned Extraction course—or contact the instructor for your use case.

Cost, Quality, and the Gold Panel

Inference cost: DeepSeek v4–class models are often roughly 10× cheaper than the closed-weight references in the gold panel (OpenAI GPT, Anthropic Claude, Google Gemini Pro). At scale, that gap matters.

This course does not claim DeepSeek is perfect. Three closed-weight models are the gold panel; DeepSeek (or any candidate) is the test scored against them—that encodes a higher-trust reference tier, not “cheapest wins.”

Trade-off you will measure

Gold tier — stronger baseline (gold can still split on ambiguous cases).
Test tier — may diverge; summary.html and gold review show where and how often.

You may accept slightly lower alignment to gold for large savings when citation-backed review shows the gap is acceptable for your workflow.

Key takeaway: Decide if cheaper is good enough using gold-aligned metrics—not marketing.

EXACT & Gold Panel

Reasoned Extraction and EXACT

Reasoned extraction means each PACVS field returns value, explanation, and citation—not a bare label.

EXACT — Explainable eXtraction Assessment using Citation Tables:

Letter	Term
E	Explainable (`explanation` on every field)
X	eXtraction → `PACVSNeuropathyCase` JSON
A	Assessment (PACVS rubric, including `is_pacvs`)
C	Citation (indices into numbered source rows)
T	Tables — source cite table + review grid + `summary.html` rollup

"is_pacvs": {
  "value": true,
  "explanation": "<p class='pacvs-summary'>…</p>…",
  "citation": [3, 7]
}
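A record like the one above can be sanity-checked before anyone reads it. The following is a minimal sketch, not part of the course tooling: the function name `check_exact_field` is hypothetical, and it assumes cite numbers are 1-based row numbers in the source citation table.

```python
from typing import Any

REQUIRED_KEYS = ("value", "explanation", "citation")


def check_exact_field(field: dict[str, Any], n_source_rows: int) -> list[str]:
    """Return a list of problems found in one EXACT field record (empty means OK)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in field]
    citation = field.get("citation", [])
    if not isinstance(citation, list):
        problems.append("citation is not a list")
        return problems
    for idx in citation:
        # Assumption: cite numbers are 1-based rows of the source citation table.
        is_int = isinstance(idx, int) and not isinstance(idx, bool)
        if not is_int or not 1 <= idx <= n_source_rows:
            problems.append(f"citation {idx!r} is outside rows 1..{n_source_rows}")
    return problems


is_pacvs = {
    "value": True,
    "explanation": "<p class='pacvs-summary'>...</p>",
    "citation": [3, 7],
}
print(check_exact_field(is_pacvs, n_source_rows=10))  # []
print(check_exact_field(is_pacvs, n_source_rows=5))   # one problem, for citation 7
```

A check like this only confirms that the citation points somewhere valid; whether row 3 or row 7 actually supports the value still needs a human read.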

Key takeaway: Audit the citation table, not only the boolean.

Pipeline in One Pass

1. Manifest: fixed registrant set.
2. V-safe markdown: narrative plus numbered source citation table (column 1 = cite #).
3. LLM extraction: *.assessment.json per model and registrant.
4. Review UIs: gold review, comparison index, summary, detailed compare.
5. Human review: filters, row colors, citation pills.

Default gold models: Claude Opus, GPT-5.5, Gemini Pro. Test / left / right: DeepSeek v4 and other runs in the bundle.

Key takeaway: This crash course is steps 4–5 on the live bundle.

RIGOUR: Three Gold Models and One Test

Gold vs test columns

RIGOUR names the evaluation discipline that pairs with EXACT (what you inspect in each cell):

Letter	Term	Meaning in this project
R	Reasoned	Gold and test outputs are reasoned extractions (`value`, `explanation`, `citation`)—not bare labels.
I	Inference	Each model infers PACVS fields from the same registrant narrative.
G	Gold-standard	A fixed panel of three reference models (Claude Opus, GPT-5.5, Gemini Pro in the default bundle).
O	Outcomes	Per-field outcomes on each registrant: agree, split, majority match, not comparable.
U	Under	The test model is scored under that panel—one candidate column vs three gold columns.
R	Review	Human review in `gold_review.html` and `summary.html`: filters, row colors, citation checks.

Full expansion: Reasoned Inference with Gold-standard Outcomes Under Review.

How it works in gold review

Gold agree: all three gold values match → strong baseline.
Gold split: the gold models disagree → check whether the test matches the 2-of-3 majority.
No majority: no two gold values match (a three-way tie); read citations manually (gray / dedicated filter).
Change only the test dropdown; gold columns stay fixed.

Short values (about 20 characters or fewer in gold review) are compared automatically; longer text stays visible but may be marked not comparable (gray rows). The same agree / split / majority buckets appear as counts on summary.html.
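The bucketing described in this lesson can be sketched in a few lines. This is an illustration under stated assumptions, not the review tool's actual code: the names `classify` and `normalize` are hypothetical, values are compared as trimmed, case-insensitive strings, and the length cutoff is taken as exactly 20 characters.

```python
from collections import Counter

# From the lesson: values of roughly 20 characters or fewer are auto-compared.
MAX_COMPARABLE_LEN = 20


def normalize(value) -> str:
    # Assumption: compare trimmed, case-insensitive string forms.
    return str(value).strip().casefold()


def classify(gold_values, test_value) -> str:
    """Bucket one registrant/field row against a three-model gold panel."""
    if len(gold_values) != 3:
        raise ValueError("expected exactly three gold values")
    gold = [normalize(v) for v in gold_values]
    test = normalize(test_value)
    if any(len(v) > MAX_COMPARABLE_LEN for v in gold + [test]):
        return "not comparable"
    top_value, top_count = Counter(gold).most_common(1)[0]
    if top_count == 3:
        return "gold agree, test matches" if test == top_value else "gold agree, test differs"
    if top_count == 2:
        return "gold split, test = majority" if test == top_value else "gold split, test != majority"
    return "no majority"


print(classify([True, True, True], False))    # gold agree, test differs
print(classify(["yes", "yes", "no"], "yes"))  # gold split, test = majority
print(classify(["a", "b", "c"], "a"))         # no majority
```

Changing the test model changes only `test_value`; the three gold values stay fixed, which mirrors the fixed gold columns in the UI.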

Key takeaway: EXACT is what you read in each cell; RIGOUR is how you judge a test model against the gold panel.

Gold Review

EXACT Cells and Citations

Markdown and citations

Back in gold_review.html (demo lesson): each table cell is one EXACT record.

Value (bold)
Explanation (often PACVS heuristic HTML; blue pacvs-summary when present)
Citation pills → numbered rows in the markdown pane

Click a registrant heading → load that person’s source narrative.
Click a pill → jump to cite-{slug}-{n} and highlight the row.
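The pill-to-row jump relies on a predictable anchor id. A minimal sketch of building such ids follows; the `cite-{slug}-{n}` pattern comes from the lesson, while the `slugify` rule (lowercase, non-alphanumeric runs collapsed to hyphens) and the registrant code `REG 001` are assumptions made for illustration.

```python
import re


def slugify(registrant_code: str) -> str:
    # Assumption: lowercase, with each run of non-alphanumerics turned into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", registrant_code.lower()).strip("-")


def cite_anchor(registrant_code: str, n: int) -> str:
    """Build the anchor id a citation pill would jump to."""
    return f"cite-{slugify(registrant_code)}-{n}"


def cite_anchors(registrant_code: str, citation: list[int]) -> list[str]:
    """One anchor per cited row, in the order the model cited them."""
    return [cite_anchor(registrant_code, n) for n in citation]


print(cite_anchors("REG 001", [3, 7]))  # ['cite-reg-001-3', 'cite-reg-001-7']
```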

Key takeaway: Disagreements are settled in the source table, not by trusting prose alone.

Filters and Row Colors

Toolbar filters

Filter	When to use
Gold agree · test differs	First pass on a new model
Gold disagree · test = majority	Gold split; test tracks the 2-of-3 majority
Gold disagree · test ≠ majority	Test outlier on a split
No majority	Tie — read citations manually

Color	Meaning
Green	Gold agree · test matches
Red	Gold agree · test differs
Yellow	Gold split
Gray	Not comparable

Key takeaway: Red rows flag problems in the test model; yellow rows send you back to the source narrative, because the gold panel itself disagrees.

Hands-On: Five-Minute Gold Review

Field table

1. Set test to DeepSeek v4 Flash (or your candidate).
2. Open is_pacvs → filter Gold agree · test differs.
3. For each row: compare cells → click citation pills.
4. Repeat for recommend_compensation.
5. Note the field tabs marked with •; these are the weakest dimensions.

Key takeaway: Filtered gold review beats reading raw JSON.

Compare & Summary

comparison.html and summary.html

Comparison index

comparison.html — pick a model pair. Each row offers:

Link	Page	Question
Summary	`summary.html`	How often does each model match gold by field?
Detailed	`llm_compare/llm_compare.html`	Who differs, with full EXACT cells?

summary.html — Model A vs Model B, one row per PACVS field:

Agree+match / Agree+diff: gold is unanimous, and the test either matches or differs
Split=maj / Split≠maj: gold is split, and the test either matches or misses the 2-of-3 majority
Same gold panel as gold review; counts cover the whole manifest
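The field-level rollup can be pictured as a counter per field. The sketch below is an assumption about how such counts could be produced, not the code behind summary.html; `bucket` and `summarize` are hypothetical names, the rows are toy data invented for illustration, and values are assumed to be short and hashable (booleans or strings).

```python
from collections import Counter, defaultdict


def bucket(gold_values, test_value) -> str:
    """Name the summary column for one registrant/field row."""
    top_value, top_count = Counter(gold_values).most_common(1)[0]
    if top_count == 3:
        return "Agree+match" if test_value == top_value else "Agree+diff"
    if top_count == 2:
        return "Split=maj" if test_value == top_value else "Split≠maj"
    return "No majority"


def summarize(rows):
    """rows: iterable of (field, three gold values, test value)."""
    table = defaultdict(Counter)
    for field, gold_values, test_value in rows:
        table[field][bucket(gold_values, test_value)] += 1
    return table


# Toy rows, invented for illustration only.
rows = [
    ("is_pacvs", [True, True, True], True),
    ("is_pacvs", [True, True, True], False),
    ("is_pacvs", [True, True, False], True),
    ("recommend_compensation", [True, False, False], True),
]
for field, counts in summarize(rows).items():
    print(field, dict(counts))
```

Fields with a high Agree+diff count are the first place to look, since the gold panel was unanimous and the test model still disagreed.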

Optional: preselect the model pair with query parameters, for example `?left=deepseek__deepseek-v4-flash&right=openai__gpt-5.5`.
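If you build such links in a script, the standard library can assemble the query string. This is a small sketch: the helper name `pair_url` is hypothetical, and using `summary.html` as the target page is an assumption, since the lesson gives only the parameters.

```python
from urllib.parse import urlencode


def pair_url(page: str, left: str, right: str) -> str:
    """Append left/right model ids to a review page as query parameters."""
    return f"{page}?{urlencode({'left': left, 'right': right})}"


# Assumption: the parameters apply to summary.html.
print(pair_url("summary.html", "deepseek__deepseek-v4-flash", "openai__gpt-5.5"))
# summary.html?left=deepseek__deepseek-v4-flash&right=openai__gpt-5.5
```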

Key takeaway: Summary before detailed — find weak fields in seconds.

Summary table

Detailed Compare and Which Tool When

LLM compare toolbar

Detailed compare — left vs right (no gold columns):

Filters: Differ, Match, No file
Field tabs with • where models disagree
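The three filters amount to a simple per-row status. A minimal sketch follows, with assumptions labeled: `pair_status` and `filter_rows` are hypothetical names, `None` stands in for a missing assessment file, and the registrant codes are invented.

```python
def pair_status(left, right) -> str:
    """Status of one registrant/field row in a left-vs-right comparison."""
    if left is None or right is None:
        # Assumption: None means that model has no assessment file for this registrant.
        return "No file"
    return "Match" if left == right else "Differ"


def filter_rows(rows, status: str):
    """Keep the registrant codes whose row has the requested status."""
    return [code for code, left, right in rows if pair_status(left, right) == status]


# Toy rows: (registrant code, left value, right value), invented for illustration.
rows = [
    ("R001", True, True),
    ("R002", True, False),
    ("R003", None, False),
]
print(filter_rows(rows, "Differ"))   # ['R002']
print(filter_rows(rows, "No file"))  # ['R003']
```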

Tool	Best for
Summary	Field-level scoreboard (two models)
Detailed	Registrant drill-down, Differ filter
Gold review	Official test vs 3-gold panel
Comparison index	Navigation

Key takeaway: Summary = where; Detailed = who; Gold review = test vs gold on each person.

DeepSeek vs Closed-Weight: One Workflow

Differ vs match

1. comparison.html → DeepSeek vs GPT (or Claude).
2. Summary: note fields with high Agree+diff / Split≠maj counts for DeepSeek.
3. Detailed: filter Differ on those fields; verify citations.
4. Gold review: set the same model as test; confirm the rows under Gold agree · test differs.

Show both aggregate counts and one cite-backed example when you report results.

Key takeaway: Never compare models on labels alone—use summary + citations.

Wrap-Up

Gold Split and Common Mistakes

Gold split

Gold split (yellow): gold models disagree. Check whether test = 2-of-3 majority before blaming the test model.

Mistake	Fix
Value-only judgment	Read explanation + citations
Skipping summary	Run field rollup first
Treating yellow as “test wrong”	Adjudicate the narrative
Ignoring gray rows	Long fields still matter qualitatively

Key takeaway: The same majority logic drives gold review rows and summary columns.

End-to-End Checklist and Quick Reference

Match vs mismatch

Checklist

1. comparison.html: confirm both models have runs.
2. summary.html: find weak fields by count.
3. Detailed compare: apply Differ, then check citations.
4. gold_review.html: select the test model and filter Gold agree · test differs.
5. Record field, registrant, and cite row for each finding.
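For the last checklist item, a small structured record keeps findings consistent across reviewers. The sketch below is one possible shape, not something the bundle provides: the `Finding` class, its `note` column, and the sample row are all hypothetical.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class Finding:
    field: str        # PACVS field name, for example is_pacvs
    registrant: str   # study code shown as the row heading
    cite_row: int     # numbered row in the source citation table
    note: str = ""    # hypothetical free-text column


def findings_to_csv(findings) -> str:
    """Serialize findings as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([f.name for f in fields(Finding)])
    writer.writerows(astuple(finding) for finding in findings)
    return buffer.getvalue()


# Invented example row for illustration.
log = [Finding("is_pacvs", "R001", 7, "test differs from unanimous gold")]
print(findings_to_csv(log))
```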

Quick reference

Tool	Purpose
`gold_review.html`	3 gold + 1 test
`comparison.html`	Pair picker
`summary.html`	Field vs gold counts
`llm_compare/llm_compare.html`	Left vs right EXACT

EXACT: explainable JSON → citation tables in the UI.

When reporting results to stakeholders, pair cost (e.g. ~10× lower inference for DeepSeek-class models vs the gold-tier APIs) with measured alignment (summary counts, gold agree · test differs, citation spot-checks)—not either one alone.

For code, PACVS rubric detail, and V-safe narratives, continue with the full Reasoned Extraction lesson set when you are ready to build or extend the bundle. For a specific domain or deployment, contact the instructor to adapt this evaluation pattern to your use case.

Appendix — Background

V-safe Overview

V-safe is the CDC’s smartphone program for active monitoring of COVID-19 vaccine safety. After vaccination, participants can report symptoms and follow-ups over time.

This course uses a small research sample of de-identified registrant stories related to neuropathy—not the live V-safe system and not the full national database.

What you see in the review tools

Each registrant has a stable study code (shown as the row heading in gold review).
The right-hand pane shows that person’s narrative—the same text models read during extraction.
Numbered rows in the narrative are the source citation table. Citation pills in model cells jump to those row numbers so you can check whether a label matches the story.

What stories usually mention

Vaccine product, dose, and timing
Days from vaccination to symptom onset
Neuropathy quality (tingling, burning, patchy vs widespread, facial involvement)
Other symptoms (fatigue, brain fog, headache, dysautonomia-like features)
Competing explanations (prior COVID, diabetes, B12, thyroid, and similar)

Structured PACVS fields (see the PACVS Overview lesson in this appendix) are an interpretation of this narrative. The narrative remains the source of truth when models disagree.

Scope limits

Study codes only—no re-identification of individuals.
Research and model comparison—not clinical decision support or official pharmacovigilance workflow.

PACVS Overview

PACVS (Post-Acute COVID-19 Vaccination Syndrome) is a clinical pattern used here to classify certain neuropathy presentations after COVID-19 vaccination, especially when symptoms cluster with other systemic features and onset is close in time to a dose.

The review tools show one PACVS assessment per model per registrant: many fields (timing, pattern, cluster symptoms, objections, compensation stance) plus a top-level is_pacvs judgment.

Distinctions the rubric keeps in play

PACVS neuropathy (is_pacvs true) — meets the project threshold on structured criteria
Vaccine-related neuropathy, not full PACVS — injury after vaccine without the full pattern
Overlap but low likelihood — some features, not a clear case
Ordinary or alternative neuropathy — other diagnoses and “why not PACVS” reasoning

Gold review shows how different LLMs apply the same rubric to the same V-safe story.

Themes you will see as field tabs

Timing and vaccine context (manufacturer, dose, days from vaccine to onset)
Neuropathy pattern (small-fiber, patchy, facial/cranial, burning, and similar)
Multi-system cluster (fatigue, brain fog, dysautonomia, headache)
Severity and course
Objections and alternative causes
Compensation recommendation (separate clinical judgment—not auto-set from the score)
Overall likelihood (1–10) and is_pacvs

PACVS heuristic in cell

Heuristic score (why explanations look like “3/3 temporal”)

A transparent point system (max 10) drives overall_pacvs_likelihood and is_pacvs (typically ≥ 7 means PACVS on current rules). Points come from, for example:

Strong temporal link (onset within 42 days of the relevant dose)
Patchy or facial/cranial pattern
Multi-system cluster (two or more of fatigue, brain fog, dysautonomia, headache)
Other common causes ruled out

Models often show this breakdown in the blue pacvs-summary line and score blocks inside the cell explanation. You still read citations to confirm the story supports the points.
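To make the point system concrete, here is a sketch of how such a score could be computed. Only the maximum of 10, the 42-day window, the two-or-more cluster rule, and the threshold of 7 come from the lesson. The individual weights are HYPOTHETICAL, invented so that the total is 10; the project's real rubric may include other criteria and different weights.

```python
PACVS_THRESHOLD = 7        # lesson: typically >= 7 means PACVS on current rules
TEMPORAL_WINDOW_DAYS = 42  # lesson: onset within 42 days of the relevant dose
CLUSTER_SYMPTOMS = {"fatigue", "brain fog", "dysautonomia", "headache"}

# HYPOTHETICAL weights, chosen only so that the maximum is 10.
WEIGHTS = {"temporal": 3, "pattern": 2, "cluster": 3, "ruled_out": 2}


def pacvs_score(days_to_onset, patchy_or_facial, symptoms, other_causes_ruled_out):
    """Return (total points, is_pacvs, per-criterion breakdown)."""
    # Assumption: "within 42 days" includes day 42.
    in_window = days_to_onset is not None and 0 <= days_to_onset <= TEMPORAL_WINDOW_DAYS
    has_cluster = len(CLUSTER_SYMPTOMS & set(symptoms)) >= 2
    met = {
        "temporal": in_window,
        "pattern": bool(patchy_or_facial),
        "cluster": has_cluster,
        "ruled_out": bool(other_causes_ruled_out),
    }
    breakdown = {name: WEIGHTS[name] if ok else 0 for name, ok in met.items()}
    total = sum(breakdown.values())
    return total, total >= PACVS_THRESHOLD, breakdown


total, is_pacvs, breakdown = pacvs_score(10, True, {"fatigue", "headache"}, False)
print(total, is_pacvs, breakdown)
# 8 True {'temporal': 3, 'pattern': 2, 'cluster': 3, 'ruled_out': 0}
```

A breakdown like this is what a line such as "3/3 temporal" summarizes: points earned over points available for one criterion.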

Key takeaway: PACVS here is a structured, auditable rubric—not a single chat answer. EXACT fields let you see value, reasoning, and source lines for each piece of that rubric.