draft-reasoned_extraction_udemy


Crash Course
Welcome

In about one hour you will evaluate LLMs on real V-safe neuropathy narratives using browser review tools—not generic chat demos.

You will compare DeepSeek v4 to closed-weight models (Claude, GPT, Gemini) with citation-backed tables, a three-model gold panel, and field-level summaries.

Prerequisites: a browser and a basic grasp of JSON fields (value, explanation, citation). No programming is required for this track; the tools run in the browser. Building the full pipeline from scratch involves code, but this course focuses on the visual workflow and how comparison works.

Demo — Gold Review at a Glance

Gold review overview

Open gold_review.html first—this is the main place you see model outputs side by side.

What you are looking at

  • Three gold columns — reference runs (Claude Opus, GPT-5.5, Gemini Pro in the default bundle).
  • One test column — your candidate (for example DeepSeek v4 Flash), chosen from the toolbar.
  • Each row — one registrant; the active field tab shows how every model answered that PACVS field for that person.

Screen regions

Region | Role
Toolbar | Test model, row filters, live stats
Field sidebar | PACVS fields; orange marks fields with test mismatches vs agreeing gold
Comparison table | Gold + test cells (value, explanation, citation pills)
Markdown pane | Source narrative; pills jump to numbered cite rows

The UI is read-only: it loads assessments in the browser and does not edit files on disk.

Later lessons define EXACT and RIGOUR and go deeper on filters and citations. For now, scan the layout and notice how much is visible in one screen.

Key takeaway: Gold review is the visual home base—three gold models vs one test, one field at a time.

The Four Review Pages

Page | Role
gold_review.html | 3 gold + 1 test, per registrant (demo above)
comparison.html | Pick model pairs → Summary or Detailed
summary.html | Field-level counts vs gold (two models side by side)
llm_compare/llm_compare.html | Left vs right EXACT cells, registrant by registrant

All four read the same assessments and source markdown. The Appendix — Background summarizes V-safe and PACVS if those terms are new.

For code and bundle generation, see the full Reasoned Extraction course—or contact the instructor for your use case.

Cost, Quality, and the Gold Panel

Inference cost: DeepSeek v4–class models are often roughly 10× cheaper than the closed-weight references in the gold panel (OpenAI GPT, Anthropic Claude, Google Gemini Pro). At scale, that gap matters.

This course does not claim DeepSeek is perfect. Three closed-weight models are the gold panel; DeepSeek (or any candidate) is the test scored against them—that encodes a higher-trust reference tier, not “cheapest wins.”

Trade-off you will measure

  • Gold tier — stronger baseline (gold can still split on ambiguous cases).
  • Test tier — may diverge; summary.html and gold review show where and how often.

You may accept slightly lower alignment to gold for large savings when citation-backed review shows the gap is acceptable for your workflow.

Key takeaway: Decide if cheaper is good enough using gold-aligned metrics—not marketing.

EXACT & Gold Panel
Reasoned Extraction and EXACT

Reasoned extraction means each PACVS field returns value, explanation, and citation—not a bare label.

EXACT — Explainable eXtraction Assessment using Citation Tables:

Letter | Term
E | Explainable (explanation on every field)
X | eXtraction → PACVSNeuropathyCase JSON
A | Assessment (PACVS rubric, including is_pacvs)
C | Citation (indices into numbered source rows)
T | Tables: source cite table + review grid + summary.html rollup

Example field in the extracted JSON:

"is_pacvs": {
  "value": true,
  "explanation": "<p class='pacvs-summary'>…</p>…",
  "citation": [3, 7]
}
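
To make "audit the citation table" concrete, here is a minimal Python sketch that checks one EXACT field for its three keys and confirms that every cite number points at an existing source row. The function name check_exact_field and the cite_rows dictionary are hypothetical, introduced only for illustration; the course tools do this visually and may work differently.

```python
def check_exact_field(field, cite_rows):
    """Return a list of problems for one EXACT field (empty list means OK)."""
    problems = []
    for key in ("value", "explanation", "citation"):
        if key not in field:
            problems.append(f"missing key: {key}")
    for n in field.get("citation", []):
        if n not in cite_rows:
            problems.append(f"citation {n} has no source row")
    return problems


# Hypothetical numbered source rows (placeholder text, not real narrative data).
cite_rows = {3: "placeholder row 3", 7: "placeholder row 7"}

good = {"value": True, "explanation": "<p class='pacvs-summary'>…</p>…", "citation": [3, 7]}
bad = {"value": True, "citation": [9]}

print(check_exact_field(good, cite_rows))  # []
print(check_exact_field(bad, cite_rows))
```

An empty list means the field is structurally complete and every citation resolves; whether the cited rows actually support the value is still a human judgment.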

Key takeaway: Audit the citation table, not only the boolean.

Pipeline in One Pass

  1. Manifest — fixed registrant set.
  2. V-safe markdown — narrative + numbered source citation table (column 1 = cite #).
  3. LLM extraction → *.assessment.json per model and registrant.
  4. Review UIs — gold review, comparison index, summary, detailed compare.
  5. Human review — filters, row colors, citation pills.

Default gold models: Claude Opus, GPT-5.5, Gemini Pro. Test / left / right: DeepSeek v4 and other runs in the bundle.

Key takeaway: This crash course is steps 4–5 on the live bundle.

RIGOUR: Three Gold Models and One Test

Gold vs test columns

RIGOUR names the evaluation discipline that pairs with EXACT (what you inspect in each cell):

Letter | Term | Meaning in this project
R | Reasoned | Gold and test outputs are reasoned extractions (value, explanation, citation), not bare labels.
I | Inference | Each model infers PACVS fields from the same registrant narrative.
G | Gold-standard | A fixed panel of three reference models (Claude Opus, GPT-5.5, Gemini Pro in the default bundle).
O | Outcomes | Per-field outcomes on each registrant: agree, split, majority match, not comparable.
U | Under | The test model is scored under that panel: one candidate column vs three gold columns.
R | Review | Human review in gold_review.html and summary.html: filters, row colors, citation checks.

Full expansion: Reasoned Inference with Gold-standard Outcomes Under Review.

How it works in gold review

  • Gold agree — all three gold values match → strong baseline.
  • Gold split — gold disagree → check whether the test matches 2-of-3 majority.
  • No majority — gold tie; read citations manually (gray / dedicated filter).
  • Change only the test dropdown; gold columns stay fixed.

Short values (≤ ~20 characters in gold review) are compared automatically; longer text stays visible but may be marked not comparable (gray rows). The same agree / split / majority buckets appear as counts on summary.html.
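
The agree / split / majority logic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the outcome labels, the classify function, string normalization with str(), and the exact 20-character cutoff are all hypothetical choices made for the example.

```python
from collections import Counter

MAX_COMPARABLE_LEN = 20  # assumption: the text only says "≤ ~20 characters"


def classify(gold, test):
    """Label one field for one registrant: three gold values vs one test value."""
    gold_s = [str(v).strip() for v in gold]
    test_s = str(test).strip()
    # Long text is shown but not auto-compared.
    if any(len(v) > MAX_COMPARABLE_LEN for v in gold_s + [test_s]):
        return "not_comparable"
    top_value, top_count = Counter(gold_s).most_common(1)[0]
    if top_count == len(gold_s):  # all gold values match
        return "agree_match" if test_s == top_value else "agree_diff"
    if 2 * top_count > len(gold_s):  # 2-of-3 majority
        return "split_majority_match" if test_s == top_value else "split_majority_diff"
    return "no_majority"  # every gold value differs


print(classify([True, True, True], True))                # agree_match
print(classify([True, True, False], False))              # split_majority_diff
print(classify(["mild", "moderate", "severe"], "mild"))  # no_majority
```

Because only the test argument changes when you switch the test dropdown, the gold side of every row stays fixed, which is what makes runs of different candidates comparable.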

Key takeaway: EXACT is what you read in each cell; RIGOUR is how you judge a test model against the gold panel.

Gold Review
EXACT Cells and Citations

Markdown and citations

Back in gold_review.html (demo lesson): each table cell is one EXACT record.

  • Value (bold)
  • Explanation (often PACVS heuristic HTML; blue pacvs-summary when present)
  • Citation pills → numbered rows in the markdown pane

  1. Click a registrant heading → load that person’s source narrative.
  2. Click a pill → jump to cite-{slug}-{n} and highlight the row.
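
The cite-{slug}-{n} pattern in step 2 can be illustrated with a short Python sketch that turns a field's citation list into jump targets. The helper names and the slug value "reg-001" are hypothetical; only the cite-{slug}-{n} shape comes from the lesson text.

```python
def cite_anchor(slug, n):
    """Build the anchor id for cite row n of one registrant's narrative."""
    return f"cite-{slug}-{n}"


def pill_targets(slug, citation):
    """One anchor id per citation pill, in the order the model cited them."""
    return [cite_anchor(slug, n) for n in citation]


# "reg-001" is a made-up slug used only for this example.
print(pill_targets("reg-001", [3, 7]))  # ['cite-reg-001-3', 'cite-reg-001-7']
```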

Key takeaway: Disagreements are settled in the source table, not by trusting prose alone.

Filters and Row Colors

Toolbar filters

Filter | When to use
Gold agree · test differs | First pass on a new model
Gold disagree · test = majority | Gold split; test tracks the 2-of-3 majority
Gold disagree · test ≠ majority | Test outlier on a split
No majority | Tie; read citations manually

Color | Meaning
Green | Gold agree · test matches
Red | Gold agree · test differs
Yellow | Gold split
Gray | Not comparable
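
As a rough mental model, the filter and color tables above amount to two lookups on a per-row outcome. The Python sketch below uses hypothetical outcome labels and a made-up row structure; treating a gold tie as gray follows the earlier note on No majority and is an assumption about the tool, not a confirmed detail.

```python
# Hypothetical outcome labels mapped to the row colors in the table above.
ROW_COLOR = {
    "agree_match": "green",
    "agree_diff": "red",
    "split_majority_match": "yellow",
    "split_majority_diff": "yellow",
    "no_majority": "gray",      # assumption: ties are shown gray
    "not_comparable": "gray",
}

# Toolbar filter names mapped to the outcome each one keeps.
FILTERS = {
    "Gold agree · test differs": "agree_diff",
    "Gold disagree · test = majority": "split_majority_match",
    "Gold disagree · test ≠ majority": "split_majority_diff",
    "No majority": "no_majority",
}


def filter_rows(rows, filter_name):
    """Keep only the rows whose outcome matches the chosen filter."""
    wanted = FILTERS[filter_name]
    return [row for row in rows if row["outcome"] == wanted]


# Made-up rows for illustration only.
rows = [
    {"registrant": "R-001", "outcome": "agree_diff"},
    {"registrant": "R-002", "outcome": "agree_match"},
    {"registrant": "R-003", "outcome": "split_majority_diff"},
]
for row in filter_rows(rows, "Gold agree · test differs"):
    print(row["registrant"], ROW_COLOR[row["outcome"]])  # R-001 red
```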

Key takeaway: Red rows point to a problem with the test model; yellow rows mean gold itself is split, so go back to the narrative and citations.

Hands-On: Five-Minute Gold Review

Field table

  1. Set test to DeepSeek v4 Flash (or your candidate).
  2. Open is_pacvs → filter Gold agree · test differs.
  3. For each row: compare cells → click citation pills.
  4. Repeat for recommend_compensation.
  5. Note the field tabs marked orange (test mismatches); these are the weakest dimensions.

Key takeaway: Filtered gold review beats reading raw JSON.

Compare & Summary
comparison.html and summary.html

Comparison index

comparison.html — pick a model pair. Each row offers:

Link | Page | Question
Summary | summary.html | How often does each model match gold by field?
Detailed | llm_compare.html | Who differs, with full EXACT cells?

summary.html — Model A vs Model B, one row per PACVS field:

  • Agree+match / Agree+diff — gold unanimous
  • Split=maj / Split≠maj — gold split vs test
  • Same gold panel as gold review; counts across the whole manifest
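
The per-field rollup described in these bullets is a simple tally. The following Python sketch counts the four summary columns per field and ranks fields by how often the test departs from gold. The input pairs are hypothetical illustration data, and the function names are not taken from the course tools.

```python
from collections import Counter, defaultdict

COLUMNS = ["Agree+match", "Agree+diff", "Split=maj", "Split≠maj"]


def summarize(outcomes):
    """Count outcome labels per PACVS field across the whole manifest."""
    table = defaultdict(Counter)
    for field, outcome in outcomes:
        table[field][outcome] += 1
    return table


def weakest_fields(table):
    """Sort fields by how often the test departs from gold, most first."""
    def misses(field):
        return table[field]["Agree+diff"] + table[field]["Split≠maj"]
    return sorted(table, key=misses, reverse=True)


# Hypothetical (field, outcome) pairs, one per registrant and field.
outcomes = [
    ("is_pacvs", "Agree+match"),
    ("is_pacvs", "Agree+diff"),
    ("recommend_compensation", "Agree+diff"),
    ("recommend_compensation", "Split≠maj"),
    ("recommend_compensation", "Split=maj"),
]
table = summarize(outcomes)
for field in weakest_fields(table):
    print(field, [table[field][c] for c in COLUMNS])
```

With this made-up input, recommend_compensation is listed first because it has two departures from gold against one for is_pacvs.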

Optional: preselect a model pair with URL query parameters, for example ?left=deepseek__deepseek-v4-flash&right=openai__gpt-5.5.
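
If you want to generate such links rather than type them, the query string can be built with Python's standard urllib.parse module. This sketch assumes the parameters are appended to summary.html, which the lesson does not state explicitly; the pair_url helper is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def pair_url(page, left, right):
    """Append left/right model ids to a review page as query parameters."""
    return f"{page}?{urlencode({'left': left, 'right': right})}"


# Assumption: summary.html is the page that reads these parameters.
url = pair_url("summary.html", "deepseek__deepseek-v4-flash", "openai__gpt-5.5")
print(url)

# Round-trip check: parse the query string back into model ids.
print(parse_qs(urlsplit(url).query))
```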

Key takeaway: Summary before detailed — find weak fields in seconds.

Summary table

Detailed Compare and Which Tool When

LLM compare toolbar

Detailed compare — left vs right (no gold columns):

  • Filters: Differ, Match, No file
  • Field tabs flag where models disagree
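
The three filters above correspond to three possible states of a left/right pair. A minimal Python sketch, assuming each side is either a loaded EXACT field (a dict with a value key) or None when that model has no assessment file for the registrant; comparing on value alone is a simplification made for this example.

```python
def compare_cell(left, right):
    """Return the filter bucket for one field of one registrant."""
    if left is None or right is None:
        return "No file"
    return "Match" if left.get("value") == right.get("value") else "Differ"


# Hypothetical cells for illustration only.
a = {"value": True, "citation": [3, 7]}
b = {"value": False, "citation": [3]}

print(compare_cell(a, a))     # Match
print(compare_cell(a, b))     # Differ
print(compare_cell(a, None))  # No file
```

A Match on value can still hide different citations, which is why the workflow asks you to verify citations on the rows you report.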

Tool | Best for
Summary | Field-level scoreboard (two models)
Detailed | Registrant drill-down, Differ filter
Gold review | Official test vs 3-gold panel
Comparison index | Navigation

Key takeaway: Summary = where; Detailed = who; Gold review = test vs gold on each person.

DeepSeek vs Closed-Weight: One Workflow

Differ vs match

  1. comparison.html → DeepSeek vs GPT (or Claude).
  2. Summary — fields with high Agree+diff / Split≠maj on DeepSeek.
  3. Detailed → Differ on those fields; verify citations.
  4. Gold review — same model as test; confirm Gold agree · test differs.

Show both aggregate counts and one cite-backed example when you report results.

Key takeaway: Never compare models on labels alone—use summary + citations.

Wrap-Up
Gold Split and Common Mistakes

Gold split

Gold split (yellow): gold models disagree. Check whether test = 2-of-3 majority before blaming the test model.

Mistake | Fix
Value-only judgment | Read explanation + citations
Skipping summary | Run field rollup first
Treating yellow as “test wrong” | Adjudicate the narrative
Ignoring gray rows | Long fields still matter qualitatively

Key takeaway: Same majority logic in gold review rows and summary columns.

End-to-End Checklist and Quick Reference

Match vs mismatch

Checklist

  1. comparison.html — both models have runs.
  2. summary.html — weak fields by count.
  3. Detailed compare — Differ + citation check.
  4. gold_review.html → set the test model, filter Gold agree · test differs.
  5. Record field, registrant, and cite row for each finding.

Quick reference

Tool | Purpose
gold_review.html | 3 gold + 1 test
comparison.html | Pair picker
summary.html | Field vs gold counts
llm_compare.html | Left vs right EXACT

EXACT: explainable JSON → citation tables in the UI.

When reporting results to stakeholders, pair cost (e.g. ~10× lower inference for DeepSeek-class models vs the gold-tier APIs) with measured alignment (summary counts, gold agree · test differs, citation spot-checks)—not either one alone.

For code, PACVS rubric detail, and V-safe narratives, continue with the full Reasoned Extraction lesson set when you are ready to build or extend the bundle. For a specific domain or deployment, contact the instructor to adapt this evaluation pattern to your use case.

Appendix — Background
V-safe Overview

V-safe is the CDC’s smartphone program for active monitoring of COVID-19 vaccine safety. After vaccination, participants can report symptoms and follow-ups over time.

This course uses a small research sample of de-identified registrant stories related to neuropathy—not the live V-safe system and not the full national database.

What you see in the review tools

  • Each registrant has a stable study code (shown as the row heading in gold review).
  • The right-hand pane shows that person’s narrative—the same text models read during extraction.
  • Numbered rows in the narrative are the source citation table. Citation pills in model cells jump to those row numbers so you can check whether a label matches the story.

What stories usually mention

  • Vaccine product, dose, and timing
  • Days from vaccination to symptom onset
  • Neuropathy quality (tingling, burning, patchy vs widespread, facial involvement)
  • Other symptoms (fatigue, brain fog, headache, dysautonomia-like features)
  • Competing explanations (prior COVID, diabetes, B12, thyroid, and similar)

Structured PACVS fields (see the PACVS Overview lesson) are an interpretation of this narrative. The narrative remains the source of truth when models disagree.

Scope limits

  • Study codes only—no re-identification of individuals.
  • Research and model comparison—not clinical decision support or official pharmacovigilance workflow.

PACVS Overview

PACVS (Post-Acute COVID-Vaccination Syndrome) is a clinical pattern used here to classify certain neuropathy presentations after COVID-19 vaccination—especially when symptoms cluster with other systemic features and onset is close in time to a dose.

The review tools show one PACVS assessment per model per registrant: many fields (timing, pattern, cluster symptoms, objections, compensation stance) plus a top-level is_pacvs judgment.

Distinctions the rubric keeps in play

  • PACVS neuropathy (is_pacvs true) — meets the project threshold on structured criteria
  • Vaccine-related neuropathy, not full PACVS — injury after vaccine without the full pattern
  • Overlap but low likelihood — some features, not a clear case
  • Ordinary or alternative neuropathy — other diagnoses and “why not PACVS” reasoning

Gold review shows how different LLMs apply the same rubric to the same V-safe story.

Themes you will see as field tabs

  • Timing and vaccine context (manufacturer, dose, days from vaccine to onset)
  • Neuropathy pattern (small-fiber, patchy, facial/cranial, burning, and similar)
  • Multi-system cluster (fatigue, brain fog, dysautonomia, headache)
  • Severity and course
  • Objections and alternative causes
  • Compensation recommendation (separate clinical judgment—not auto-set from the score)
  • Overall likelihood (1–10) and is_pacvs

PACVS heuristic in cell

Heuristic score (why explanations look like “3/3 temporal”)

A transparent point system (max 10) drives overall_pacvs_likelihood and is_pacvs (typically ≥ 7 means PACVS on current rules). Points come from, for example:

  • Strong temporal link (onset within 42 days of the relevant dose)
  • Patchy or facial/cranial pattern
  • Multi-system cluster (two or more of fatigue, brain fog, dysautonomia, headache)
  • Other common causes ruled out

Models often show this breakdown in the blue pacvs-summary line and score blocks inside the cell explanation. You still read citations to confirm the story supports the points.
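
To show how a point system like this turns into is_pacvs, here is a small Python sketch. The 42-day window, the two-or-more cluster rule, the maximum of 10, and the threshold of 7 come from the text above. The individual weights are hypothetical, chosen only so they sum to 10; the project's actual rubric may assign points differently.

```python
CLUSTER_SYMPTOMS = {"fatigue", "brain fog", "dysautonomia", "headache"}

# HYPOTHETICAL weights, picked only so the maximum is 10.
WEIGHTS = {"temporal": 3, "pattern": 2, "cluster": 3, "ruled_out": 2}
PACVS_THRESHOLD = 7        # "typically ≥ 7" per the lesson text
TEMPORAL_WINDOW_DAYS = 42  # onset within 42 days of the relevant dose


def heuristic_score(onset_days, patchy_or_facial, symptoms, other_causes_ruled_out):
    """Score one registrant and apply the is_pacvs threshold."""
    earned = {
        "temporal": onset_days is not None and 0 <= onset_days <= TEMPORAL_WINDOW_DAYS,
        "pattern": patchy_or_facial,
        "cluster": len(CLUSTER_SYMPTOMS & set(symptoms)) >= 2,
        "ruled_out": other_causes_ruled_out,
    }
    points = {name: WEIGHTS[name] if hit else 0 for name, hit in earned.items()}
    score = sum(points.values())
    return {"points": points, "score": score, "is_pacvs": score >= PACVS_THRESHOLD}


# Made-up inputs for illustration only.
print(heuristic_score(10, True, ["fatigue", "headache"], False))  # score 8, is_pacvs True
print(heuristic_score(60, True, ["fatigue"], True))               # score 4, is_pacvs False
```

Keeping the per-criterion points alongside the total mirrors the breakdown shown in the cell explanation, so each point can be checked against a cited row.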

Key takeaway: PACVS here is a structured, auditable rubric—not a single chat answer. EXACT fields let you see value, reasoning, and source lines for each piece of that rubric.