The best app for learning PDF Document AI

In my opinion, Zotero is the best app for learning PDF Document AI.

Here are some reasons why:

Zotero is Free

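The Zotero application itself is completely free software, with no paid tiers or subscription requirements for the app. The core application, the browser connectors, and local storage of metadata and files can all be used without cost. The only paid option is extra space on Zotero’s own file-sync servers beyond the free quota, and that is entirely optional.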

Zotero is Fully Local

The entire system operates locally on the user’s computer. By default, PDFs and all metadata are stored on the local drive rather than in the cloud. Optional sync is available through Zotero’s own servers or, for attached files, through a WebDAV service, but it is not required and can be avoided entirely for a 100% local setup.

Zotero Can Easily Handle Thousands of PDF Files

Zotero reliably handles libraries containing thousands of PDF files without performance degradation. Users routinely manage 10,000–50,000 items (many with attached multi-hundred-page PDFs) on standard hardware. Full-text indexing remains fast, and the interface stays responsive.

Zotero Can Also Handle File Types Like Audio and Video Transcripts

Zotero accepts any file type as an attachment. Common use cases include attaching .txt or .vtt transcripts from lectures, interviews, podcasts, or videos alongside the metadata about the original audio/video files. Full-text search indexes the transcript content, and notes and tags can be applied to these items in the same way as for PDFs. You can also attach PDF versions of the transcripts, which makes it easy to browse and annotate the transcript inside Zotero itself.

Zotero is Useful Even Without Any AI

Even without integrating any AI tools, Zotero provides substantial value as a standalone reference manager. It extracts metadata from PDFs automatically, renames files according to user-defined patterns, organizes items into collections, supports tags and notes, generates citations and bibliographies, and offers full-text search across the entire library.

In other words, even if your AI experiments do not work out as well as you want, you will still be left with a solid app that has immense practical value, so your learning effort will not go to waste.

Zotero is Open Source

Zotero is open source (AGPL license). The source code for the desktop application, connectors, translators, and related tools is publicly available on GitHub, allowing inspection, modification, and independent verification.

Zotero Data Can Be Easily Exported into an External Tool – And Still Maintain Deep Links to the Zotero PDF Pages

When exporting data for use with external AI tools, you can preserve deep links back to specific pages within PDFs. A clickable zotero://select link selects the exact source item in the Zotero library, and for PDF files a zotero://open-pdf link opens the attachment at a specific page inside Zotero’s built-in PDF viewer.
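As a rough sketch of how an external tool might build such links, assuming the zotero://select and zotero://open-pdf URL forms for items in a personal library: the item keys below are made up, and the function names are my own, not part of any Zotero API.

```python
from urllib.parse import urlencode


def zotero_select_link(item_key: str) -> str:
    """Build a link that selects an item in the personal library."""
    return f"zotero://select/library/items/{item_key}"


def zotero_open_pdf_link(attachment_key: str, page: int) -> str:
    """Build a link that opens a PDF attachment at the given page."""
    query = urlencode({"page": page})
    return f"zotero://open-pdf/library/items/{attachment_key}?{query}"


# Hypothetical keys, for illustration only.
print(zotero_select_link("ABCD1234"))
print(zotero_open_pdf_link("EFGH5678", page=12))
```

Storing the attachment key and page number next to every exported chunk is what lets a search result or an AI answer link straight back to its source page.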

Building an External FastAPI App for the Full Document AI Stack

In my PDF Document AI using Zotero course, we will build a web application using FastAPI.

The app will ingest data exported from Zotero’s SQLite database and implement a full Document AI stack.
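To give a feel for the ingestion step, here is a minimal sketch that lists PDF attachments from a copy of the Zotero database. The table and column names (items, itemAttachments, key, contentType, path, parentItemID) reflect my understanding of the Zotero schema and should be treated as assumptions to verify against your own database. Work on a copy of the file, since Zotero locks the live database while it is running.

```python
import sqlite3

# Assumed schema: items(itemID, key) and
# itemAttachments(itemID, parentItemID, contentType, path).
PDF_QUERY = """
SELECT items.key AS attachment_key,
       itemAttachments.path AS path,
       itemAttachments.parentItemID AS parent_id
FROM itemAttachments
JOIN items ON items.itemID = itemAttachments.itemID
WHERE itemAttachments.contentType = 'application/pdf'
ORDER BY items.itemID
"""


def list_pdf_attachments(db_path: str) -> list[dict]:
    """Return one dict per PDF attachment found in the database copy."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # lets us convert rows to dicts by column name
    try:
        return [dict(row) for row in conn.execute(PDF_QUERY)]
    finally:
        conn.close()


# Usage (path is hypothetical):
# for row in list_pdf_attachments("zotero_copy.sqlite"):
#     print(row["attachment_key"], row["path"])
```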

Note: All heavy inference tasks will be performed via external APIs (primarily OpenRouter), which provides access to frontier models without requiring local GPU resources. I will not be covering local LLMs in this course since they are not yet at the level of quality expected for document AI tasks.

Full-Text Search

Keyword-based search using Whoosh for fast, exact-match queries across the entire corpus.

Note: full-text search is not considered AI. Zotero does have built-in full-text search, but as of this writing it only lists matching items rather than ranked passages with highlighted snippets, so building this in the app will still be a useful exercise.
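To show what a keyword index does under the hood, here is a tiny pure-Python inverted index. This is a stand-in for illustration, not Whoosh itself: in the app, Whoosh would replace it and add ranking, stemming, and on-disk storage. The chunk ids and sample texts are made up.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every term to the set of chunk ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of chunks that contain every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical chunks keyed by "item:page".
docs = {
    "ITEM1:p3": "Adverse events were recorded on the case report form.",
    "ITEM1:p4": "The form lists concomitant medication.",
    "ITEM2:p1": "Interview transcript about data collection.",
}
index = build_index(docs)
print(search(index, "case report"))
```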

Semantic Search

Generate embeddings with sentence-transformers (locally) and store them in FAISS. This will allow you to run similarity searches over your existing PDF files.
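The core idea of similarity search can be sketched without any libraries. The example below uses toy 3-dimensional vectors and a brute-force cosine similarity scan; in the app, the vectors would come from a sentence-transformers model and FAISS would do the nearest-neighbour lookup. The vectors and chunk ids are invented for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], vectors: dict[str, list[float]], k: int = 2):
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy "embeddings"; real ones have hundreds of dimensions.
vectors = {
    "ITEM1:p3": [0.9, 0.1, 0.0],
    "ITEM1:p4": [0.7, 0.3, 0.1],
    "ITEM2:p1": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
for chunk_id, score in top_k(query, vectors):
    print(chunk_id, round(score, 3))
```

A brute-force scan like this is fine for a few thousand chunks; an index such as FAISS matters once the library grows.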

Retrieval-Augmented Generation (RAG)

Combine full-text and semantic retrieval results as context and send them to models via OpenRouter (e.g., Claude 3.5 Sonnet, GPT-4o, or Mixtral) for high-quality responses.

This will generate the kind of “AI Summary Search” responses that you now see in ChatGPT, Gemini, Grok, etc.
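One way to combine the two result lists is reciprocal rank fusion (RRF). That is my choice for this sketch, not the only option, and the constant k=60 is just the conventional default. The fused chunks are then pasted into a prompt. The chunk ids, texts, and prompt wording are placeholders, and the actual API call to the model is left out.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each appearance adds 1 / (k + rank) to a chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def build_prompt(question: str, chunk_ids: list[str], chunks: dict[str, str]) -> str:
    """Assemble the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(f"[{cid}]\n{chunks[cid]}" for cid in chunk_ids)
    return (
        "Answer the question using only the context below. "
        "Cite chunk ids in square brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical chunks and search results.
chunks = {
    "A": "Adverse events were recorded at visit 2.",
    "B": "Patient 004 reported a headache at visit 2.",
    "C": "Visit 3 was completed without findings.",
}
keyword_hits = ["A", "B"]    # e.g. from full-text search
semantic_hits = ["B", "C"]   # e.g. from embedding search
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
print(build_prompt("What did patient 004 report?", fused[:2], chunks))
```

Chunk B ranks first because it appears in both lists, which is the behaviour you want from a hybrid retriever.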

Structured Data Extraction (Pydantic-Based) from PDFs

Use Pydantic models to define and validate structured output schemas. Extract fields from PDFs (e.g., FOIA-released Case Report Forms) with OpenRouter-based structured output endpoints, then store the results in a database such as PostgreSQL or SQLite.
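Here is a minimal sketch of the validation side, assuming the model has been asked to return JSON. The schema and its field names (patient_id, visit_date, adverse_events) are invented for illustration and are not taken from any real Case Report Form; the call to the model itself is not shown.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class CaseReportForm(BaseModel):
    """Hypothetical schema for a few fields of a Case Report Form."""
    patient_id: str
    visit_date: Optional[str] = None
    adverse_events: List[str] = []


def parse_model_output(raw_json: str) -> Optional[CaseReportForm]:
    """Validate the model's JSON reply; return None if it does not fit the schema."""
    try:
        return CaseReportForm(**json.loads(raw_json))
    except (json.JSONDecodeError, ValidationError, TypeError):
        # Malformed JSON, missing or wrong-typed fields, or a non-object reply.
        return None


reply = '{"patient_id": "004", "adverse_events": ["headache"]}'
form = parse_model_output(reply)
print(form)
```

Returning None on failure makes it easy to retry the extraction or flag the page for manual review instead of storing bad rows.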

Extract and chunk text, tables, and images from multi-page PDFs using PyMuPDF, pdfplumber, etc. When needed, run OCR on scanned documents locally via Tesseract.
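The chunking step can be sketched independently of the PDF library. The function below assumes the text has already been extracted page by page (for example with PyMuPDF or pdfplumber) and splits each page into overlapping character windows. The chunk sizes are arbitrary, and keeping the page number with each chunk is my suggestion so that results can link back to the source page.

```python
def chunk_pages(pages: list[str], max_chars: int = 500, overlap: int = 50) -> list[dict]:
    """Split each page's text into overlapping chunks, remembering the page number."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        for start in range(0, len(text), step):
            chunks.append({
                "page": page_number,
                "start": start,
                "text": text[start:start + max_chars],
            })
            if start + max_chars >= len(text):
                break  # this chunk already reaches the end of the page
    return chunks


# Stand-in page texts; real ones would come from the PDF extraction step.
pages = ["a" * 120, "b" * 30]
for chunk in chunk_pages(pages, max_chars=100, overlap=20):
    print(chunk["page"], chunk["start"], len(chunk["text"]))
```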

In turn, you can use the extracted data in the database to power custom dashboards, for example summary views of CRFs.

Agentic AI for PDF Tasks

Build AI agents that operate on the Pydantic-structured data and retrieved MCP chunks.

Agents autonomously decide when to retrieve, when to call structured extraction tools, and when to summarize (e.g., “Give me a complete patient summary for ID 004 from all attached CRFs”). All these LLM calls will likely use the OpenRouter API.
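The control flow of such an agent can be sketched as a small loop: a decision function picks the next tool, the tool runs, and the observation is added to the history. Everything here is a placeholder. The tools return canned strings, and scripted_decide stands in for the LLM call that would normally choose the next action.

```python
from typing import Callable


def retrieve(query: str) -> str:
    """Placeholder for the search step."""
    return f"chunks matching '{query}'"


def extract(patient_id: str) -> str:
    """Placeholder for the structured extraction step."""
    return f"structured record for patient {patient_id}"


TOOLS: dict[str, Callable[[str], str]] = {"retrieve": retrieve, "extract": extract}


def run_agent(task: str, decide: Callable, max_steps: int = 5):
    """Ask `decide` for the next action until it finishes or the step limit is hit."""
    history = []
    for _ in range(max_steps):
        action = decide(task, history)
        if action["tool"] == "finish":
            return action["argument"], history
        observation = TOOLS[action["tool"]](action["argument"])
        history.append((action["tool"], action["argument"], observation))
    return "step limit reached", history


def scripted_decide(task: str, history: list) -> dict:
    """Stand-in for the model: follows a fixed plan, one step per call."""
    plan = [
        {"tool": "retrieve", "argument": "patient 004"},
        {"tool": "extract", "argument": "004"},
        {"tool": "finish", "argument": "summary for patient 004"},
    ]
    return plan[len(history)]


answer, history = run_agent("Summarize patient 004", scripted_decide)
print(answer)
for step in history:
    print(step)
```

The step limit matters in practice: it caps the number of API calls an agent can make on a single task, which keeps costs predictable.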

This combination of features makes Zotero an ideal app for learning how a local PDF Document AI pipeline works. Depending on the size of your Zotero library, you can do it for very little cost even if you are using state-of-the-art LLMs.
