Video Transcript RAG Search

MediaJots Overview
What MediaJots Is

  • How it is built — Python backend + pywebview bridge + local ui/index.html frontend.
  • What it helps with — turning long audio/video into something you can read and work with: transcript, summary, and structured OutScript.
  • What you can do in-app — tabs for OutScript workflow, Podcast RSS ingest, Keyword Search, AI Search, History, and Settings.
  • Why it feels fast after first run — transcript packages, metadata, and read-state are cached in SQLite.
End-to-End Workflow in the OutScript Tab

  1. Get Details — URL is normalized; media stream and metadata are resolved (title, duration, audio URL/ext, platform behavior).
  2. Prepare/Transcribe — audio is staged locally, uploaded to AssemblyAI, and polled until transcript completion.
  3. Hydrate transcript package — app stores raw transcript, sentence timing JSON, utterance/speaker-derived data, and cache metadata.
  4. Generate with LLM — OpenRouter transforms transcript into OutScript JSON + markdown article + plain-language summary.
  5. Reopen from cache — subsequent visits to the same source URL can load details/transcript/outscript directly from video_cache.

In practice, this means the first run does the heavy lifting, and repeat visits are much quicker.

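# 1. Resolve metadata and the audio stream for the source URL.
details = api.get_video_details(video_url)
# 2. Upload the staged audio and poll until the transcript is complete.
tx = api.transcribe_audio_url(src, audio_url, title, description, stage_id)
# 3. Turn the transcript into OutScript JSON, a markdown article, and a summary.
llm = api.process_transcript_with_llm(src, transcript_text, title, description, sentences_json, speaker_names_json)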
Data Model and Persistence Behavior

  • app_settings keeps provider keys and model choice (assembly_ai_api_key, openrouter_api_key, openrouter_model).
  • video_cache keeps source URL, media metadata, transcript payloads, OutScript JSON, summary text, read state, and estimated time saved.
  • transcript_minute_index + transcript_minute_fts power exact-term minute search.
  • ai_search_chunks stores embedding chunks for semantic retrieval and RAG answers.
  • Cache behavior uses canonical URLs and upserts, so you do not end up with duplicate rows for the same media.
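
The canonical-URL-plus-upsert idea above can be sketched with the standard library. This is a minimal illustration, not the app's actual code: canonical_url and upsert_video are hypothetical helpers, the normalization rules are assumed, and the video_cache table is cut down to three columns.

```python
import sqlite3
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    """Hypothetical normalizer: lowercase host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

conn = sqlite3.connect(":memory:")
# Simplified stand-in for video_cache; the real table has many more columns.
conn.execute(
    "CREATE TABLE video_cache (source_url TEXT PRIMARY KEY, video_title TEXT, updated_at TEXT)"
)

def upsert_video(conn, url, title, updated_at):
    """Insert a row, or update the existing row for the same canonical URL."""
    conn.execute(
        """
        INSERT INTO video_cache (source_url, video_title, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT(source_url) DO UPDATE SET
            video_title = excluded.video_title,
            updated_at = excluded.updated_at
        """,
        (canonical_url(url), title, updated_at),
    )

# Two spellings of the same media URL end up in a single row.
upsert_video(conn, "HTTPS://Example.com/watch/?v=1#t=30", "First title", "2024-01-01")
upsert_video(conn, "https://example.com/watch?v=1", "Edited title", "2024-01-02")
rows = conn.execute("SELECT source_url, video_title FROM video_cache").fetchall()
```

Because the primary key is the normalized URL, the second call updates the first row instead of adding a duplicate.
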
OutScript Generation and Speaker-Aware Structure

  • Structured output — OutScript is validated with typed models and normalized before save.
  • Speaker-aware output — diarization hints and speaker-map reconciliation are applied before final rendering.
  • Safe fallback path — if strict model output fails, the app falls back to transcript-derived structure so you still get usable output.
  • Editable results — speaker names and derived markdown/sections can be saved back with save_processed_data.
Search Capabilities

  • Keyword Search (lexical):
      • Builds minute rows from sentence timestamps (or word-count fallback).
      • Uses FTS5 when available, with SQL LIKE fallback when not.
      • Returns seek offsets so users can jump back into transcript context.
  • AI Search (semantic + RAG):
      • Embeds minute chunks with OpenRouter embeddings.
      • Retrieves top similar chunks via sqlite-vec or Python cosine fallback.
      • Generates grounded answers with strict JSON validation and source citations.

Use both together: keyword search for exact terms, AI search for intent-level questions.

kw = api.search_keyword_minutes(query="vector", platform="YouTube", limit=300)
ai = api.ai_search_query("Where does the speaker explain vector fallback?")
Additional Platform Features

  • Podcast RSS ingest with episode listing and "Add to OutScript" handoff.
  • History tools with open/delete actions, bulk clear, and read tracking.
  • Read/time-saved metrics based on transcript/OutScript reading-time estimates.
  • Clear status and error messaging across extraction, transcription, indexing, and LLM calls.
  • Settings-driven controls including keyword and AI index rebuild triggers.
How This Chapter Connects to Later Chapters

  • Chapter 1 (this file) gives the system-level map.
  • Chapter 2 (How to Use MediaJots) is the practical user walkthrough.
  • Chapter 3 (Keyword Search) deep-dives lexical indexing and minute retrieval internals.
  • Chapter 4 (AI Search) deep-dives embeddings, vector retrieval, and RAG answer generation.

Use this chapter as the orientation layer before the feature-specific chapters.

How to Use MediaJots
Setup and Required Keys

Settings tab

Before running workflows, open the Settings tab and configure:

  • assembly_ai_api_key for transcription
  • openrouter_api_key for summary/OutScript and AI Search
  • openrouter_model for LLM generation

After saving settings, use the index buttons when needed:

  • Build Keyword Index
  • Build AI Search Index

This ensures search tabs are ready after you process content.

Process a Video End-to-End

Use the OutScript tab in this order:

  1. Paste a supported video URL.
  2. Click Get Details to resolve media metadata and stream URL.
  3. Click Generate OutScript to run transcription + LLM processing.
  4. Review transcript, summary, and OutScript output.
  5. Click Save to persist final edited state.

The app automatically reuses cache when the same source URL is reopened.

details = api.get_video_details(video_url)
tx = api.transcribe_audio_url(src, audio_url, title, description, stage_id)
llm = api.process_transcript_with_llm(src, tx["text"], title, description, sentences_json, speaker_names_json)
Use Podcast RSS Input

If your content source is podcast-first:

  1. Open Podcast tab.
  2. Paste RSS feed URL.
  3. Click Load Feed.
  4. Choose an episode and click Add to OutScript.
  5. Continue in OutScript tab with Get Details / Generate OutScript flow.

This bridges feed discovery and the same transcript workflow used for video URLs.

Review and Edit Speaker-Aware Output

After generation:

  • Verify transcript quality in the transcript panel.
  • Review summary for clarity and accuracy.
  • Inspect OutScript sections, timestamps, and speaker labeling.
  • Adjust speaker names where needed.
  • Save processed data to persist edits.

Use this step to make the OutScript accurate and easy to revisit later.

Run Keyword Search in Practice

Use Keyword Search when you need exact-term lookup:

  1. Rebuild minute index (if content changed).
  2. Enter query and optional platform filter.
  3. Open a result via Open Transcript.
  4. Continue analysis in OutScript tab at the relevant moment.

api.rebuild_keyword_index()
rows = api.search_keyword_minutes("speaker labels", "YouTube", 300)
Run AI Search in Practice

Use AI Search for intent-level questions:

  1. Rebuild AI index after new transcript content is added.
  2. Ask a natural-language question.
  3. Read answer + citations.
  4. Inspect retrieved source chunks for verification.

api.rebuild_ai_search_index()
answer = api.ai_search_query("What are the main optimization strategies discussed?")
Use History and Read Tracking

The History tab helps you manage processed assets:

  • reopen prior items quickly
  • delete individual transcript packages
  • clear all transcript cache when needed
  • track read state and estimated time saved

Mark transcripts as read when you finish reviewing to keep progress meaningful.

Recommended Daily Workflow

For consistent output quality:

  1. Confirm settings/API keys.
  2. Process new media in OutScript tab.
  3. Edit speakers and save final output.
  4. Rebuild Keyword + AI indexes.
  5. Use searches to revisit key moments and reinforce understanding.
  6. Mark transcripts as read to track your learning progress.

This sequence keeps your archive searchable and your learning progress easy to track.

Keyword Search
What Keyword Search Solves

Why Keyword Search Exists in MediaJots

Keyword Search is your fast lookup layer for media you have already transcribed. Instead of scanning full transcripts, you jump straight to the minute-level segments that match what you typed.

In day-to-day use, it solves three recurring problems:

  • Recall: "Where did this speaker mention X?"
  • Navigation: "Open the transcript near that moment."
  • Cross-item lookup: "Find matches across all cached videos/podcasts."

This is intentionally separate from AI Search:

  • Keyword Search is lexical (FTS/LIKE over text rows).
  • AI Search is semantic (embedding vectors + RAG answer generation).

This chapter is all about how that keyword path works.

result = api.search_keyword_minutes("speaker labels", "YouTube", 300)
Data Prerequisites and Source of Truth

What Must Exist Before Keyword Search Works

Keyword Search only indexes what is already in video_cache. If raw_transcript_text is empty, there is nothing to index.

The primary source query comes from AppDatabase.list_video_cache_for_keyword_index(), which returns:

  • source_url
  • video_title
  • video_description
  • sentences_json
  • raw_transcript_text
  • speaker_names_json
  • updated_at

So the indexing pipeline assumes transcript generation already happened in the OutScript workflow.

Important detail:

  • If a user has never transcribed media, index rebuild succeeds structurally but produces zero searchable minute rows.
Minute-Level Chunking Strategy

How MediaJots Converts a Transcript into Searchable Minute Rows

The core transformation lives in Api._minute_rows_for_item(item).

Preferred path: sentence timestamps

If sentences_json is available, MediaJots:

  1. Iterates sentence records.
  2. Reads each sentence text and start timestamp (milliseconds).
  3. Buckets sentence text into minute = start // 60000.
  4. Optionally prefixes speaker name/label ([Speaker X] ...) using speaker_names_json.
  5. Joins all sentence text in the same minute into one transcript chunk.

For each minute bucket, it emits one row containing:

  • platform
  • title
  • description
  • minute label (h:mm)
  • updated timestamp
  • transcript chunk
  • source URL

Fallback path: no sentence timing

If sentence-level JSON is missing, it falls back to simple transcript slicing:

  • Splits raw_transcript_text into words.
  • Groups by 160 words per synthetic minute.
  • Emits rows with minute = index // 160.

That fallback keeps search usable even when timing metadata is imperfect.
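
The two chunking paths described above can be sketched as follows. This is an illustration under assumptions, not the body of Api._minute_rows_for_item: the sentence field names (text, start, speaker) and the minute_chunks helper are hypothetical, and only the minute number, label, and joined text are emitted.

```python
import json
from collections import defaultdict

WORDS_PER_MINUTE = 160  # synthetic-minute size used by the fallback path

def minute_label(minute: int) -> str:
    """Format a minute count as h:mm, e.g. 61 -> '1:01'."""
    return f"{minute // 60}:{minute % 60:02d}"

def minute_chunks(sentences_json: str, raw_text: str, speaker_names: dict) -> list:
    """Group transcript text into one chunk per minute."""
    buckets = defaultdict(list)
    sentences = json.loads(sentences_json) if sentences_json else []
    if sentences:
        # Preferred path: bucket by sentence start time (milliseconds).
        for s in sentences:
            text = (s.get("text") or "").strip()
            if not text:
                continue
            minute = int(s.get("start", 0)) // 60000
            speaker = s.get("speaker")
            if speaker:
                name = speaker_names.get(speaker, f"Speaker {speaker}")
                text = f"[{name}] {text}"
            buckets[minute].append(text)
    else:
        # Fallback path: every 160 words count as one synthetic minute.
        for index, word in enumerate(raw_text.split()):
            buckets[index // WORDS_PER_MINUTE].append(word)
    return [
        {"minute": m, "minute_label": minute_label(m), "transcript": " ".join(buckets[m])}
        for m in sorted(buckets)
    ]

demo_sentences = json.dumps([
    {"text": "Welcome back.", "start": 5000, "speaker": "A"},
    {"text": "Today: vectors.", "start": 59000, "speaker": "B"},
    {"text": "Let's begin.", "start": 61000, "speaker": "A"},
])
sentence_rows = minute_chunks(demo_sentences, "", {"A": "Host"})
fallback_rows = minute_chunks("", "word " * 200, {})
```

In the demo, the first two sentences share minute 0 and the third lands in minute 1. The 200-word fallback transcript splits into a 160-word chunk and a 40-word chunk.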

Index Rebuild Flow (Backend API)

What Happens When You Click "Rebuild Minute Index"

UI triggers window.pywebview.api.rebuild_keyword_index(), mapped to Api.rebuild_keyword_index().

Backend flow:

  1. Fetch all eligible transcript cache rows via list_video_cache_for_keyword_index().
  2. Expand each item into minute rows using _minute_rows_for_item.
  3. Persist rows with db.rebuild_transcript_minute_index(rows).
  4. Fetch platform filter options with db.list_keyword_platforms().
  5. Return structured status:
      • success flag
      • message (Indexed N minute row(s).)
      • indexed row count
      • platform list

rebuild_transcript_minute_index() is a full refresh, not incremental:

  • Deletes existing transcript_minute_index
  • Deletes existing transcript_minute_fts (when available)
  • Reinserts all normalized rows
  • Rebuilds FTS mirror rows aligned by rowid

Practically, this keeps results consistent after transcript edits or cache changes.

items = db.list_video_cache_for_keyword_index()
rows = [r for item in items for r in api._minute_rows_for_item(item)]
inserted = db.rebuild_transcript_minute_index(rows)
SQLite Storage Model and FTS Design

How Keyword Data Is Stored for Fast Lookup

MediaJots uses two synchronized SQLite tables for keyword retrieval:

  • transcript_minute_index (authoritative structured rows)
  • transcript_minute_fts (FTS5 virtual table for full-text matching)

Structured table (transcript_minute_index)

Stores:

  • metadata (platform, title, description, source_url, updated_at)
  • navigation fields (minute_start, minute_label)
  • searchable body (transcript)

minute_start stores the h:mm label converted to total minutes (for example, 1:05 becomes 65) and is used for sorting and seek calculations.

Full-text table (transcript_minute_fts)

Created with FTS5 when available and indexed on:

  • platform, title, description, transcript

Fields like source_url and minute_label are unindexed payload columns (kept for display/navigation joins).

The goal here is simple:

  • Use FTS when available for better speed and matching.
  • Keep a fallback path for environments where FTS is unavailable.
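
A rough sketch of the two-table layout and the full-refresh rebuild, using only the sqlite3 module. The column list follows the description above, but the exact DDL, the id primary key, and the helper names (label_to_minutes, create_keyword_tables, rebuild_index) are assumptions made for illustration.

```python
import sqlite3

def label_to_minutes(label: str) -> int:
    """Convert an h:mm label such as '1:05' to total minutes (65)."""
    hours, minutes = label.split(":")
    return int(hours) * 60 + int(minutes)

def create_keyword_tables(conn: sqlite3.Connection) -> bool:
    """Create both tables; return False when this SQLite build lacks FTS5."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS transcript_minute_index (
               id INTEGER PRIMARY KEY, platform TEXT, title TEXT, description TEXT,
               minute_start INTEGER, minute_label TEXT, transcript TEXT,
               source_url TEXT, updated_at TEXT)"""
    )
    try:
        conn.execute(
            """CREATE VIRTUAL TABLE IF NOT EXISTS transcript_minute_fts USING fts5(
                   platform, title, description, transcript,
                   source_url UNINDEXED, minute_label UNINDEXED)"""
        )
        return True
    except sqlite3.OperationalError:
        return False  # FTS5 module not available; search must use LIKE

def rebuild_index(conn, rows, has_fts) -> int:
    """Full refresh: wipe both tables, then reinsert with matching rowids."""
    with conn:  # single transaction: an exception rolls the whole rebuild back
        conn.execute("DELETE FROM transcript_minute_index")
        if has_fts:
            conn.execute("DELETE FROM transcript_minute_fts")
        for r in rows:
            cur = conn.execute(
                "INSERT INTO transcript_minute_index (platform, title, description,"
                " minute_start, minute_label, transcript, source_url, updated_at)"
                " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                (r["platform"], r["title"], r["description"],
                 label_to_minutes(r["minute_label"]), r["minute_label"],
                 r["transcript"], r["source_url"], r["updated_at"]),
            )
            if has_fts:
                # Mirror row keeps the same rowid so the two tables can be joined.
                conn.execute(
                    "INSERT INTO transcript_minute_fts (rowid, platform, title,"
                    " description, transcript, source_url, minute_label)"
                    " VALUES (?, ?, ?, ?, ?, ?, ?)",
                    (cur.lastrowid, r["platform"], r["title"], r["description"],
                     r["transcript"], r["source_url"], r["minute_label"]),
                )
    return conn.execute("SELECT COUNT(*) FROM transcript_minute_index").fetchone()[0]

conn = sqlite3.connect(":memory:")
has_fts = create_keyword_tables(conn)
demo_rows = [{
    "platform": "YouTube", "title": "Vector talk", "description": "",
    "minute_label": "1:05", "transcript": "[Host] sqlite-vec fallback",
    "source_url": "https://example.com/v1", "updated_at": "2024-01-02",
}]
rebuild_index(conn, demo_rows, has_fts)
count = rebuild_index(conn, demo_rows, has_fts)  # a second rebuild must not duplicate rows
stored_minute = conn.execute("SELECT minute_start FROM transcript_minute_index").fetchone()[0]
```

Running the rebuild twice still leaves one row, which is the point of the full-replace design.
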
Query Execution Path and Fallback Logic

How a Search Query Is Executed

UI call: window.pywebview.api.search_keyword_minutes(query, platform, limit)

Backend call chain:

Api.search_keyword_minutes(...) -> AppDatabase.search_transcript_minute_index(...)

Query behavior

  • query is trimmed text input.
  • platform is optional exact filter.
  • limit is clamped to [1, 2000] (UI typically passes 300).

Two query modes

  1. With query text
      • Try FTS5:
          • transcript_minute_fts MATCH ?
          • Join to transcript_minute_index for full row data
      • If FTS is unavailable, fall back to SQL LIKE across:
          • platform
          • title
          • description
          • transcript
  2. Without query text
      • Return recent index rows (optionally filtered by platform), sorted and limited.

Response row shape

Each result includes:

  • platform, title, description
  • minute_label, updated_at
  • transcript, source_url
  • minute_start_minutes
  • computed seek_offset_ms = minute_start * 60 * 1000

That seek_offset_ms is the key that enables jump-to-moment navigation.

SELECT i.title, i.minute_label, i.transcript
FROM transcript_minute_fts f
JOIN transcript_minute_index i ON i.id = f.rowid
WHERE transcript_minute_fts MATCH ?
LIMIT 300;
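
To show how the FTS attempt, the LIKE fallback, the limit clamp, and the seek offset fit together, here is a small Python sketch. search_minutes is a hypothetical stand-in for the real search method, and the demo deliberately creates only the structured table, so the LIKE path runs no matter how SQLite was built.

```python
import sqlite3

def search_minutes(conn, query, platform=None, limit=300):
    """Search minute rows: FTS5 MATCH first, SQL LIKE if that fails."""
    limit = max(1, min(int(limit), 2000))  # clamp to [1, 2000]
    query = (query or "").strip()
    order = " ORDER BY i.updated_at DESC, i.title, i.minute_start LIMIT ?"
    plat_sql = " AND i.platform = ?" if platform else ""
    plat_args = [platform] if platform else []

    def run(where, args):
        sql = "SELECT i.* FROM transcript_minute_index i" + where + plat_sql + order
        return conn.execute(sql, [*args, *plat_args, limit]).fetchall()

    if not query:
        rows = run(" WHERE 1=1", [])  # no text: return recent rows
    else:
        try:
            rows = run(
                " JOIN transcript_minute_fts f ON f.rowid = i.id"
                " WHERE transcript_minute_fts MATCH ?", [query])
        except sqlite3.OperationalError:
            # FTS table missing (or query not valid FTS syntax): use LIKE.
            like = f"%{query}%"
            rows = run(
                " WHERE (i.platform LIKE ? OR i.title LIKE ?"
                " OR i.description LIKE ? OR i.transcript LIKE ?)", [like] * 4)
    results = []
    for r in rows:
        item = dict(r)
        item["seek_offset_ms"] = r["minute_start"] * 60 * 1000
        results.append(item)
    return results

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be read by column name
conn.execute(
    "CREATE TABLE transcript_minute_index (id INTEGER PRIMARY KEY, platform TEXT,"
    " title TEXT, description TEXT, minute_start INTEGER, minute_label TEXT,"
    " transcript TEXT, source_url TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO transcript_minute_index (platform, title, description, minute_start,"
    " minute_label, transcript, source_url, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("YouTube", "Talk A", "", 2, "0:02", "vector fallback explained", "https://example.com/a", "2024-01-01"),
        ("YouTube", "Talk B", "", 0, "0:00", "intro and greetings", "https://example.com/b", "2024-01-02"),
        ("Podcast", "Show C", "", 7, "0:07", "more on vector search", "https://example.com/c", "2024-01-03"),
    ],
)
hits = search_minutes(conn, "vector")
youtube_hits = search_minutes(conn, "vector", platform="YouTube")
recent = search_minutes(conn, "", limit=0)  # limit is clamped up to 1
```

The newest item comes first, and each hit carries the millisecond offset the UI needs to seek.
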
Frontend UX and Interaction Lifecycle

Keyword Search UI Behavior in ui/index.html

The keywordView tab contains:

  • rebuild button (keywordRefreshBtn)
  • platform filter dropdown (keywordPlatformFilter)
  • search input (keywordQueryInput)
  • search button (keywordSearchBtn)
  • result count + results list

Rebuild lifecycle

rebuildKeywordIndex():

  • disables rebuild button
  • shows "Building minute index…"
  • calls backend rebuild API
  • refreshes platform filter options
  • clears current result list
  • restores button state

Search lifecycle

searchKeywordIndex():

  • validates non-empty query
  • shows "Searching keyword index…"
  • calls backend search API
  • renders rows with highlighted query matches
  • updates count and status text

Result item rendering

renderKeywordResults() shows for each row:

  • title
  • platform + minute + updated time
  • description snippet
  • transcript snippet
  • action button: Open Transcript

When you click Open Transcript, openKeywordInWorkflow(source_url, seek_offset_ms) switches to the main video workflow and opens the media at the start of that minute.

Seek and Navigation Integration

How Keyword Search Connects Back to OutScript Workflow

Keyword Search is not a standalone viewer; it is a discovery layer that routes you back into the main media/transcript workflow.

openKeywordInWorkflow(sourceUrl, seekOffsetMs) does the following:

  1. Stores pending seek offset (pendingOutscriptSeekMs).
  2. Activates the main "video" tab.
  3. Writes source_url into the workflow URL input.
  4. Calls getVideoDetails() to load cached/extracted media state.
  5. Lets downstream workflow components scroll or seek to that timestamp.

This creates a smooth handoff:

  • Search result -> exact source media -> contextual transcript/outscript navigation.

The practical UX is: find first, then deep-read in the right place.

Reliability, Edge Cases, and Performance Notes

Practical Edge Cases and Why the Design Holds Up

1) Missing timing metadata

When sentence timestamps are unavailable, the word-count fallback still lets indexing run. Precision is lower, but discoverability remains.

2) Environments without SQLite FTS5

Search automatically falls back to LIKE queries. That keeps functionality available across more Python/SQLite builds, with some trade-off in performance and matching quality.

3) Speaker labeling quality

If diarization labels exist, minute chunks include bracketed speaker markers, so a search for a speaker's name or role can match the minutes where that person talks.

4) Rebuild consistency

Index rebuild uses full replace. That avoids stale fragments when transcript content changes.

5) Result ordering

Rows are sorted by:

  • newest updated_at first,
  • then title,
  • then minute order.

So you see recent media first, while preserving timeline order inside each item.

End-to-End Execution Trace

Full Request Trace: From Button Click to Clickable Result

  1. User transcribes media in OutScript workflow (data lands in video_cache).
  2. User opens Keyword Search tab.
  3. User clicks Rebuild Minute Index.
  4. Backend expands transcript content to minute rows and rebuilds SQLite index tables.
  5. User enters query + optional platform filter.
  6. Backend executes FTS5 (or LIKE fallback) and returns matching minute rows.
  7. UI renders highlighted snippets and counts.
  8. User clicks Open Transcript on a match.
  9. App reopens source media in workflow context with minute seek offset.

That loop is the heart of Keyword Search in MediaJots: index once, query fast, jump to context.

AI Search
What AI Search Is Designed to Do

Why AI Search Exists Alongside Keyword Search

AI Search in MediaJots is the semantic layer for your transcript archive. Instead of exact text matching, it finds conceptually related chunks and then builds a grounded answer with citations.

The feature solves questions like:

  • "What are the main arguments about topic X across my archive?"
  • "Where did speakers discuss this idea, even if they used different wording?"
  • "Give me a concise answer and show which transcript chunks support it."

Compared to Keyword Search:

  • Keyword Search: lexical lookup over minute text.
  • AI Search: vector similarity + LLM response constrained to retrieved sources.
Core Architecture and Components

High-Level AI Search Stack in MediaJots

AI Search is split into three layers:

  1. Data preparation + embeddings (api.py + ai_search.py)
  2. Vector/chunk storage + nearest-neighbor retrieval (db.py)
  3. RAG answer generation + UI rendering (ai_search.py + ui/index.html)

Main backend entry points

  • Api.rebuild_ai_search_index()
  • Api.ai_search_status()
  • Api.ai_search_query(question)

External model/API usage

  • OpenRouter embeddings endpoint for chunk vectors
  • OpenRouter chat endpoint for final grounded JSON answer

Persistence layer

  • ai_search_chunks SQLite table stores chunk metadata + serialized vectors
  • Optional sqlite-vec acceleration if available
  • Pure-Python cosine fallback if vector extension is unavailable

That design keeps AI Search working both in high-performance setups and in more minimal environments.

status = api.ai_search_status()
result = api.ai_search_query("How is cosine fallback handled?")
Index Build Prerequisites and Source Rows

What Data AI Search Reuses from the Transcript Pipeline

Just like Keyword Search, AI Search starts from transcript cache entries you already generated.

Api.rebuild_ai_search_index() calls:

  • db.load_settings() to fetch API key/model settings
  • db.list_video_cache_for_keyword_index() to fetch transcript-bearing media rows

It expands each item through _minute_rows_for_item(item) (the same minute-chunking logic used by keyword indexing). So both search systems are built on the same chunk foundation.

Input guarantees before a useful AI index can be built:

  • OpenRouter API key exists in settings
  • At least one cached item has transcript text
  • Minute rows can be generated from sentence timestamps or fallback chunking

If prerequisites fail, the API returns a clear message instead of silently building a broken partial index.

Chunk Construction and Embedding Generation

How MediaJots Builds Embedding Inputs

Inside rebuild_ai_search_index():

  1. For each minute row, read:
      • title
      • transcript chunk body
      • minute label
      • source URL/platform
  2. Create chunk_text from transcript body (trimmed).
  3. Build embedding input as:
      • title + "\n\n" + chunk_text
  4. Trim text length to model-safe limits (<= 7500 chars).

Why include title + chunk text together:

  • It injects topical context into embedding space.
  • Similar questions can retrieve chunks that might not repeat all core terms in the body itself.

Embedding generation is batched in fetch_embeddings_openrouter():

  • model: openai/text-embedding-3-small (default)
  • batch size: 32
  • timeout controls + robust error handling
  • strict vector count validation (must align 1:1 with inputs)

If anything mismatches (or request fails), rebuild stops with a clear error payload.
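
The input-building and batching rules above can be sketched without any network calls. The constants match the values named in the text. build_embedding_input, embed_all, and fake_embed are hypothetical names, and fake_embed stands in for the real OpenRouter request.

```python
MAX_EMBED_CHARS = 7500
EMBED_BATCH_SIZE = 32

def build_embedding_input(title: str, chunk_text: str) -> str:
    """Prefix the chunk with its title, then cap the length."""
    combined = f"{title}\n\n{chunk_text.strip()}"
    return combined[:MAX_EMBED_CHARS]

def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(texts, embed_batch):
    """Embed texts batch by batch; return (vectors, error_message)."""
    vectors = []
    for batch in batched(texts, EMBED_BATCH_SIZE):
        result = embed_batch(batch)
        if len(result) != len(batch):
            # Vector count must align 1:1 with inputs, otherwise stop.
            return None, f"Expected {len(batch)} vectors, got {len(result)}"
        vectors.extend(result)
    return vectors, None

batch_sizes = []

def fake_embed(batch):
    """Stand-in for the network call: one tiny fake vector per text."""
    batch_sizes.append(len(batch))
    return [[float(len(text)), 1.0] for text in batch]

texts = [build_embedding_input("Vector talk", f"chunk {i}") for i in range(70)]
vectors, err = embed_all(texts, fake_embed)
bad_vectors, bad_err = embed_all(texts, lambda batch: [])
```

Seventy inputs go out as batches of 32, 32, and 6. A stub that returns the wrong number of vectors stops the run with an error instead of a partial result.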

vectors, err = fetch_embeddings_openrouter(api_key, texts, model="openai/text-embedding-3-small")
inserted = db.replace_ai_search_chunks(db_rows)
Vector Storage and Retrieval Strategy

How Embeddings Are Stored and Queried

Embeddings are persisted in ai_search_chunks with metadata per chunk:

  • source_url
  • seek_offset_ms
  • minute_label
  • platform
  • title
  • chunk_text
  • embedding (JSON serialized float list)
  • updated_at

Rebuild behavior in replace_ai_search_chunks(rows) is full replacement:

  • delete all old rows
  • insert normalized new rows

This keeps the index consistent after transcript updates.

Retrieval path: nearest chunks

search_ai_similar_chunks(query_embedding, k=10):

  • If sqlite-vec is usable:
      • compute cosine distance in SQL (vec_distance_cosine)
      • order ascending distance
  • Else:
      • load vectors into Python
      • compute cosine distance manually
      • sort and take top-k

Returned rows always include distance + navigation metadata used by the UI and citation rendering.
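
The pure-Python fallback path is simple enough to sketch in full. This assumes embeddings are stored as JSON-serialized float lists, as described above. cosine_distance and top_k_chunks are hypothetical names, and the zero-vector handling is one reasonable choice rather than the app's documented behavior.

```python
import json
import math

def cosine_distance(a, b) -> float:
    """1 - cosine similarity; zero-length vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def top_k_chunks(query_embedding, rows, k=10):
    """Score every stored chunk against the query and keep the k nearest."""
    scored = []
    for row in rows:
        vector = json.loads(row["embedding"])  # stored as a JSON float list
        scored.append({**row, "distance": cosine_distance(query_embedding, vector)})
    scored.sort(key=lambda item: item["distance"])  # ascending: nearest first
    return scored[:k]

stored = [
    {"title": "Talk A", "minute_label": "0:02", "embedding": json.dumps([0.0, 1.0])},
    {"title": "Talk B", "minute_label": "0:05", "embedding": json.dumps([1.0, 0.0])},
    {"title": "Talk C", "minute_label": "0:09", "embedding": json.dumps([1.0, 1.0])},
]
nearest = top_k_chunks([1.0, 0.0], stored, k=2)
```

Every vector is loaded and scored, so cost grows linearly with archive size. That is why sqlite-vec matters more as the archive grows.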

Question Flow and Query Embedding

What Happens When User Asks an AI Search Question

UI calls window.pywebview.api.ai_search_query(question).

Backend validation steps in Api.ai_search_query():

  1. Ensure OpenRouter key exists.
  2. Ensure question text is non-empty.
  3. Ensure AI index has chunks (count_ai_search_chunks() > 0).

Then:

  1. Embed the user question via fetch_embeddings_openrouter(api_key, [question]).
  2. Retrieve top 10 nearest chunks via db.search_ai_similar_chunks(...).
  3. Hydrate speaker labels in chunk text (hydrate_bracketed_diarization_tags) using cached speaker hints from video_cache.
  4. Build ranked source objects with:
  • rank number
  • URL/title/platform/minute metadata
  • chunk text
  • distance

If no similar chunks are found, the call exits early with a clear message along the lines of "No similar chunks found".

RAG Prompting and Structured Answer Contract

How the Final AI Answer Is Generated and Validated

Answer synthesis is handled by run_ai_search_rag(...) in ai_search.py.

Prompt construction

Retrieved chunks are serialized into numbered blocks:

  • SOURCE [1] ...
  • SOURCE [2] ...

Prompt rules force grounding:

  • answer only from supplied sources
  • admit insufficient evidence when needed
  • return strict JSON with:
      • answer (string)
      • citations (list of source numbers)
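
A minimal sketch of how numbered source blocks and grounding rules could be assembled into one prompt. Only the SOURCE [n] numbering and the three rules come from the description above. The exact wording, the block layout, and the build_rag_prompt helper are assumptions.

```python
def build_rag_prompt(question: str, sources: list) -> str:
    """Serialize retrieved chunks as numbered blocks and append the question."""
    rules = (
        "Answer only from the supplied sources.\n"
        "If the sources are insufficient, say so.\n"
        'Return strict JSON: {"answer": string, "citations": [source numbers]}.'
    )
    blocks = []
    for number, src in enumerate(sources, start=1):
        header = f"SOURCE [{number}] {src['title']} ({src['minute_label']})"
        blocks.append(f"{header}\n{src['chunk_text']}")
    return rules + "\n\n" + "\n\n".join(blocks) + f"\n\nQUESTION: {question}"

prompt = build_rag_prompt(
    "How is cosine fallback handled?",
    [
        {"title": "Vector talk", "minute_label": "0:12", "chunk_text": "[Host] We fall back to Python."},
        {"title": "Vector talk", "minute_label": "0:13", "chunk_text": "[Guest] sqlite-vec is optional."},
    ],
)
```

Numbering starts at 1, so the citation numbers in the model's answer map directly to positions in the ranked source list.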

Schema enforcement

MediaJots uses a Pydantic model (AiSearchAnswerPayload) and sends JSON schema constraints to OpenRouter response_format.

If provider rejects strict schema mode:

  • code retries with generic json_object mode

Returned text is then:

  1. parsed from raw model output
  2. schema-validated
  3. citation-normalized to in-range source IDs only

Only validated output is treated as success.

This greatly reduces malformed responses and keeps citation mapping predictable.

parsed = parse_llm_json_payload(raw)
validated = AiSearchAnswerPayload.model_validate(parsed)
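
The parse, validate, and normalize sequence can be illustrated with the standard library alone. This sketch uses manual type checks as a stand-in for the Pydantic model. validate_answer is a hypothetical helper, and dropping duplicate citations is an assumption that goes beyond the in-range rule stated above.

```python
import json

def validate_answer(raw: str, source_count: int):
    """Return (payload, error). Payload is None unless every check passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"parse failed: {exc.msg}"
    if (
        not isinstance(data, dict)
        or not isinstance(data.get("answer"), str)
        or not isinstance(data.get("citations"), list)
    ):
        return None, "schema mismatch"
    citations = []
    for value in data["citations"]:
        is_int = isinstance(value, int) and not isinstance(value, bool)
        # Keep only source numbers that exist; skip repeats (assumed behavior).
        if is_int and 1 <= value <= source_count and value not in citations:
            citations.append(value)
    return {"answer": data["answer"].strip(), "citations": citations}, None

ok, ok_err = validate_answer('{"answer": " Uses Python cosine. ", "citations": [2, 9, 2, 1]}', 3)
broken, broken_err = validate_answer("not json", 3)
wrong, wrong_err = validate_answer('{"answer": 42, "citations": []}', 3)
```

With three sources supplied, citation 9 is out of range and is dropped. Non-JSON output and wrongly typed fields are reported as two distinct failures.
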
Frontend UX: Status, Sources, and Citations

How AI Search Appears in the Desktop UI

In ui/index.html, AI Search tab behavior includes:

  • rebuild index action (rebuild_ai_search_index)
  • run query action (ai_search_query)
  • status text updates for each phase
  • answer panel
  • source list panel
  • citation panel
  • sqlite-vec capability indicator pill

When query succeeds, UI shows:

  • human-readable answer text
  • ranked source chunks used in retrieval
  • citations that map to source numbers

If a query fails, the UI still shows fallback source/citation panels instead of going blank.

That makes it easier to debug retrieval quality vs generation quality.

Error Handling and Safety Guarantees

Why AI Search Fails Predictably (and Recoverably)

The pipeline includes explicit guards at every step:

  • missing API key -> immediate actionable message
  • empty question -> prompt user input
  • empty index -> instruct rebuild
  • embedding request failure -> abort with network/model error detail
  • no top chunks -> explain retrieval miss
  • malformed LLM JSON -> parse-failed response, debug log written
  • schema mismatch -> strict validation failure, debug log written

Debug logging hooks (write_llm_debug_log) help inspect raw LLM output and distinguish:

  • transport/API errors
  • parse/JSON formatting failures
  • schema contract violations

This layered validation matters because AI Search combines multiple probabilistic systems, and each one needs to fail loudly enough for fast recovery.
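
The first three guards can be sketched as a single early-return check. check_query_preconditions is a hypothetical helper, and the message strings are placeholders rather than the app's real wording.

```python
def check_query_preconditions(api_key, question, chunk_count):
    """Return an error payload for the first failed guard, or None if all pass."""
    if not (api_key or "").strip():
        return {"success": False, "message": "Add an OpenRouter API key in Settings."}
    if not (question or "").strip():
        return {"success": False, "message": "Enter a question first."}
    if chunk_count <= 0:
        return {"success": False, "message": "AI index is empty. Rebuild it first."}
    return None

missing_key = check_query_preconditions("", "What is RAG?", 12)
empty_question = check_query_preconditions("sk-demo", "   ", 12)
empty_index = check_query_preconditions("sk-demo", "What is RAG?", 0)
all_clear = check_query_preconditions("sk-demo", "What is RAG?", 12)
```

Checking the cheapest conditions first means no embedding request is made when the call could never succeed.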

Performance Characteristics and Tuning Levers

Where Latency Comes From and What Can Be Tuned

Primary cost centers:

  1. Embedding generation during rebuild (network + model throughput)
  2. Vector similarity retrieval (fast with sqlite-vec, slower with Python fallback)
  3. Final chat completion for RAG answer

Main tuning levers visible in code:

  • embedding batch size (EMBED_BATCH_SIZE = 32)
  • max embed chars (MAX_EMBED_CHARS = 7500)
  • retrieval depth (top 10 chunks)
  • chat model from settings (openrouter_model)

Trade-off notes:

  • More retrieved chunks increase context coverage but can dilute relevance.
  • Smaller chunks improve precision but may lose narrative context.
  • sqlite-vec availability materially improves retrieval speed at larger archive sizes.
End-to-End AI Search Execution Trace

Full Flow: From Rebuild Click to Cited Answer

  1. User clicks Rebuild AI Search Index.
  2. Backend reads transcript cache rows and expands minute chunks.
  3. Chunks are embedded via OpenRouter embeddings API.
  4. Existing ai_search_chunks table is replaced with fresh vectors + metadata.
  5. User asks a natural-language question.
  6. Question is embedded into vector space.
  7. Top similar chunks are retrieved (sqlite-vec or Python cosine path).
  8. Ranked chunk set is passed to RAG prompt as numbered SOURCES.
  9. OpenRouter chat model returns structured JSON answer + citations.
  10. Backend validates and normalizes output.
  11. UI renders answer, sources, and citation mapping.

This is the core AI Search contract in MediaJots: semantic retrieve, grounded generate, validated cite.

How to Install MediaJots on Your Computer

Download and install PyCharm.

Unzip the project and open the folder in PyCharm.

Run the following command in the PyCharm terminal to install the dependencies:

pip install -r requirements.txt

Then start the app:

python app.py

Now follow the instructions in the “How to Use MediaJots” chapter.