Trax v1-v2 PRD: Personal Research Transcription Tool
🎯 Product Vision
"We're building a personal transcription tool that enables researchers to batch-process tech podcasts, academic lectures, and audiobooks by downloading media locally and running high-accuracy transcription, resulting in searchable, structured text content for study and research."
🏗️ System Architecture Overview
Core Components
- Data Layer: PostgreSQL with JSONB, SQLAlchemy registry pattern
- Business Logic: Protocol-based services, async/await throughout
- Interface Layer: CLI-first with Click, batch processing focus
- Integration Layer: Download-first architecture, curl-based YouTube metadata
System Boundaries
- What's In Scope: Local media processing, YouTube metadata extraction, batch transcription, JSON/TXT export
- What's Out of Scope: Real-time streaming, web UI, multi-user support, cloud processing
- Integration Points: Whisper API, DeepSeek API, FFmpeg, PostgreSQL, YouTube (curl)
👥 User Profile
Primary User: Personal Researcher
- Role: Individual researcher processing educational content
- Content Types: Tech podcasts, academic lectures, audiobooks
- Workflow: Batch URL collection → Download → Transcribe → Study
- Goals: High accuracy transcripts, fast processing, searchable content
- Constraints: Local storage, API costs, processing time
🔧 Functional Requirements
Feature 1: YouTube URL Processing
Purpose
Extract metadata from YouTube URLs using curl to avoid API complexity
User Stories
- As a researcher, I want to provide YouTube URLs, so that I can get video metadata and download links
- As a researcher, I want to batch process multiple URLs, so that I can queue up content for transcription
Acceptance Criteria
- Given a YouTube URL, When I run trax youtube <url>, Then I get title, channel, description, and duration
- Given a list of URLs, When I run trax batch-urls <file>, Then all metadata is extracted and stored
- Given invalid URLs, When I process them, Then clear error messages are shown
Input Validation Rules
- URL format: Must be valid YouTube URL - Error: "Invalid YouTube URL"
- URL accessibility: Must be publicly accessible - Error: "Video not accessible"
- Rate limiting: Max 10 URLs per minute - Error: "Rate limit exceeded"
Business Logic Rules
- Rule 1: Use curl with user-agent to avoid blocking
- Rule 2: Extract metadata using regex patterns targeting ytInitialPlayerResponse and ytInitialData objects
- Rule 3: Store metadata in PostgreSQL for future reference
- Rule 4: Generate unique filenames based on video ID and title
- Rule 5: Handle escaped characters in titles and descriptions using Perl regex patterns
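A minimal sketch of Rules 1 and 2, assuming the watch page is fetched by shelling out to curl from Python; the regex, field names, and helpers are illustrative rather than the final implementation:

```python
# Illustrative only: fetch the watch page with curl (Rule 1) and pull the fields
# the PRD stores out of the embedded ytInitialPlayerResponse object (Rule 2).
import json
import re
import subprocess

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"  # Rule 1: realistic UA

def fetch_watch_page(url: str) -> str:
    """Download the raw HTML of a YouTube watch page via curl."""
    result = subprocess.run(
        ["curl", "-sL", "-A", USER_AGENT, url],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout

def extract_metadata(html: str) -> dict:
    """Read title, channel, description, and duration from ytInitialPlayerResponse."""
    # Simplified pattern; the real extractor must tolerate escaped characters (Rule 5).
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;", html, re.DOTALL)
    if not match:
        raise ValueError("ytInitialPlayerResponse not found in page")
    details = json.loads(match.group(1)).get("videoDetails", {})
    return {
        "youtube_id": details.get("videoId"),
        "title": details.get("title"),
        "channel": details.get("author"),
        "description": details.get("shortDescription", ""),
        "duration_seconds": int(details.get("lengthSeconds", 0)),
    }
```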
Error Handling
- Network Error: Retry up to 3 times with exponential backoff
- Invalid URL: Skip and continue with remaining URLs
- Rate Limited: Wait 60 seconds before retrying
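The retry policy above can be a small helper shared by every network call; the delay schedule (1s, 2s, 4s) is an assumption within the stated three-attempt limit:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure, retry with exponentially growing delays (1s, 2s, 4s)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))
```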
Feature 2: Local Media Transcription (v1)
Purpose
High-accuracy transcription of downloaded media files using Whisper
User Stories
- As a researcher, I want to transcribe downloaded media, so that I can study the content
- As a researcher, I want batch processing, so that I can process multiple files efficiently
Acceptance Criteria
- Given a downloaded media file, When I run trax transcribe <file>, Then I get a 95%+ accuracy transcript in <30 seconds
- Given a folder of media files, When I run trax batch <folder>, Then all files are processed with progress tracking
- Given poor audio quality, When I transcribe, Then I get a quality warning with an accuracy estimate
Input Validation Rules
- File format: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- File size: ≤500MB - Error: "File too large, max 500MB"
- Audio duration: >0.1 seconds - Error: "File too short or silent"
Business Logic Rules
- Rule 1: Always download media before processing (no streaming)
- Rule 2: Convert audio to 16kHz mono WAV for Whisper
- Rule 3: Use distil-large-v3 model with M3 optimizations
- Rule 4: Store results in PostgreSQL with JSONB for transcripts
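A sketch of Rules 2 and 3, assuming FFmpeg is on PATH and the model runs through the faster-whisper package; the int8 compute type for Apple Silicon is an assumption:

```python
import subprocess
from faster_whisper import WhisperModel  # assumption: faster-whisper backend

def to_whisper_wav(src: str, dst: str) -> str:
    """Rule 2: convert any supported input to 16 kHz mono PCM WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst],
        check=True, capture_output=True,
    )
    return dst

def transcribe(wav_path: str) -> list[dict]:
    """Rule 3: run distil-large-v3; int8 keeps peak memory well under the 8 GB budget."""
    model = WhisperModel("distil-large-v3", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(wav_path)
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
```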
Error Handling
- Whisper Memory Error: Implement chunking for files >10 minutes
- Audio Quality: Warn user if estimated accuracy <80%
- Processing Failure: Save partial results, allow retry from last successful stage
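One way to implement the chunking fallback for long files is FFmpeg's segment muxer; the 10-minute window comes from the rule above, the rest is illustrative:

```python
import glob
import os
import subprocess

def split_into_chunks(wav_path: str, out_dir: str, chunk_seconds: int = 600) -> list[str]:
    """Split a long WAV into ~10-minute pieces so Whisper never holds the whole file."""
    os.makedirs(out_dir, exist_ok=True)
    pattern = os.path.join(out_dir, "chunk_%03d.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_path, "-f", "segment",
         "-segment_time", str(chunk_seconds), "-c", "copy", pattern],
        check=True, capture_output=True,
    )
    return sorted(glob.glob(os.path.join(out_dir, "chunk_*.wav")))
```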
Feature 3: AI Enhancement (v2)
Purpose
Improve transcript accuracy and readability using DeepSeek
User Stories
- As a researcher, I want enhanced transcripts, so that technical terms and punctuation are correct
- As a researcher, I want to compare original vs enhanced, so that I can verify improvements
Acceptance Criteria
- Given a v1 transcript, When I run enhancement, Then accuracy improves to ≥99%
- Given an enhanced transcript, When I compare to original, Then no content is lost
- Given enhancement fails, When I retry, Then original transcript is preserved
Input Validation Rules
- Transcript format: Must be valid JSON with segments - Error: "Invalid transcript format"
- Enhancement model: Must be available (DeepSeek API key) - Error: "Enhancement service unavailable"
Business Logic Rules
- Rule 1: Preserve timestamps and speaker markers during enhancement
- Rule 2: Use structured enhancement prompts for technical content
- Rule 3: Cache enhancement results for 7 days to reduce costs
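A sketch of Rules 2 and 3: a structured prompt that forbids content changes, plus a 7-day on-disk cache keyed by a hash of the transcript. The DeepSeek call itself is omitted; the prompt wording and cache layout are assumptions:

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path(".trax_cache/enhancements")
CACHE_TTL_SECONDS = 7 * 24 * 3600  # Rule 3: reuse results for 7 days

ENHANCEMENT_PROMPT = (
    "Correct punctuation, casing, and technical terminology in the transcript "
    "segments below. Do not add, remove, or reorder segments, and keep every "
    "timestamp and speaker marker exactly as given."  # Rules 1 and 2
)

def cached_enhancement(transcript_json: str) -> str | None:
    """Return a cached enhancement if one exists and is younger than the TTL."""
    key = hashlib.sha256(transcript_json.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists() and time.time() - path.stat().st_mtime < CACHE_TTL_SECONDS:
        return path.read_text()
    return None

def store_enhancement(transcript_json: str, enhanced_json: str) -> None:
    """Persist an enhancement so identical transcripts are not re-billed."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(transcript_json.encode()).hexdigest()
    (CACHE_DIR / f"{key}.json").write_text(enhanced_json)
```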
Error Handling
- API Rate Limit: Queue enhancement for later processing
- Enhancement Failure: Return original transcript with error flag
- Content Loss: Validate enhancement preserves original length ±5%
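The content-loss check reduces to a single comparison; the ±5% bound comes from the rule above, the word-count basis is an assumption:

```python
def preserves_content(original_text: str, enhanced_text: str, tolerance: float = 0.05) -> bool:
    """Flag enhancements whose word count drifts more than ±5% from the original."""
    original_words = len(original_text.split())
    enhanced_words = len(enhanced_text.split())
    if original_words == 0:
        return enhanced_words == 0
    return abs(enhanced_words - original_words) / original_words <= tolerance
```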
🖥️ User Interface Flows
Flow 1: YouTube URL Processing
Screen 1: URL Input
- Purpose: User provides YouTube URLs for processing
- Elements:
- URL input: Text input - Single URL or file path
- Batch option: Flag - --batch for multiple URLs
- Output format: Flag - --json, --txt for metadata export
- Actions:
- Enter: Process URL → Metadata display
- Ctrl+C: Cancel operation → Return to prompt
- Validation: URL format, accessibility, rate limits
- Error States: "Invalid URL", "Video not accessible", "Rate limit exceeded"
Screen 2: Metadata Display
- Purpose: Show extracted metadata and download options
- Elements:
- Video info: Text display - Title, channel, duration
- Download option: Flag - --download to save media
- Queue option: Flag - --queue for batch processing
- Actions:
- Download: Save media file → Download progress
- Queue: Add to batch queue → Queue confirmation
- Next: Process another URL → Return to input
Flow 2: Batch Transcription
Screen 1: Batch Command Input
- Purpose: User initiates batch transcription
- Elements:
- Directory path: Text input - Folder containing media files
- Pipeline version: Flag - --v1, --v2 (default: v1)
- Parallel workers: Flag - --workers (default: 8 for M3 MacBook)
- Quality threshold: Flag - --min-accuracy (default: 80%)
- Actions:
- Enter: Start batch processing → Progress tracking
- Preview: Show file list → File list display
- Validation: Directory exists, contains supported files
Screen 2: Batch Progress
- Purpose: Show real-time processing status
- Elements:
- Overall progress: Progress bar - Total batch completion
- Current file: Text display - Currently processing file
- Quality metrics: Text display - Accuracy estimates
- Queue status: Text display - Files remaining, completed, failed
- Actions:
- Continue: Automatic progression → Results summary
- Pause: Suspend processing → Resume option
- Validation: Updates every 5 seconds, shows quality warnings
Screen 3: Results Summary
- Purpose: Show batch processing results
- Elements:
- Success count: Text display - Files processed successfully
- Failure count: Text display - Files that failed
- Quality report: Text display - Average accuracy, warnings
- Export options: Buttons - JSON, TXT, SRT formats
- Actions:
- Export: Save all transcripts → Export confirmation
- Retry failed: Re-process failed files → Retry progress
- New batch: Start over → Return to input
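A hedged sketch of how Flow 2's entry command could map onto Click; the option names mirror the flags listed above, but the command surface is not final:

```python
import click

@click.group()
def trax() -> None:
    """Trax command-line entry point."""

@trax.command()
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
@click.option("--v1", "pipeline", flag_value="v1", default=True, help="Base pipeline (default).")
@click.option("--v2", "pipeline", flag_value="v2", help="Pipeline with AI enhancement.")
@click.option("--workers", default=8, show_default=True, help="Parallel workers (tuned for M3).")
@click.option("--min-accuracy", default=0.8, show_default=True, help="Warn below this accuracy estimate.")
def batch(directory: str, pipeline: str, workers: int, min_accuracy: float) -> None:
    """Transcribe every supported media file in DIRECTORY."""
    click.echo(f"Processing {directory} with pipeline {pipeline} using {workers} workers")

if __name__ == "__main__":
    trax()
```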
🔄 Data Flow & State Management
Data Models (PostgreSQL Schema)
YouTubeVideo
{
"id": "UUID (required, primary key)",
"youtube_id": "string (required, unique)",
"title": "string (required)",
"channel": "string (required)",
"description": "text (optional)",
"duration_seconds": "integer (required)",
"url": "string (required)",
"metadata_extracted_at": "timestamp (auto-generated)",
"created_at": "timestamp (auto-generated)"
}
MediaFile
{
"id": "UUID (required, primary key)",
"youtube_video_id": "UUID (optional, foreign key)",
"local_path": "string (required, file location)",
"media_type": "string (required, mp3, mp4, wav, etc.)",
"duration_seconds": "integer (optional)",
"file_size_bytes": "bigint (required)",
"download_status": "enum (pending, downloading, completed, failed)",
"created_at": "timestamp (auto-generated)",
"updated_at": "timestamp (auto-updated)"
}
Transcript
{
"id": "UUID (required, primary key)",
"media_file_id": "UUID (required, foreign key)",
"pipeline_version": "string (required, v1, v2)",
"raw_content": "JSONB (required, Whisper output)",
"enhanced_content": "JSONB (optional, AI enhanced)",
"text_content": "text (required, plain text for search)",
"model_used": "string (required, whisper model version)",
"processing_time_ms": "integer (required)",
"word_count": "integer (required)",
"accuracy_estimate": "float (optional, 0.0-1.0)",
"quality_warnings": "string array (optional)",
"processing_metadata": "JSONB (optional, version-specific data)",
"created_at": "timestamp (auto-generated)",
"enhanced_at": "timestamp (optional)",
"updated_at": "timestamp (auto-updated)"
}
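A sketch of how the Transcript schema above might be declared with SQLAlchemy 2.0 typed mappings and the PostgreSQL JSONB type; the column subset, table names, and use of DeclarativeBase (rather than the project's registry wiring) are illustrative:

```python
import uuid
from datetime import datetime

from sqlalchemy import ForeignKey, func
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Transcript(Base):
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    media_file_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("media_files.id"))
    pipeline_version: Mapped[str]                                  # "v1" or "v2"
    raw_content: Mapped[dict] = mapped_column(JSONB)               # Whisper output
    enhanced_content: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
    text_content: Mapped[str]                                      # plain text for search
    accuracy_estimate: Mapped[float | None]
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
```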
State Transitions
YouTubeVideo State Machine
[url_provided] → [metadata_extracting] → [metadata_complete]
[url_provided] → [metadata_extracting] → [metadata_failed]
MediaFile State Machine
[pending] → [downloading] → [completed]
[pending] → [downloading] → [failed] → [retry] → [downloading]
Transcript State Machine
[processing] → [completed]
[processing] → [failed] → [retry] → [processing]
[completed] → [enhancing] → [enhanced]
[enhanced] → [final]
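The transcript state machine above can be enforced with a small transition table; state names match the diagram, the guard helper is an assumption:

```python
TRANSCRIPT_TRANSITIONS: dict[str, set[str]] = {
    "processing": {"completed", "failed"},
    "failed": {"retry"},
    "retry": {"processing"},
    "completed": {"enhancing"},
    "enhancing": {"enhanced"},
    "enhanced": {"final"},
    "final": set(),
}

def advance(current: str, target: str) -> str:
    """Move a transcript to a new state, rejecting transitions not in the diagram."""
    if target not in TRANSCRIPT_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target
```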
Data Validation Rules
- Rule 1: File size must be ≤500MB for processing
- Rule 2: Audio duration must be >0.1 seconds (not silent)
- Rule 3: Transcript must contain at least one segment
- Rule 4: Processing time must be >0 and <3600 seconds (1 hour)
- Rule 5: YouTube ID must be unique in database
🧪 Testing Requirements
Unit Tests
- test_youtube_metadata_extractor: Extract metadata using curl and regex patterns
- test_media_downloader: Download from various sources
- test_audio_preprocessor: Convert to 16kHz mono WAV
- test_whisper_service: Basic transcription functionality
- test_enhancement_service: AI enhancement with DeepSeek
- test_batch_processor: Parallel file processing with error tracking
Integration Tests
- test_pipeline_v1: End-to-end v1 transcription
- test_pipeline_v2: End-to-end v2 with enhancement
- test_batch_processing: Process 10 files in parallel
- test_database_operations: PostgreSQL CRUD operations
- test_export_formats: JSON and TXT export functionality
Edge Cases
- Silent audio file: Should detect and report appropriately
- Corrupted media file: Should handle gracefully with clear error
- Network interruption during download: Should retry automatically
- Large file (>10 minutes): Should chunk automatically
- Memory pressure: Should handle gracefully with resource limits
- Poor audio quality: Should warn user about accuracy expectations
🚀 Implementation Phases
Phase 1: Core Foundation (Weeks 1-2)
Goal: Basic transcription working with CLI
- PostgreSQL database setup with JSONB - Schema created and tested
- YouTube metadata extraction with curl - Extract title, channel, description, duration using regex patterns
- Basic Whisper integration (v1) - 95% accuracy on test files
- Batch processing system - Handle 10+ files in parallel with error tracking
- CLI implementation with Click - All commands functional
- JSON/TXT export functionality - Both formats working
Phase 2: Enhancement (Week 3)
Goal: AI enhancement working reliably
- DeepSeek integration - API calls working with retry logic
- Enhancement templates - Structured prompts for technical content
- Progress tracking - Real-time updates in CLI
- Quality validation - Compare before/after accuracy
- Error recovery - Handle API failures gracefully
Phase 3: Roadmap - Multi-Pass Accuracy (v3)
Goal: Multi-pass accuracy improvements
- Multi-pass implementation - 3 passes with different parameters
- Confidence scoring - Per-segment confidence metrics
- Segment merging - Best segment selection algorithm
- Performance optimization - 3x speed improvement over v1
- Memory management - Handle large files efficiently
Phase 4: Roadmap - Speaker Diarization (v4)
Goal: Speaker diarization and scaling
- Speaker diarization - 90% speaker identification accuracy
- Voice embedding database - Speaker profile storage
- Caching layer - 50% cost reduction through caching
- API endpoints - REST API for integration
- Production deployment - Monitoring and logging
🔒 Security & Constraints
Security Requirements
- API Key Management: Secure storage of Whisper and DeepSeek API keys
- File Access: Local file system access only
- Data Protection: Encrypted storage for sensitive transcripts
- Input Sanitization: Validate all file paths and URLs
Performance Constraints
- Response Time: <30 seconds for 5-minute audio (v1)
- Throughput: Process 100+ files in batch
- Memory Usage: <8GB peak memory usage (M3 MacBook 16GB)
- Database Queries: <1 second for transcript retrieval
- Parallel Workers: 8 workers for optimal M3 performance
Technical Constraints
- File Formats: mp3, mp4, wav, m4a, webm only
- File Size: Maximum 500MB per file
- Audio Duration: Maximum 2 hours per file
- Network: Download-first, no streaming processing
- Storage: Local storage required, no cloud-only processing
- YouTube: Curl-based metadata extraction only
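These limits could be collected into one frozen configuration object that every service imports; the values are copied from this section, the structure is an assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TraxLimits:
    """Hard limits from the Performance and Technical Constraints sections."""
    supported_formats: tuple[str, ...] = ("mp3", "mp4", "wav", "m4a", "webm")
    max_file_size_bytes: int = 500 * 1024 * 1024   # 500 MB per file
    max_duration_seconds: int = 2 * 60 * 60        # 2 hours per file
    max_memory_bytes: int = 8 * 1024 ** 3          # 8 GB peak on a 16 GB M3
    parallel_workers: int = 8                      # optimal for M3
    v1_target_seconds: int = 30                    # budget for 5-minute audio

LIMITS = TraxLimits()
```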
✅ Definition of Done
Feature Complete
- All acceptance criteria met with real test files
- Unit tests passing with >80% coverage
- Integration tests passing with actual services
- Code review completed
- Documentation updated in rule files
- Performance benchmarks met
Ready for Deployment
- Performance targets achieved (speed, accuracy, memory)
- Security review completed
- Error handling tested with edge cases
- User acceptance testing with real files
- Rollback plan prepared for each version
- Monitoring and logging configured
Trax-Specific Criteria
- Follows protocol-based architecture
- Uses download-first approach (no streaming)
- Implements proper error handling with actionable messages
- Maintains backward compatibility across versions
- Uses real files in tests (no mocks)
- Follows established rule files and patterns
- Handles tech podcast and academic lecture content effectively
This PRD is specifically designed for a personal research tool focused on tech podcasts, academic lectures, and audiobooks, with clear v1-v2 implementation and v3-v4 roadmap.