# Trax v1-v2 PRD: Personal Research Transcription Tool

## 🎯 Product Vision

*"We're building a personal transcription tool that enables researchers to batch-process tech podcasts, academic lectures, and audiobooks by downloading media locally and running high-accuracy transcription, resulting in searchable, structured text content for study and research."*

## 🏗️ System Architecture Overview

### Core Components

- **Data Layer**: PostgreSQL with JSONB, SQLAlchemy registry pattern
- **Business Logic**: Protocol-based services, async/await throughout
- **Interface Layer**: CLI-first with Click, batch-processing focus
- **Integration Layer**: Download-first architecture, curl-based YouTube metadata

### System Boundaries

- **What's In Scope**: Local media processing, YouTube metadata extraction, batch transcription, JSON/TXT export
- **What's Out of Scope**: Real-time streaming, web UI, multi-user support, cloud processing
- **Integration Points**: Whisper API, DeepSeek API, FFmpeg, PostgreSQL, YouTube (curl)

## 👥 User Profile

### Primary User: Personal Researcher

- **Role**: Individual researcher processing educational content
- **Content Types**: Tech podcasts, academic lectures, audiobooks
- **Workflow**: Batch URL collection → Download → Transcribe → Study
- **Goals**: High-accuracy transcripts, fast processing, searchable content
- **Constraints**: Local storage, API costs, processing time

## 🔧 Functional Requirements

### Feature 1: YouTube URL Processing

#### Purpose

Extract metadata from YouTube URLs using curl, avoiding the complexity of the official API.

#### User Stories

- **As a** researcher, **I want** to provide YouTube URLs, **so that** I can get video metadata and download links
- **As a** researcher, **I want** to batch-process multiple URLs, **so that** I can queue up content for transcription

#### Acceptance Criteria

- [ ] **Given** a YouTube URL, **When** I run `trax youtube <url>`, **Then** I get title, channel, description, and duration
- [ ] **Given** a list of URLs, **When** I run `trax batch-urls <file>`, **Then** all metadata is extracted and stored
- [ ] **Given** invalid URLs, **When** I process them, **Then** clear error messages are shown

#### Input Validation Rules

- **URL format**: Must be a valid YouTube URL - Error: "Invalid YouTube URL"
- **URL accessibility**: Must be publicly accessible - Error: "Video not accessible"
- **Rate limiting**: Max 10 URLs per minute - Error: "Rate limit exceeded"

#### Business Logic Rules

- **Rule 1**: Use curl with a browser user-agent to avoid blocking
- **Rule 2**: Extract metadata using regex patterns targeting the `ytInitialPlayerResponse` and `ytInitialData` objects (sketched below, after the error-handling rules)
- **Rule 3**: Store metadata in PostgreSQL for future reference
- **Rule 4**: Generate unique filenames based on video ID and title
- **Rule 5**: Unescape characters in titles and descriptions using Perl-compatible regex patterns

#### Error Handling

- **Network Error**: Retry up to 3 times with exponential backoff
- **Invalid URL**: Skip and continue with remaining URLs
- **Rate Limited**: Wait 60 seconds before retrying
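To make Rules 1, 2, and 5 concrete, here is a minimal sketch of the curl-based extractor. The user-agent string, the exact regex, and the `videoDetails` field names are assumptions about YouTube's current page markup, not PRD requirements:

```python
import json
import re
import subprocess

# Assumed desktop browser user-agent (Rule 1); any current browser UA works.
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"

def fetch_watch_page(url: str) -> str:
    """Fetch the watch page with curl, following redirects (Rule 1)."""
    result = subprocess.run(
        ["curl", "-sL", "-A", USER_AGENT, url],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout

def extract_metadata(html: str) -> dict:
    """Regex out the embedded ytInitialPlayerResponse JSON (Rule 2).

    The pattern is deliberately simplified; a production version would
    need a more careful delimiter match than "first `};` after the
    assignment".
    """
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;", html, re.DOTALL)
    if not match:
        raise ValueError("Invalid YouTube URL")  # error message from the PRD
    details = json.loads(match.group(1)).get("videoDetails", {})
    return {
        "youtube_id": details.get("videoId"),
        "title": details.get("title"),
        "channel": details.get("author"),
        "description": details.get("shortDescription"),
        "duration_seconds": int(details.get("lengthSeconds", 0)),
    }
```

A real implementation would add Rule 5's unescaping and the retry/backoff behavior from the error-handling rules above.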
### Feature 2: Local Media Transcription (v1)

#### Purpose

High-accuracy transcription of downloaded media files using Whisper.

#### User Stories

- **As a** researcher, **I want** to transcribe downloaded media, **so that** I can study the content
- **As a** researcher, **I want** batch processing, **so that** I can process multiple files efficiently

#### Acceptance Criteria

- [ ] **Given** a downloaded media file, **When** I run `trax transcribe <file>`, **Then** I get a 95%+ accuracy transcript in <30 seconds for 5-minute audio (see Performance Constraints)
- [ ] **Given** a folder of media files, **When** I run `trax batch <directory>`, **Then** all files are processed with progress tracking
- [ ] **Given** poor audio quality, **When** I transcribe, **Then** I get a quality warning with an accuracy estimate

#### Input Validation Rules

- **File format**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **File size**: ≤500MB - Error: "File too large, max 500MB"
- **Audio duration**: >0.1 seconds - Error: "File too short or silent"

#### Business Logic Rules

- **Rule 1**: Always download media before processing (no streaming)
- **Rule 2**: Convert audio to 16kHz mono WAV for Whisper (sketched below)
- **Rule 3**: Use the distil-large-v3 model with M3 optimizations
- **Rule 4**: Store results in PostgreSQL with JSONB for transcripts

#### Error Handling

- **Whisper Memory Error**: Implement chunking for files >10 minutes
- **Audio Quality**: Warn the user if estimated accuracy is <80%
- **Processing Failure**: Save partial results; allow retry from the last successful stage
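Rule 2's preprocessing step is small enough to sketch in full. The ffmpeg flags are standard (`-ar` sets the sample rate, `-ac` the channel count); the function name and output naming convention are illustrative only:

```python
import subprocess
from pathlib import Path

def preprocess_for_whisper(src: Path, dst_dir: Path) -> Path:
    """Convert any supported media file to 16kHz mono WAV (Feature 2, Rule 2).

    Whisper expects 16kHz mono input; converting up front means the
    transcription stage never has to care about the source container.
    """
    dst = dst_dir / (src.stem + ".16k.wav")
    subprocess.run(
        [
            "ffmpeg",
            "-y",              # overwrite an existing output file
            "-i", str(src),    # input: mp3/mp4/wav/m4a/webm
            "-ar", "16000",    # resample to 16 kHz
            "-ac", "1",        # downmix to mono
            "-f", "wav",       # force WAV container
            str(dst),
        ],
        check=True,
        capture_output=True,   # keep ffmpeg's chatter out of CLI output
    )
    return dst
```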
### Feature 3: AI Enhancement (v2)

#### Purpose

Improve transcript accuracy and readability using DeepSeek.

#### User Stories

- **As a** researcher, **I want** enhanced transcripts, **so that** technical terms and punctuation are correct
- **As a** researcher, **I want** to compare original vs. enhanced, **so that** I can verify improvements

#### Acceptance Criteria

- [ ] **Given** a v1 transcript, **When** I run enhancement, **Then** accuracy improves to ≥99%
- [ ] **Given** an enhanced transcript, **When** I compare it to the original, **Then** no content is lost
- [ ] **Given** enhancement fails, **When** I retry, **Then** the original transcript is preserved

#### Input Validation Rules

- **Transcript format**: Must be valid JSON with segments - Error: "Invalid transcript format"
- **Enhancement model**: Must be available (DeepSeek API key) - Error: "Enhancement service unavailable"

#### Business Logic Rules

- **Rule 1**: Preserve timestamps and speaker markers during enhancement
- **Rule 2**: Use structured enhancement prompts for technical content (sketched below)
- **Rule 3**: Cache enhancement results for 7 days to reduce costs

#### Error Handling

- **API Rate Limit**: Queue enhancement for later processing
- **Enhancement Failure**: Return the original transcript with an error flag
- **Content Loss**: Validate that enhancement preserves the original length ±5%
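A minimal sketch of Rules 1-2, assuming DeepSeek's OpenAI-compatible chat endpoint; the model name, prompt wording, and placement of the ±5% guard are illustrative choices rather than PRD mandates:

```python
import json
import os

from openai import OpenAI  # DeepSeek exposes an OpenAI-compatible endpoint

# Assumed endpoint and model; both are configuration, not PRD requirements.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

SYSTEM_PROMPT = (
    "You fix transcription errors in technical transcripts. "
    "Correct punctuation, casing, and technical terms only. "
    "Return the same JSON segment list; do not add, drop, merge, "
    "or re-time segments."  # Rule 1: preserve timestamps/speakers
)

def enhance_segments(segments: list[dict]) -> list[dict]:
    """Send v1 segments for enhancement; fall back to the original on failure."""
    try:
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": json.dumps(segments)},
            ],
            temperature=0.1,  # low temperature: corrections, not rewrites
        )
        # A robust version would validate the model's JSON more defensively.
        enhanced = json.loads(response.choices[0].message.content)
        # Content-loss guard from Error Handling: total length within ±5%.
        orig_len = sum(len(s.get("text", "")) for s in segments)
        new_len = sum(len(s.get("text", "")) for s in enhanced)
        if not 0.95 * orig_len <= new_len <= 1.05 * orig_len:
            raise ValueError("enhancement changed content length beyond 5%")
        return enhanced
    except Exception:
        # Enhancement Failure: return the original; caller sets the error flag.
        return segments
```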
## 🖥️ User Interface Flows

### Flow 1: YouTube URL Processing

#### Screen 1: URL Input

- **Purpose**: User provides YouTube URLs for processing
- **Elements**:
  - URL input: Text input - Single URL or file path
  - Batch option: Flag - `--batch` for multiple URLs
  - Output format: Flag - `--json`, `--txt` for metadata export
- **Actions**:
  - Enter: Process URL → Metadata display
  - Ctrl+C: Cancel operation → Return to prompt
- **Validation**: URL format, accessibility, rate limits
- **Error States**: "Invalid URL", "Video not accessible", "Rate limit exceeded"

#### Screen 2: Metadata Display

- **Purpose**: Show extracted metadata and download options
- **Elements**:
  - Video info: Text display - Title, channel, duration
  - Download option: Flag - `--download` to save media
  - Queue option: Flag - `--queue` for batch processing
- **Actions**:
  - Download: Save media file → Download progress
  - Queue: Add to batch queue → Queue confirmation
  - Next: Process another URL → Return to input

### Flow 2: Batch Transcription

#### Screen 1: Batch Command Input

- **Purpose**: User initiates batch transcription
- **Elements**:
  - Directory path: Text input - Folder containing media files
  - Pipeline version: Flag - `--v1`, `--v2` (default: v1)
  - Parallel workers: Flag - `--workers` (default: 8 for M3 MacBook)
  - Quality threshold: Flag - `--min-accuracy` (default: 80%)
- **Actions**:
  - Enter: Start batch processing → Progress tracking
  - Preview: Show file list → File list display
- **Validation**: Directory exists, contains supported files

#### Screen 2: Batch Progress

- **Purpose**: Show real-time processing status
- **Elements**:
  - Overall progress: Progress bar - Total batch completion
  - Current file: Text display - Currently processing file
  - Quality metrics: Text display - Accuracy estimates
  - Queue status: Text display - Files remaining, completed, failed
- **Actions**:
  - Continue: Automatic progression → Results summary
  - Pause: Suspend processing → Resume option
- **Validation**: Updates every 5 seconds, shows quality warnings

#### Screen 3: Results Summary

- **Purpose**: Show batch processing results
- **Elements**:
  - Success count: Text display - Files processed successfully
  - Failure count: Text display - Files that failed
  - Quality report: Text display - Average accuracy, warnings
  - Export options: Buttons - JSON, TXT, SRT formats
- **Actions**:
  - Export: Save all transcripts → Export confirmation
  - Retry failed: Re-process failed files → Retry progress
  - New batch: Start over → Return to input

## 🔄 Data Flow & State Management

### Data Models (PostgreSQL Schema)

#### YouTubeVideo

```json
{
  "id": "UUID (required, primary key)",
  "youtube_id": "string (required, unique)",
  "title": "string (required)",
  "channel": "string (required)",
  "description": "text (optional)",
  "duration_seconds": "integer (required)",
  "url": "string (required)",
  "metadata_extracted_at": "timestamp (auto-generated)",
  "created_at": "timestamp (auto-generated)"
}
```

#### MediaFile

```json
{
  "id": "UUID (required, primary key)",
  "youtube_video_id": "UUID (optional, foreign key)",
  "local_path": "string (required, file location)",
  "media_type": "string (required, mp3, mp4, wav, etc.)",
  "duration_seconds": "integer (optional)",
  "file_size_bytes": "bigint (required)",
  "download_status": "enum (pending, downloading, completed, failed)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```

#### Transcript

```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```

### State Transitions

#### YouTubeVideo State Machine

```
[url_provided] → [metadata_extracting] → [metadata_complete]
[url_provided] → [metadata_extracting] → [metadata_failed]
```

#### MediaFile State Machine

```
[pending] → [downloading] → [completed]
[pending] → [downloading] → [failed] → [retry] → [downloading]
```

#### Transcript State Machine

```
[processing] → [completed]
[processing] → [failed] → [retry] → [processing]
[completed] → [enhancing] → [enhanced]
[enhanced] → [final]
```
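The transition tables above can be enforced in code with a small allowed-transitions map. A sketch for the MediaFile machine, where the names `DownloadStatus` and `transition` are hypothetical:

```python
from enum import Enum

class DownloadStatus(str, Enum):
    PENDING = "pending"
    DOWNLOADING = "downloading"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRY = "retry"

# Legal transitions, copied from the MediaFile state machine above.
_ALLOWED: dict[DownloadStatus, set[DownloadStatus]] = {
    DownloadStatus.PENDING: {DownloadStatus.DOWNLOADING},
    DownloadStatus.DOWNLOADING: {DownloadStatus.COMPLETED, DownloadStatus.FAILED},
    DownloadStatus.FAILED: {DownloadStatus.RETRY},
    DownloadStatus.RETRY: {DownloadStatus.DOWNLOADING},
    DownloadStatus.COMPLETED: set(),  # terminal state
}

def transition(current: DownloadStatus, target: DownloadStatus) -> DownloadStatus:
    """Move to `target` if the state machine allows it; raise otherwise."""
    if target not in _ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```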
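Likewise, the PostgreSQL schemas above map onto the SQLAlchemy registry pattern named in the architecture section. A trimmed sketch of the Transcript model in SQLAlchemy 2.0 style (column subset only; the `Base` class is standard boilerplate, and the foreign-key constraint is omitted because the companion models aren't shown):

```python
import uuid
from typing import Optional

from sqlalchemy import Float, Integer, String, Text
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Transcript(Base):
    """Trimmed Transcript model: JSONB holds structured Whisper output,
    with plain text stored alongside it for search."""
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    # FK to media_files.id in the full model.
    media_file_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), nullable=False)
    pipeline_version: Mapped[str] = mapped_column(String(8), nullable=False)  # "v1" or "v2"
    raw_content: Mapped[dict] = mapped_column(JSONB, nullable=False)          # Whisper output
    enhanced_content: Mapped[Optional[dict]] = mapped_column(JSONB, nullable=True)
    text_content: Mapped[str] = mapped_column(Text, nullable=False)           # plain text for search
    word_count: Mapped[int] = mapped_column(Integer, nullable=False)
    accuracy_estimate: Mapped[Optional[float]] = mapped_column(Float, nullable=True)
```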
### Data Validation Rules

- **Rule 1**: File size must be ≤500MB for processing
- **Rule 2**: Audio duration must be >0.1 seconds (not silent)
- **Rule 3**: Transcript must contain at least one segment
- **Rule 4**: Processing time must be >0 and <3600 seconds (1 hour)
- **Rule 5**: YouTube ID must be unique in the database

## 🧪 Testing Requirements

### Unit Tests

- [ ] `test_youtube_metadata_extractor`: Extract metadata using curl and regex patterns
- [ ] `test_media_downloader`: Download from various sources
- [ ] `test_audio_preprocessor`: Convert to 16kHz mono WAV
- [ ] `test_whisper_service`: Basic transcription functionality
- [ ] `test_enhancement_service`: AI enhancement with DeepSeek
- [ ] `test_batch_processor`: Parallel file processing with error tracking

### Integration Tests

- [ ] `test_pipeline_v1`: End-to-end v1 transcription
- [ ] `test_pipeline_v2`: End-to-end v2 with enhancement
- [ ] `test_batch_processing`: Process 10 files in parallel
- [ ] `test_database_operations`: PostgreSQL CRUD operations
- [ ] `test_export_formats`: JSON and TXT export functionality

### Edge Cases

- [ ] Silent audio file: Should detect and report appropriately
- [ ] Corrupted media file: Should handle gracefully with a clear error
- [ ] Network interruption during download: Should retry automatically
- [ ] Large file (>10 minutes): Should chunk automatically
- [ ] Memory pressure: Should handle gracefully with resource limits
- [ ] Poor audio quality: Should warn the user about accuracy expectations

## 🚀 Implementation Phases

### Phase 1: Core Foundation (Weeks 1-2)

**Goal**: Basic transcription working with CLI

- [ ] PostgreSQL database setup with JSONB - Schema created and tested
- [ ] YouTube metadata extraction with curl - Extract title, channel, description, duration using regex patterns
- [ ] Basic Whisper integration (v1) - 95% accuracy on test files
- [ ] Batch processing system - Handle 10+ files in parallel with error tracking
- [ ] CLI implementation with Click - All commands functional
- [ ] JSON/TXT export functionality - Both formats working

### Phase 2: Enhancement (Week 3)

**Goal**: AI enhancement working reliably

- [ ] DeepSeek integration - API calls working with retry logic
- [ ] Enhancement templates - Structured prompts for technical content
- [ ] Progress tracking - Real-time updates in CLI
- [ ] Quality validation - Compare before/after accuracy
- [ ] Error recovery - Handle API failures gracefully

### Phase 3: Roadmap - Multi-Pass Accuracy (v3)

**Goal**: Multi-pass accuracy improvements

- [ ] Multi-pass implementation - 3 passes with different parameters
- [ ] Confidence scoring - Per-segment confidence metrics
- [ ] Segment merging - Best-segment selection algorithm
- [ ] Performance optimization - 3x speed improvement over v1
- [ ] Memory management - Handle large files efficiently

### Phase 4: Roadmap - Speaker Diarization (v4)

**Goal**: Speaker diarization and scaling

- [ ] Speaker diarization - 90% speaker identification accuracy
- [ ] Voice embedding database - Speaker profile storage
- [ ] Caching layer - 50% cost reduction through caching
- [ ] API endpoints - REST API for integration
- [ ] Production deployment - Monitoring and logging

## 🔒 Security & Constraints

### Security Requirements

- **API Key Management**: Secure storage of Whisper and DeepSeek API keys
- **File Access**: Local file system access only
- **Data Protection**: Encrypted storage for sensitive transcripts
- **Input Sanitization**: Validate all file paths and URLs

### Performance Constraints

- **Response Time**: <30 seconds for 5-minute audio (v1)
- **Throughput**: Process 100+ files in a batch
- **Memory Usage**: <8GB peak memory usage (M3 MacBook, 16GB)
- **Database Queries**: <1 second for transcript retrieval
- **Parallel Workers**: 8 workers for optimal M3 performance

### Technical Constraints

- **File Formats**: mp3, mp4, wav, m4a, webm only
- **File Size**: Maximum 500MB per file
- **Audio Duration**: Maximum 2 hours per file
- **Network**: Download-first, no streaming processing
- **Storage**: Local storage required, no cloud-only processing
- **YouTube**: Curl-based metadata extraction only

## ✅ Definition of Done

### Feature Complete

- [ ] All acceptance criteria met with real test files
- [ ] Unit tests passing with >80% coverage
- [ ] Integration tests passing against actual services
- [ ] Code review completed
- [ ] Documentation updated in rule files
- [ ] Performance benchmarks met

### Ready for Deployment

- [ ] Performance targets achieved (speed, accuracy, memory)
- [ ] Security review completed
- [ ] Error handling tested with edge cases
- [ ] User acceptance testing with real files
- [ ] Rollback plan prepared for each version
- [ ] Monitoring and logging configured

### Trax-Specific Criteria

- [ ] Follows the protocol-based architecture
- [ ] Uses the download-first approach (no streaming)
- [ ] Implements proper error handling with actionable messages
- [ ] Maintains backward compatibility across versions
- [ ] Uses real files in tests (no mocks)
- [ ] Follows established rule files and patterns
- [ ] Handles tech podcast and academic lecture content effectively

---

*This PRD is specifically designed for a personal research tool focused on tech podcasts, academic lectures, and audiobooks, with a clear v1-v2 implementation and a v3-v4 roadmap.*