# Trax v2.0 PRD: High-Performance Transcription with Speaker Diarization

## 🎯 Product Vision

*"We're building a high-performance personal transcription tool that delivers exceptional accuracy (99.5%+) and robust speaker diarization, enabling researchers to transform complex multi-speaker content into structured, searchable text with speaker identification."*

## 🏗️ System Architecture Overview

### Core Components

- **Data Layer**: PostgreSQL with JSONB, SQLAlchemy registry pattern (inherited from v1)
- **Business Logic**: Protocol-based services, async/await throughout, enhanced with multi-pass pipeline
- **Interface Layer**: CLI-first with Click, batch processing focus
- **Integration Layer**: Download-first architecture, curl-based YouTube metadata
- **AI Layer**: Multi-stage refinement pipeline, Pyannote.audio diarization, LoRA domain adaptation

### System Boundaries

- **What's In Scope**: Local media processing, high-accuracy transcription, speaker diarization, domain-specific models, CLI interface
- **What's Out of Scope**: Real-time streaming, cloud processing, multi-user support, distributed systems
- **Integration Points**: Whisper API, DeepSeek API, Pyannote.audio, FFmpeg, PostgreSQL, YouTube (curl)

## 👥 User Profile

### Primary User: Advanced Personal Researcher

- **Role**: Individual researcher processing complex educational content with multiple speakers
- **Content Types**: Tech podcasts, academic lectures, panel discussions, interviews, audiobooks
- **Workflow**: Batch URL collection → Download → High-accuracy transcription with diarization → Study
- **Goals**: 99.5%+ accuracy, speaker identification, fast processing, searchable content
- **Constraints**: Local storage, API costs, processing time, single-node architecture

## 🔧 Functional Requirements

### Feature 1: Multi-Pass Transcription Pipeline

#### Purpose

Achieve 99.5%+ accuracy through intelligent multi-stage processing.

#### User Stories

- **As a** researcher, **I want** ultra-high accuracy transcripts, **so that** I can rely on the content for detailed analysis
- **As a** researcher, **I want** fast processing despite high accuracy, **so that** I can process large batches efficiently

#### Acceptance Criteria

- [ ] **Given** a media file, **When** I run `trax transcribe --v2 <file>`, **Then** I get a 99.5%+ accuracy transcript in <25 seconds
- [ ] **Given** a multi-pass transcript, **When** I compare it to v1, **Then** accuracy improves by ≥4.5%
- [ ] **Given** a transcript with confidence scores, **When** I review it, **Then** I can identify low-confidence segments

#### Input Validation Rules

- **File format**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **File size**: ≤500MB - Error: "File too large, max 500MB"
- **Audio duration**: >0.1 seconds - Error: "File too short or silent"

#### Business Logic Rules

- **Rule 1**: First pass uses distil-small.en for speed (10-15 seconds)
- **Rule 2**: Second pass uses distil-large-v3 for accuracy refinement
- **Rule 3**: Third pass uses DeepSeek for context-aware enhancement
- **Rule 4**: Confidence scoring identifies segments needing refinement (see the sketch below)
- **Rule 5**: Parallel processing of independent pipeline stages

#### Error Handling

- **Memory Pressure**: Automatically reduce batch size and retry
- **Model Loading Failure**: Fall back to v1 pipeline with warning
- **Processing Failure**: Save partial results, allow retry from last successful stage
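Rules 1-4 describe a confidence-gated flow: draft fast, re-transcribe only weak spans, then enhance. The sketch below is a minimal illustration of that control flow; `run_whisper_pass` and `enhance_with_deepseek` are hypothetical stand-ins (stubbed here so the example runs), and the 0.85 threshold is an assumed default, not a Trax setting.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    text: str
    confidence: float  # 0.0-1.0, e.g. from average token log-probability

CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff for "needs refinement"

def run_whisper_pass(audio_path: str, model: str, clip=None) -> list[Segment]:
    """Placeholder: would invoke a Whisper-family model (distil-small.en
    or distil-large-v3) over the whole file or just the given clip."""
    return [Segment(0.0, 4.2, "example text", 0.91)]

def enhance_with_deepseek(segments: list[Segment]) -> list[Segment]:
    """Placeholder: would send the draft to DeepSeek for context-aware
    correction (Rule 3)."""
    return segments

def multi_pass_transcribe(audio_path: str) -> list[Segment]:
    # Pass 1: fast full draft with the small model (Rule 1).
    draft = run_whisper_pass(audio_path, model="distil-small.en")

    # Pass 2: re-run only low-confidence spans with the larger model (Rules 2 and 4).
    refined: list[Segment] = []
    for seg in draft:
        if seg.confidence < CONFIDENCE_THRESHOLD:
            refined.extend(run_whisper_pass(
                audio_path, model="distil-large-v3", clip=(seg.start, seg.end)))
        else:
            refined.append(seg)

    # Pass 3: context-aware text enhancement (Rule 3).
    return enhance_with_deepseek(refined)

print(multi_pass_transcribe("audio.mp3"))
```

Because only low-confidence spans reach the larger model, total latency stays close to the fast first pass on clean audio, which is how the pipeline can target both <25 seconds and 99.5%+ accuracy.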
### Feature 2: Speaker Diarization with Pyannote.audio

#### Purpose

Identify and label different speakers in multi-speaker content.

#### User Stories

- **As a** researcher, **I want** speaker identification, **so that** I can follow conversations and discussions
- **As a** researcher, **I want** accurate speaker labels, **so that** I can attribute quotes and ideas correctly

#### Acceptance Criteria

- [ ] **Given** a multi-speaker file, **When** I run diarization, **Then** I get 90%+ speaker identification accuracy
- [ ] **Given** a diarized transcript, **When** I view it, **Then** speaker labels are clearly marked and consistent
- [ ] **Given** a diarization failure, **When** I retry, **Then** the system provides clear error guidance

#### Input Validation Rules

- **Audio quality**: Must have detectable speech - Error: "No speech detected"
- **Speaker count**: Must have ≥2 speakers for diarization - Error: "Single speaker detected"
- **Audio duration**: ≥30 seconds for reliable diarization - Error: "Audio too short for diarization"

#### Business Logic Rules

- **Rule 1**: Run diarization in parallel with transcription
- **Rule 2**: Use Pyannote.audio with optimized parameters for speed
- **Rule 3**: Cache speaker embedding model to avoid reloading
- **Rule 4**: Merge diarization results with transcript timestamps (see the sketch below)
- **Rule 5**: Provide speaker count estimation before processing

#### Error Handling

- **Diarization Failure**: Continue with transcription only, mark as single speaker
- **Memory Issues**: Reduce audio chunk size and retry
- **Model Loading**: Provide clear instructions for HuggingFace token setup
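Business Logic Rule 4 (merging diarization results with transcript timestamps) reduces to assigning each transcript segment the speaker whose turn overlaps it most. A minimal sketch, assuming simplified data shapes rather than the raw Pyannote.audio or Whisper objects:

```python
from dataclasses import dataclass

@dataclass
class Turn:       # one diarization turn, e.g. adapted from Pyannote output
    start: float
    end: float
    speaker: str  # e.g. "SPEAKER_00"

@dataclass
class Segment:    # one transcript segment, e.g. adapted from Whisper output
    start: float
    end: float
    text: str

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments: list[Segment], turns: list[Turn]) -> list[tuple[str, Segment]]:
    labeled = []
    for seg in segments:
        # Pick the speaker whose turn overlaps this segment the most.
        best = max(turns, key=lambda t: overlap(seg.start, seg.end, t.start, t.end), default=None)
        if best is None or overlap(seg.start, seg.end, best.start, best.end) == 0.0:
            labeled.append(("UNKNOWN", seg))  # no overlapping turn (silence, music)
        else:
            labeled.append((best.speaker, seg))
    return labeled

# A segment spanning a turn boundary goes to the majority speaker.
turns = [Turn(0.0, 5.2, "SPEAKER_00"), Turn(5.2, 9.0, "SPEAKER_01")]
segments = [Segment(4.8, 6.1, "and that is the key point")]
print(assign_speakers(segments, turns))  # SPEAKER_01 wins (0.9s vs 0.4s overlap)
```

Segments with no overlapping turn fall back to `UNKNOWN` rather than failing, matching the error-handling stance above.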
### Feature 3: Domain-Specific Model Adaptation (LoRA)

#### Purpose

Improve accuracy for specific content domains using lightweight model adaptation.

#### User Stories

- **As a** researcher, **I want** domain-specific accuracy, **so that** technical terms and jargon are correctly transcribed
- **As a** researcher, **I want** flexible domain selection, **so that** I can optimize for different content types

#### Acceptance Criteria

- [ ] **Given** a technical podcast, **When** I use the technical domain, **Then** technical terms are more accurately transcribed
- [ ] **Given** a medical lecture, **When** I use the medical domain, **Then** medical terminology is correctly captured
- [ ] **Given** a domain model, **When** I switch domains, **Then** the system loads the appropriate LoRA adapter

#### Input Validation Rules

- **Domain selection**: Must be a valid domain (technical, medical, academic, general) - Error: "Invalid domain"
- **LoRA availability**: Domain model must be available - Error: "Domain model not available"

#### Business Logic Rules

- **Rule 1**: Load base Whisper model once, swap LoRA adapters as needed
- **Rule 2**: Cache LoRA adapters in memory for fast switching (see the sketch below)
- **Rule 3**: Provide domain auto-detection based on content analysis
- **Rule 4**: Allow custom domain training with user-provided data

#### Error Handling

- **LoRA Loading Failure**: Fall back to base model with warning
- **Domain Detection Failure**: Use general domain as default
- **Memory Issues**: Unload unused adapters automatically
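For Rules 1 and 2, the standard pattern with HuggingFace PEFT is one cached base model plus hot-swapped adapters. The sketch below assumes hypothetical `adapters/<domain>` directories holding PEFT-trained LoRA weights per domain; the fallback branch mirrors the error-handling rule above.

```python
from transformers import WhisperForConditionalGeneration
from peft import PeftModel

class DomainAdapterManager:
    """Caches one base Whisper model and swaps LoRA adapters per domain."""

    DOMAINS = {"technical", "medical", "academic", "general"}

    def __init__(self, base_model_id: str = "openai/whisper-small"):
        # Rule 1: load the base model exactly once.
        self._base = WhisperForConditionalGeneration.from_pretrained(base_model_id)
        self._peft = None               # PeftModel wrapper, created lazily
        self._loaded: set[str] = set()  # Rule 2: adapters stay cached once loaded

    def use_domain(self, domain: str):
        if domain not in self.DOMAINS:
            raise ValueError("Invalid domain")  # input validation rule above
        if domain == "general":
            return self._base                   # general = base model, no adapter
        try:
            if self._peft is None:
                self._peft = PeftModel.from_pretrained(
                    self._base, f"adapters/{domain}", adapter_name=domain)
                self._loaded.add(domain)
            elif domain not in self._loaded:
                self._peft.load_adapter(f"adapters/{domain}", adapter_name=domain)
                self._loaded.add(domain)
            self._peft.set_adapter(domain)
            return self._peft
        except (OSError, ValueError) as exc:
            # Error handling: fall back to the base model with a warning.
            print(f"warning: adapter for '{domain}' unavailable ({exc}); using base model")
            return self._base
```

Because adapters are a small fraction of the base model's size, keeping several resident satisfies the <5 second model-switching constraint without duplicating the base weights.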
"enhanced_at": "timestamp (optional)", "diarized_at": "timestamp (optional)", "updated_at": "timestamp (auto-updated)" } ``` #### SpeakerProfile (New for v2) ```json { "id": "UUID (required, primary key)", "transcript_id": "UUID (required, foreign key)", "speaker_id": "string (required, speaker label)", "embedding_vector": "JSONB (required, speaker embedding)", "speech_segments": "JSONB (required, time segments)", "total_duration": "float (required, seconds)", "word_count": "integer (required)", "confidence_score": "float (optional, 0.0-1.0)", "created_at": "timestamp (auto-generated)" } ``` #### ProcessingJob (New for v2) ```json { "id": "UUID (required, primary key)", "media_file_id": "UUID (required, foreign key)", "pipeline_config": "JSONB (required, processing parameters)", "status": "enum (queued, processing, completed, failed)", "current_stage": "string (optional, current pipeline stage)", "progress_percentage": "float (optional, 0.0-100.0)", "error_message": "text (optional)", "started_at": "timestamp (optional)", "completed_at": "timestamp (optional)", "created_at": "timestamp (auto-generated)", "updated_at": "timestamp (auto-updated)" } ``` ### Enhanced State Transitions #### ProcessingJob State Machine ``` [queued] โ†’ [processing] โ†’ [transcribing] โ†’ [enhancing] โ†’ [diarizing] โ†’ [merging] โ†’ [completed] [queued] โ†’ [processing] โ†’ [failed] โ†’ [retry] โ†’ [processing] ``` #### Transcript State Machine (Enhanced) ``` [processing] โ†’ [transcribed] โ†’ [enhanced] โ†’ [diarized] โ†’ [merged] โ†’ [completed] [processing] โ†’ [transcribed] โ†’ [enhanced] โ†’ [completed] (no diarization) [processing] โ†’ [failed] โ†’ [retry] โ†’ [processing] ``` ### Data Validation Rules - **Rule 1**: Processing time must be >0 and <1800 seconds (30 minutes) - **Rule 2**: Accuracy estimate must be between 0.0 and 1.0 - **Rule 3**: Speaker count must be โ‰ฅ1 if diarization is enabled - **Rule 4**: Confidence scores must be between 0.0 and 1.0 - **Rule 5**: Domain must be valid if specified ## ๐Ÿงช Testing Requirements ### Unit Tests - [ ] `test_multi_pass_pipeline`: Test all pipeline stages and transitions - [ ] `test_diarization_service`: Test Pyannote.audio integration - [ ] `test_lora_adapter_manager`: Test domain-specific model loading - [ ] `test_confidence_scoring`: Test confidence calculation and thresholding - [ ] `test_web_interface`: Test Flask/FastAPI endpoints - [ ] `test_parallel_processing`: Test concurrent pipeline execution ### Integration Tests - [ ] `test_pipeline_v2_complete`: End-to-end v2 transcription with diarization - [ ] `test_domain_adaptation`: Test LoRA adapter switching and accuracy - [ ] `test_batch_processing_v2`: Process 10 files with v2 pipeline - [ ] `test_cli_batch_processing`: Test CLI batch processing with multiple files - [ ] `test_performance_targets`: Verify <25 second processing time ### Edge Cases - [ ] Single speaker in multi-speaker file: Should handle gracefully - [ ] Poor audio quality with diarization: Should provide clear warnings - [ ] Memory pressure during processing: Should handle gracefully - [ ] LoRA adapter loading failure: Should fall back to base model - [ ] CLI progress reporting: Should show real-time updates - [ ] Large files with diarization: Should chunk appropriately ## ๐Ÿš€ Implementation Phases ### Phase 1: Multi-Pass Pipeline Foundation (Weeks 1-2) **Goal**: Implement core multi-pass transcription pipeline - [ ] Enhanced task system with pipeline stages - Support complex multi-stage workflows - [ ] ModelManager singleton for 
## 🔄 Data Flow & State Management

### Enhanced Data Models (PostgreSQL Schema)

#### Transcript (Enhanced for v2)

```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2, v2+)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "diarization_content": "JSONB (optional, Pyannote output)",
  "merged_content": "JSONB (required, final transcript with speakers)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "domain_used": "string (optional, technical, medical, etc.)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "confidence_scores": "JSONB (optional, per-segment confidence)",
  "speaker_count": "integer (optional, number of speakers detected)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "diarized_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```

#### SpeakerProfile (New for v2)

```json
{
  "id": "UUID (required, primary key)",
  "transcript_id": "UUID (required, foreign key)",
  "speaker_id": "string (required, speaker label)",
  "embedding_vector": "JSONB (required, speaker embedding)",
  "speech_segments": "JSONB (required, time segments)",
  "total_duration": "float (required, seconds)",
  "word_count": "integer (required)",
  "confidence_score": "float (optional, 0.0-1.0)",
  "created_at": "timestamp (auto-generated)"
}
```

#### ProcessingJob (New for v2)

```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_config": "JSONB (required, processing parameters)",
  "status": "enum (queued, processing, completed, failed)",
  "current_stage": "string (optional, current pipeline stage)",
  "progress_percentage": "float (optional, 0.0-100.0)",
  "error_message": "text (optional)",
  "started_at": "timestamp (optional)",
  "completed_at": "timestamp (optional)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```

### Enhanced State Transitions

#### ProcessingJob State Machine

```
[queued] → [processing] → [transcribing] → [enhancing] → [diarizing] → [merging] → [completed]
[queued] → [processing] → [failed] → [retry] → [processing]
```

#### Transcript State Machine (Enhanced)

```
[processing] → [transcribed] → [enhanced] → [diarized] → [merged] → [completed]
[processing] → [transcribed] → [enhanced] → [completed] (no diarization)
[processing] → [failed] → [retry] → [processing]
```

### Data Validation Rules

- **Rule 1**: Processing time must be >0 and <1800 seconds (30 minutes)
- **Rule 2**: Accuracy estimate must be between 0.0 and 1.0
- **Rule 3**: Speaker count must be ≥1 if diarization is enabled
- **Rule 4**: Confidence scores must be between 0.0 and 1.0
- **Rule 5**: Domain must be valid if specified
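One way to keep the ProcessingJob state machine honest is an explicit transition table, so illegal jumps fail fast. A minimal sketch follows; the state names match the diagram above, while the enforcement style is an assumption.

```python
# Allowed ProcessingJob transitions, taken from the diagram above.
ALLOWED = {
    "queued":       {"processing"},
    "processing":   {"transcribing", "failed"},
    "transcribing": {"enhancing", "failed"},
    "enhancing":    {"diarizing", "failed"},  # the Transcript machine also allows skipping ahead when diarization is off
    "diarizing":    {"merging", "failed"},
    "merging":      {"completed", "failed"},
    "failed":       {"retry"},
    "retry":        {"processing"},
    "completed":    set(),
}

def transition(current: str, nxt: str) -> str:
    """Advance a job's status, rejecting any move the diagram does not allow."""
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {nxt}")
    return nxt

# Walk the happy path end to end.
state = "queued"
for step in ("processing", "transcribing", "enhancing", "diarizing", "merging", "completed"):
    state = transition(state, step)
print(state)  # completed
```

Centralizing the table means retry logic, the CLI progress display, and the `status`/`current_stage` columns all agree on which states exist.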
## 🧪 Testing Requirements

### Unit Tests

- [ ] `test_multi_pass_pipeline`: Test all pipeline stages and transitions
- [ ] `test_diarization_service`: Test Pyannote.audio integration
- [ ] `test_lora_adapter_manager`: Test domain-specific model loading
- [ ] `test_confidence_scoring`: Test confidence calculation and thresholding
- [ ] `test_cli_interface`: Test Click command parsing and progress reporting
- [ ] `test_parallel_processing`: Test concurrent pipeline execution

### Integration Tests

- [ ] `test_pipeline_v2_complete`: End-to-end v2 transcription with diarization
- [ ] `test_domain_adaptation`: Test LoRA adapter switching and accuracy
- [ ] `test_batch_processing_v2`: Process 10 files with the v2 pipeline
- [ ] `test_cli_batch_processing`: Test CLI batch processing with multiple files
- [ ] `test_performance_targets`: Verify <25 second processing time

### Edge Cases

- [ ] Single speaker in a multi-speaker file: Should handle gracefully
- [ ] Poor audio quality with diarization: Should provide clear warnings
- [ ] Memory pressure during processing: Should handle gracefully
- [ ] LoRA adapter loading failure: Should fall back to base model
- [ ] CLI progress reporting: Should show real-time updates
- [ ] Large files with diarization: Should chunk appropriately

## 🚀 Implementation Phases

### Phase 1: Multi-Pass Pipeline Foundation (Weeks 1-2)

**Goal**: Implement core multi-pass transcription pipeline

- [ ] Enhanced task system with pipeline stages - Support complex multi-stage workflows
- [ ] ModelManager singleton for model caching - Prevent memory duplication
- [ ] Multi-pass implementation (fast + refinement + enhancement) - Achieve 99.5%+ accuracy
- [ ] Confidence scoring system - Identify low-confidence segments
- [ ] Performance optimization (8-bit quantization) - Reduce memory usage by 50%

### Phase 2: Speaker Diarization Integration (Weeks 3-4)

**Goal**: Integrate Pyannote.audio for speaker identification

- [ ] Pyannote.audio integration - 90%+ speaker identification accuracy
- [ ] Parallel diarization and transcription - Minimize total processing time
- [ ] Speaker embedding caching - Avoid model reloading
- [ ] Diarization-transcript merging - Combine timestamps and speaker labels
- [ ] Speaker profile storage - Track speakers across multiple files

### Phase 3: Domain Adaptation and LoRA (Weeks 5-6)

**Goal**: Implement domain-specific model adaptation

- [ ] LoRA adapter system - Lightweight domain-specific models
- [ ] Domain auto-detection - Automatic content analysis
- [ ] Pre-trained domain models - Technical, medical, academic domains
- [ ] Custom domain training - User-provided data support
- [ ] Domain switching optimization - Fast adapter loading

### Phase 4: Enhanced CLI Interface (Weeks 7-8)

**Goal**: Develop enhanced CLI interface with improved batch processing

- [ ] Enhanced progress reporting - Real-time stage updates
- [ ] Batch processing improvements - Configurable concurrency
- [ ] Detailed logging system - Configurable verbosity levels
- [ ] Performance monitoring - CPU/memory usage display
- [ ] Error handling improvements - Clear retry guidance

### Phase 5: Performance Optimization and Polish (Weeks 9-10)

**Goal**: Achieve performance targets and final polish

- [ ] Performance benchmarking - Verify <25 second processing time
- [ ] Memory optimization - Stay under 8GB peak usage
- [ ] Error handling refinement - Comprehensive error recovery
- [ ] Documentation and user guides - Complete documentation
- [ ] Final testing and validation - End-to-end testing

## 🔒 Security & Constraints

### Security Requirements

- **API Key Management**: Secure storage of all API keys (Whisper, DeepSeek, HuggingFace)
- **Local Access Only**: CLI interface only, no network exposure
- **File Access**: Local file system access only
- **Data Protection**: Encrypted storage for sensitive transcripts
- **Input Sanitization**: Validate all file paths, URLs, and user inputs

### Performance Constraints

- **Response Time**: <25 seconds for 5-minute audio (v2)
- **Accuracy Target**: 99.5%+ transcription accuracy
- **Diarization Accuracy**: 90%+ speaker identification accuracy
- **Memory Usage**: <8GB peak memory usage (M3 MacBook, 16GB)
- **Parallel Workers**: 8 workers for optimal M3 performance
- **Model Loading**: <5 seconds for model switching

### Technical Constraints

- **File Formats**: mp3, mp4, wav, m4a, webm only
- **File Size**: Maximum 500MB per file
- **Audio Duration**: Maximum 2 hours per file
- **Network**: Download-first, no streaming processing
- **Storage**: Local storage required, no cloud-only processing
- **Single Node**: No distributed processing, single-machine architecture
- **YouTube**: Curl-based metadata extraction only

## ✅ Definition of Done

### Feature Complete

- [ ] All acceptance criteria met with real test files
- [ ] 99.5%+ accuracy achieved on test dataset
- [ ] 90%+ speaker identification accuracy achieved
- [ ] <25 second processing time for 5-minute files
- [ ] Unit tests passing with >80% coverage
- [ ] Integration tests passing with actual services
- [ ] Code review completed
- [ ] Documentation updated in rule files
- [ ] Performance benchmarks met

### Ready for Deployment

- [ ] Performance targets achieved (speed, accuracy, memory)
- [ ] Security review completed
- [ ] Error handling tested with edge cases
- [ ] User acceptance testing with real files
- [ ] CLI interface tested and functional
- [ ] Rollback plan prepared for v2 deployment
- [ ] Monitoring and logging configured

### Trax v2-Specific Criteria

- [ ] Multi-pass pipeline delivers 99.5%+ accuracy
- [ ] Speaker diarization works reliably across content types
- [ ] Domain adaptation improves accuracy for specialized content
- [ ] CLI interface provides superior user experience
- [ ] Performance targets met without distributed architecture
- [ ] Memory usage optimized for single-node deployment
- [ ] Backward compatibility maintained with v1 features

---

*This PRD is specifically designed for Trax v2, focusing on high performance and speaker diarization as the core differentiators while maintaining the simplicity and determinism of the single-node architecture.*