Trax v2.0 PRD: High-Performance Transcription with Speaker Diarization
🎯 Product Vision
"We're building a high-performance personal transcription tool that delivers exceptional accuracy (99.5%+) and robust speaker diarization, enabling researchers to transform complex multi-speaker content into structured, searchable text with speaker identification."
🏗️ System Architecture Overview
Core Components
- Data Layer: PostgreSQL with JSONB, SQLAlchemy registry pattern (inherited from v1)
- Business Logic: Protocol-based services, async/await throughout, enhanced with multi-pass pipeline
- Interface Layer: CLI-first with Click, batch processing focus
- Integration Layer: Download-first architecture, curl-based YouTube metadata
- AI Layer: Multi-stage refinement pipeline, Pyannote.audio diarization, LoRA domain adaptation
System Boundaries
- What's In Scope: Local media processing, high-accuracy transcription, speaker diarization, domain-specific models, CLI interface
- What's Out of Scope: Real-time streaming, cloud processing, multi-user support, distributed systems
- Integration Points: Whisper API, DeepSeek API, Pyannote.audio, FFmpeg, PostgreSQL, YouTube (curl)
👥 User Profile
Primary User: Advanced Personal Researcher
- Role: Individual researcher processing complex educational content with multiple speakers
- Content Types: Tech podcasts, academic lectures, panel discussions, interviews, audiobooks
- Workflow: Batch URL collection → Download → High-accuracy transcription with diarization → Study
- Goals: 99.5%+ accuracy, speaker identification, fast processing, searchable content
- Constraints: Local storage, API costs, processing time, single-node architecture
🔧 Functional Requirements
Feature 1: Multi-Pass Transcription Pipeline
Purpose
Achieve 99.5%+ accuracy through intelligent multi-stage processing
User Stories
- As a researcher, I want ultra-high accuracy transcripts, so that I can rely on the content for detailed analysis
- As a researcher, I want fast processing despite high accuracy, so that I can process large batches efficiently
Acceptance Criteria
- Given a media file, When I run trax transcribe --v2 <file>, Then I get a 99.5%+ accuracy transcript in <25 seconds
- Given a multi-pass transcript, When I compare it to v1 output, Then accuracy improves by ≥4.5 percentage points
- Given a transcript with confidence scores, When I review, Then I can identify low-confidence segments
Input Validation Rules
- File format: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- File size: ≤500MB - Error: "File too large, max 500MB"
- Audio duration: >0.1 seconds - Error: "File too short or silent"
Business Logic Rules
- Rule 1: First pass uses distil-small.en for speed (10-15 seconds)
- Rule 2: Second pass uses distil-large-v3 for accuracy refinement
- Rule 3: Third pass uses DeepSeek for context-aware enhancement
- Rule 4: Confidence scoring identifies segments needing refinement
- Rule 5: Parallel processing of independent pipeline stages
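A minimal sketch of how Rules 1-4 might compose, assuming faster-whisper exposes the distilled checkpoints named above. The confidence floor and the decision to re-run the whole file (rather than refining individual segments) are illustrative simplifications, and the DeepSeek pass is stubbed out:

```python
# Sketch only: faster-whisper is assumed; CONFIDENCE_FLOOR is an illustrative
# threshold, and the DeepSeek enhancement pass is left as a comment stub.
from faster_whisper import WhisperModel

CONFIDENCE_FLOOR = -0.35  # avg_logprob below this flags a segment (Rule 4)

def to_segments(result) -> list[dict]:
    segments, _info = result  # transcribe() returns (segment generator, info)
    return [{"start": s.start, "end": s.end, "text": s.text,
             "confidence": s.avg_logprob} for s in segments]

def multi_pass_transcribe(path: str) -> list[dict]:
    # Pass 1: fast draft with the small distilled model (Rule 1).
    draft = to_segments(
        WhisperModel("distil-small.en", compute_type="int8").transcribe(path))
    # Pass 2: refine with the large model only when confidence demands it (Rule 2).
    if any(s["confidence"] < CONFIDENCE_FLOOR for s in draft):
        draft = to_segments(
            WhisperModel("distil-large-v3", compute_type="int8").transcribe(path))
    # Pass 3 (Rule 3): DeepSeek context-aware enhancement would rewrite the
    # low-confidence text here; omitted because the API call is project-specific.
    return draft
```

In the real pipeline these models would come from the cached ModelManager described in Phase 1 rather than being constructed per call.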
Error Handling
- Memory Pressure: Automatically reduce batch size and retry
- Model Loading Failure: Fall back to v1 pipeline with warning
- Processing Failure: Save partial results, allow retry from last successful stage
Feature 2: Speaker Diarization with Pyannote.audio
Purpose
Identify and label different speakers in multi-speaker content
User Stories
- As a researcher, I want speaker identification, so that I can follow conversations and discussions
- As a researcher, I want accurate speaker labels, so that I can attribute quotes and ideas correctly
Acceptance Criteria
- Given a multi-speaker file, When I run diarization, Then I get 90%+ speaker identification accuracy
- Given a diarized transcript, When I view it, Then speaker labels are clearly marked and consistent
- Given a diarization failure, When I retry, Then the system provides clear error guidance
Input Validation Rules
- Audio quality: Must have detectable speech - Error: "No speech detected"
- Speaker count: Must have ≥2 speakers for diarization - Error: "Single speaker detected"
- Audio duration: ≥30 seconds for reliable diarization - Error: "Audio too short for diarization"
Business Logic Rules
- Rule 1: Run diarization in parallel with transcription
- Rule 2: Use Pyannote.audio with optimized parameters for speed
- Rule 3: Cache speaker embedding model to avoid reloading
- Rule 4: Merge diarization results with transcript timestamps
- Rule 5: Provide speaker count estimation before processing
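A sketch of Rules 2 and 4, assuming pyannote.audio 3.x and a HuggingFace token in HF_TOKEN with access to the gated diarization pipeline; the overlap-voting merge is one simple way to implement Rule 4:

```python
# Sketch only: pyannote.audio 3.x assumed; HF_TOKEN must grant access to the
# gated pipeline (see Error Handling below for token setup guidance).
import os
from pyannote.audio import Pipeline

def diarize(path: str) -> list[tuple[float, float, str]]:
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=os.environ["HF_TOKEN"])
    annotation = pipeline(path)
    # Flatten the annotation into (start, end, speaker) turns.
    return [(turn.start, turn.end, speaker)
            for turn, _, speaker in annotation.itertracks(yield_label=True)]

def assign_speakers(segments: list[dict], turns) -> list[dict]:
    """Rule 4: label each transcript segment with the speaker whose
    turns overlap it the most."""
    for seg in segments:
        overlaps: dict[str, float] = {}
        for start, end, speaker in turns:
            dur = min(seg["end"], end) - max(seg["start"], start)
            if dur > 0:
                overlaps[speaker] = overlaps.get(speaker, 0.0) + dur
        seg["speaker"] = max(overlaps, key=overlaps.get) if overlaps else None
    return segments
```

In the full pipeline, diarize() would run concurrently with transcription (Rule 1) and the loaded pipeline would be cached between files (Rule 3).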
Error Handling
- Diarization Failure: Continue with transcription only, mark as single speaker
- Memory Issues: Reduce audio chunk size and retry
- Model Loading: Provide clear instructions for HuggingFace token setup
Feature 3: Domain-Specific Model Adaptation (LoRA)
Purpose
Improve accuracy for specific content domains using lightweight model adaptation
User Stories
- As a researcher, I want domain-specific accuracy, so that technical terms and jargon are correctly transcribed
- As a researcher, I want flexible domain selection, so that I can optimize for different content types
Acceptance Criteria
- Given a technical podcast, When I use technical domain, Then technical terms are more accurately transcribed
- Given a medical lecture, When I use medical domain, Then medical terminology is correctly captured
- Given a domain model, When I switch domains, Then the system loads the appropriate LoRA adapter
Input Validation Rules
- Domain selection: Must be valid domain (technical, medical, academic, general) - Error: "Invalid domain"
- LoRA availability: Domain model must be available - Error: "Domain model not available"
Business Logic Rules
- Rule 1: Load base Whisper model once, swap LoRA adapters as needed
- Rule 2: Cache LoRA adapters in memory for fast switching
- Rule 3: Provide domain auto-detection based on content analysis
- Rule 4: Allow custom domain training with user-provided data
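One way Rules 1-2 could look with HuggingFace PEFT, assuming the adapters were trained as LoRA modules and stored at the hypothetical local paths below:

```python
# Sketch only: transformers + peft assumed; ADAPTERS paths are hypothetical.
from transformers import WhisperForConditionalGeneration
from peft import PeftModel

ADAPTERS = {
    "technical": "adapters/technical",
    "medical": "adapters/medical",
    "academic": "adapters/academic",
}

class LoRAManager:
    def __init__(self, base_id: str = "openai/whisper-large-v3"):
        # Rule 1: load the base Whisper model exactly once.
        self._base = WhisperForConditionalGeneration.from_pretrained(base_id)
        self._model: PeftModel | None = None
        self._loaded: set[str] = set()

    def use_domain(self, domain: str):
        if domain != "general" and domain not in ADAPTERS:
            raise ValueError("Invalid domain")  # mirrors the validation rule
        if domain == "general":
            if self._model is not None:
                self._model.disable_adapter_layers()  # base behaviour
            return self._model or self._base
        if self._model is None:
            self._model = PeftModel.from_pretrained(
                self._base, ADAPTERS[domain], adapter_name=domain)
            self._loaded.add(domain)
        elif domain not in self._loaded:
            # Rule 2: keep previously loaded adapters cached in memory.
            self._model.load_adapter(ADAPTERS[domain], adapter_name=domain)
            self._loaded.add(domain)
        self._model.enable_adapter_layers()
        self._model.set_adapter(domain)
        return self._model
```

Evicting unused adapters under memory pressure (see Error Handling below) would hang off the cached _loaded set.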
Error Handling
- LoRA Loading Failure: Fall back to base model with warning
- Domain Detection Failure: Use general domain as default
- Memory Issues: Unload unused adapters automatically
Feature 4: Enhanced CLI Interface
Purpose
Provide an enhanced command-line interface with improved batch processing and progress reporting
User Stories
- As a researcher, I want enhanced CLI progress reporting, so that I can monitor long-running jobs effectively
- As a researcher, I want improved batch processing, so that I can efficiently process multiple files
Acceptance Criteria
- Given a batch of files, When I run batch processing, Then I see real-time progress for each file
- Given a processing job, When I monitor progress, Then I see detailed stage information and performance metrics
- Given a completed transcript, When I view it, Then I can see speaker labels and confidence scores in the output
Input Validation Rules
- File processing: Max 500MB per file - Error: "File too large"
- File types: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- Batch size: Max 50 files per batch - Error: "Batch too large"
Business Logic Rules
- Rule 1: Real-time progress updates via CLI output
- Rule 2: Batch processing with configurable concurrency
- Rule 3: Detailed logging with configurable verbosity
- Rule 4: Processing jobs use same pipeline as single files
- Rule 5: Transcript output includes speaker diarization information
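A minimal sketch of Rules 1-2, with process_file standing in for the project's per-file v2 pipeline:

```python
# Sketch only: process_file is a stub for the real transcribe/diarize pipeline.
import asyncio

async def process_file(path: str) -> None:
    await asyncio.sleep(0)  # stand-in for the actual v2 pipeline stages

async def run_batch(paths: list[str], workers: int = 4) -> list[str]:
    sem = asyncio.Semaphore(workers)  # configurable concurrency (Rule 2)
    failed: list[str] = []

    async def worker(path: str) -> None:
        async with sem:
            try:
                await process_file(path)
                print(f"done  {path}")    # Rule 1: real-time CLI progress
            except Exception as exc:      # batch continues past failures
                failed.append(path)
                print(f"FAIL  {path}: {exc}")

    await asyncio.gather(*(worker(p) for p in paths))
    return failed  # reported at the end, per the Error Handling rules
```

Failures are collected rather than aborting the batch, matching the batch failure rule below.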
Error Handling
- Processing Failure: Clear error message with retry guidance
- Batch Failure: Continue with remaining files, report failures
- Memory Issues: Automatic batch size reduction with warning
💻 CLI Interface Flows
Flow 1: High-Performance Transcription
Command: Single File Processing
```bash
# Basic v2 transcription
trax transcribe --v2 audio.mp3

# With diarization
trax transcribe --v2 --diarize audio.mp3

# With domain-specific model
trax transcribe --v2 --domain technical audio.mp3

# With custom quality threshold
trax transcribe --v2 --accuracy 0.995 audio.mp3
```
Progress Reporting
- Real-time progress: Stage-by-stage progress with time estimates
- Performance metrics: CPU usage, memory usage, processing speed
- Quality indicators: Confidence scores, accuracy estimates
- Error reporting: Clear error messages with retry guidance
Flow 2: Batch Processing with Diarization
Command: Batch Processing
```bash
# Process directory of files
trax batch --v2 --diarize /path/to/media/files/

# With parallel processing
trax batch --v2 --workers 4 --diarize /path/to/media/files/

# With domain detection
trax batch --v2 --auto-domain --diarize /path/to/media/files/
```
Batch Progress Reporting
- Overall progress: Total batch completion percentage
- Current file: Currently processing file with stage
- Diarization status: Speaker count, processing stage
- Queue status: Files remaining, completed, failed
- Performance metrics: Average processing time, accuracy
🔄 Data Flow & State Management
Enhanced Data Models (PostgreSQL Schema)
Transcript (Enhanced for v2)
```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2, v2+)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "diarization_content": "JSONB (optional, Pyannote output)",
  "merged_content": "JSONB (required, final transcript with speakers)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "domain_used": "string (optional, technical, medical, etc.)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "confidence_scores": "JSONB (optional, per-segment confidence)",
  "speaker_count": "integer (optional, number of speakers detected)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "diarized_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```
SpeakerProfile (New for v2)
```json
{
  "id": "UUID (required, primary key)",
  "transcript_id": "UUID (required, foreign key)",
  "speaker_id": "string (required, speaker label)",
  "embedding_vector": "JSONB (required, speaker embedding)",
  "speech_segments": "JSONB (required, time segments)",
  "total_duration": "float (required, seconds)",
  "word_count": "integer (required)",
  "confidence_score": "float (optional, 0.0-1.0)",
  "created_at": "timestamp (auto-generated)"
}
```
ProcessingJob (New for v2)
```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_config": "JSONB (required, processing parameters)",
  "status": "enum (queued, processing, completed, failed)",
  "current_stage": "string (optional, current pipeline stage)",
  "progress_percentage": "float (optional, 0.0-100.0)",
  "error_message": "text (optional)",
  "started_at": "timestamp (optional)",
  "completed_at": "timestamp (optional)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```
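As one example of how these schemas map onto the v1-inherited SQLAlchemy registry pattern, a trimmed sketch of the ProcessingJob model (timestamp columns beyond created_at omitted for brevity; table and column names are assumptions consistent with the schema above):

```python
# Sketch only: SQLAlchemy 2.x registry pattern assumed, as inherited from v1.
import enum
import uuid
from datetime import datetime
from sqlalchemy import Enum, Float, ForeignKey, String, Text, func
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column, registry

mapper_registry = registry()

class JobStatus(enum.Enum):
    queued = "queued"
    processing = "processing"
    completed = "completed"
    failed = "failed"

@mapper_registry.mapped
class ProcessingJob:
    __tablename__ = "processing_jobs"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    media_file_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("media_files.id"), nullable=False)
    pipeline_config: Mapped[dict] = mapped_column(JSONB, nullable=False)
    status: Mapped[JobStatus] = mapped_column(
        Enum(JobStatus), nullable=False, default=JobStatus.queued)
    current_stage: Mapped[str | None] = mapped_column(String)
    progress_percentage: Mapped[float | None] = mapped_column(Float)
    error_message: Mapped[str | None] = mapped_column(Text)
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
```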
Enhanced State Transitions
ProcessingJob State Machine
[queued] → [processing] → [transcribing] → [enhancing] → [diarizing] → [merging] → [completed]
[queued] → [processing] → [failed] → [retry] → [processing]
Transcript State Machine (Enhanced)
[processing] → [transcribed] → [enhanced] → [diarized] → [merged] → [completed]
[processing] → [transcribed] → [enhanced] → [completed] (no diarization)
[processing] → [failed] → [retry] → [processing]
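A dict-based transition guard is enough to enforce these machines on a single node. This hypothetical helper covers the job machine, with enhancing → merging added for the no-diarization path shown in the transcript machine:

```python
# Sketch only: stage names copied from the diagrams above; the
# enhancing -> merging edge covers the no-diarization path.
ALLOWED = {
    "queued": {"processing"},
    "processing": {"transcribing", "failed"},
    "transcribing": {"enhancing", "failed"},
    "enhancing": {"diarizing", "merging", "failed"},
    "diarizing": {"merging", "failed"},
    "merging": {"completed", "failed"},
    "failed": {"retry"},
    "retry": {"processing"},
}

def transition(current: str, new: str) -> str:
    # Reject any edge not present in the state machine.
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```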
Data Validation Rules
- Rule 1: Processing time must be >0 and <1800 seconds (30 minutes)
- Rule 2: Accuracy estimate must be between 0.0 and 1.0
- Rule 3: Speaker count must be ≥1 if diarization is enabled
- Rule 4: Confidence scores must be between 0.0 and 1.0
- Rule 5: Domain must be valid if specified
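A sketch of Rules 1-5 as a single validator (hypothetical helper; note that Rule 1's 30-minute bound applies to the processing_time_ms field, hence the millisecond constant):

```python
# Sketch only: VALID_DOMAINS mirrors the domain list in Feature 3.
VALID_DOMAINS = {"technical", "medical", "academic", "general"}

def validate_transcript(t: dict, diarization_enabled: bool) -> list[str]:
    errors: list[str] = []
    if not 0 < t["processing_time_ms"] < 1_800_000:            # Rule 1 (30 min)
        errors.append("processing_time_ms out of range")
    est = t.get("accuracy_estimate")
    if est is not None and not 0.0 <= est <= 1.0:              # Rule 2
        errors.append("accuracy_estimate out of range")
    if diarization_enabled and t.get("speaker_count", 0) < 1:  # Rule 3
        errors.append("speaker_count must be >= 1")
    for score in (t.get("confidence_scores") or {}).values():  # Rule 4
        if not 0.0 <= score <= 1.0:
            errors.append("confidence score out of range")
            break
    domain = t.get("domain_used")
    if domain is not None and domain not in VALID_DOMAINS:     # Rule 5
        errors.append("invalid domain")
    return errors
```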
🧪 Testing Requirements
Unit Tests
- test_multi_pass_pipeline: Test all pipeline stages and transitions
- test_diarization_service: Test Pyannote.audio integration
- test_lora_adapter_manager: Test domain-specific model loading
- test_confidence_scoring: Test confidence calculation and thresholding
- test_cli_interface: Test CLI commands and progress reporting
- test_parallel_processing: Test concurrent pipeline execution
Integration Tests
- test_pipeline_v2_complete: End-to-end v2 transcription with diarization
- test_domain_adaptation: Test LoRA adapter switching and accuracy
- test_batch_processing_v2: Process 10 files with v2 pipeline
- test_cli_batch_processing: Test CLI batch processing with multiple files
- test_performance_targets: Verify <25 second processing time
Edge Cases
- Single speaker in multi-speaker file: Should handle gracefully
- Poor audio quality with diarization: Should provide clear warnings
- Memory pressure during processing: Should handle gracefully
- LoRA adapter loading failure: Should fall back to base model
- CLI progress reporting: Should show real-time updates
- Large files with diarization: Should chunk appropriately
🚀 Implementation Phases
Phase 1: Multi-Pass Pipeline Foundation (Weeks 1-2)
Goal: Implement core multi-pass transcription pipeline
- Enhanced task system with pipeline stages - Support complex multi-stage workflows
- ModelManager singleton for model caching - Prevent memory duplication
- Multi-pass implementation (fast + refinement + enhancement) - Achieve 99.5%+ accuracy
- Confidence scoring system - Identify low-confidence segments
- Performance optimization (8-bit quantization) - Reduce memory usage by 50%
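A sketch of the ModelManager singleton from this phase, assuming faster-whisper's compute_type="int8" as the 8-bit quantization mechanism; the 50% memory figure is the target, not a guarantee of this snippet:

```python
# Sketch only: faster-whisper assumed; int8 compute type stands in for the
# 8-bit quantization deliverable above.
from faster_whisper import WhisperModel

class ModelManager:
    _instance = None

    def __new__(cls):
        # Classic singleton: one shared cache of loaded models per process.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get(self, name: str) -> WhisperModel:
        # Load each checkpoint once and reuse it (prevents memory duplication).
        if name not in self._models:
            self._models[name] = WhisperModel(name, compute_type="int8")
        return self._models[name]
```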
Phase 2: Speaker Diarization Integration (Weeks 3-4)
Goal: Integrate Pyannote.audio for speaker identification
- Pyannote.audio integration - 90%+ speaker identification accuracy
- Parallel diarization and transcription - Minimize total processing time
- Speaker embedding caching - Avoid model reloading
- Diarization-transcript merging - Combine timestamps and speaker labels
- Speaker profile storage - Track speakers across multiple files
Phase 3: Domain Adaptation and LoRA (Weeks 5-6)
Goal: Implement domain-specific model adaptation
- LoRA adapter system - Lightweight domain-specific models
- Domain auto-detection - Automatic content analysis
- Pre-trained domain models - Technical, medical, academic domains
- Custom domain training - User-provided data support
- Domain switching optimization - Fast adapter loading
Phase 4: Enhanced CLI Interface (Weeks 7-8)
Goal: Develop enhanced CLI interface with improved batch processing
- Enhanced progress reporting - Real-time stage updates
- Batch processing improvements - Configurable concurrency
- Detailed logging system - Configurable verbosity levels
- Performance monitoring - CPU/memory usage display
- Error handling improvements - Clear retry guidance
Phase 5: Performance Optimization and Polish (Weeks 9-10)
Goal: Achieve performance targets and final polish
- Performance benchmarking - Verify <25 second processing time
- Memory optimization - Stay under 8GB peak usage
- Error handling refinement - Comprehensive error recovery
- Documentation and user guides - Complete documentation
- Final testing and validation - End-to-end testing
🔒 Security & Constraints
Security Requirements
- API Key Management: Secure storage of all API keys (Whisper, DeepSeek, HuggingFace)
- Local Access Only: CLI interface only, no network exposure
- File Access: Local file system access only
- Data Protection: Encrypted storage for sensitive transcripts
- Input Sanitization: Validate all file paths, URLs, and user inputs
Performance Constraints
- Response Time: <25 seconds for 5-minute audio (v2)
- Accuracy Target: 99.5%+ transcription accuracy
- Diarization Accuracy: 90%+ speaker identification accuracy
- Memory Usage: <8GB peak memory usage (M3 MacBook 16GB)
- Parallel Workers: 8 workers for optimal M3 performance
- Model Loading: <5 seconds for model switching
Technical Constraints
- File Formats: mp3, mp4, wav, m4a, webm only
- File Size: Maximum 500MB per file
- Audio Duration: Maximum 2 hours per file
- Network: Download-first, no streaming processing
- Storage: Local storage required, no cloud-only processing
- Single Node: No distributed processing, single-machine architecture
- YouTube: Curl-based metadata extraction only
✅ Definition of Done
Feature Complete
- All acceptance criteria met with real test files
- 99.5%+ accuracy achieved on test dataset
- 90%+ speaker identification accuracy achieved
- <25 second processing time for 5-minute files
- Unit tests passing with >80% coverage
- Integration tests passing with actual services
- Code review completed
- Documentation updated in rule files
- Performance benchmarks met
Ready for Deployment
- Performance targets achieved (speed, accuracy, memory)
- Security review completed
- Error handling tested with edge cases
- User acceptance testing with real files
- CLI interface tested and functional
- Rollback plan prepared for v2 deployment
- Monitoring and logging configured
Trax v2-Specific Criteria
- Multi-pass pipeline delivers 99.5%+ accuracy
- Speaker diarization works reliably across content types
- Domain adaptation improves accuracy for specialized content
- CLI interface provides superior user experience
- Performance targets met without distributed architecture
- Memory usage optimized for single-node deployment
- Backward compatibility maintained with v1 features
This PRD is specifically designed for Trax v2, focusing on high performance and speaker diarization as the core differentiators while maintaining the simplicity and determinism of the single-node architecture.