# Trax v2.0 PRD: High-Performance Transcription with Speaker Diarization
## 🎯 Product Vision
*"We're building a high-performance personal transcription tool that delivers exceptional accuracy (99.5%+) and robust speaker diarization, enabling researchers to transform complex multi-speaker content into structured, searchable text with speaker identification."*
## 🏗️ System Architecture Overview
### Core Components
- **Data Layer**: PostgreSQL with JSONB, SQLAlchemy registry pattern (inherited from v1)
- **Business Logic**: Protocol-based services, async/await throughout, enhanced with multi-pass pipeline
- **Interface Layer**: CLI-first with Click, batch processing focus
- **Integration Layer**: Download-first architecture, curl-based YouTube metadata
- **AI Layer**: Multi-stage refinement pipeline, Pyannote.audio diarization, LoRA domain adaptation
### System Boundaries
- **What's In Scope**: Local media processing, high-accuracy transcription, speaker diarization, domain-specific models, CLI interface
- **What's Out of Scope**: Real-time streaming, cloud processing, multi-user support, distributed systems
- **Integration Points**: Whisper API, DeepSeek API, Pyannote.audio, FFmpeg, PostgreSQL, YouTube (curl)
## 👥 User Profile
### Primary User: Advanced Personal Researcher
- **Role**: Individual researcher processing complex educational content with multiple speakers
- **Content Types**: Tech podcasts, academic lectures, panel discussions, interviews, audiobooks
- **Workflow**: Batch URL collection → Download → High-accuracy transcription with diarization → Study
- **Goals**: 99.5%+ accuracy, speaker identification, fast processing, searchable content
- **Constraints**: Local storage, API costs, processing time, single-node architecture
## 🔧 Functional Requirements
### Feature 1: Multi-Pass Transcription Pipeline
#### Purpose
Achieve 99.5%+ accuracy through intelligent multi-stage processing
#### User Stories
- **As a** researcher, **I want** ultra-high accuracy transcripts, **so that** I can rely on the content for detailed analysis
- **As a** researcher, **I want** fast processing despite high accuracy, **so that** I can process large batches efficiently
#### Acceptance Criteria
- [ ] **Given** a media file, **When** I run `trax transcribe --v2 <file>`, **Then** I get a transcript with 99.5%+ accuracy in <25 seconds for a 5-minute file
- [ ] **Given** a multi-pass transcript, **When** I compare it to v1 output, **Then** accuracy improves by at least 4.5 percentage points
- [ ] **Given** a transcript with confidence scores, **When** I review, **Then** I can identify low-confidence segments
#### Input Validation Rules
- **File format**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **File size**: ≤500MB - Error: "File too large, max 500MB"
- **Audio duration**: >0.1 seconds - Error: "File too short or silent"
#### Business Logic Rules
- **Rule 1**: First pass uses distil-small.en for speed (10-15 seconds)
- **Rule 2**: Second pass uses distil-large-v3 for accuracy refinement
- **Rule 3**: Third pass uses DeepSeek for context-aware enhancement
- **Rule 4**: Confidence scoring identifies segments needing refinement (see the sketch after this list)
- **Rule 5**: Parallel processing of independent pipeline stages
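A minimal sketch of Rules 1-4, assuming faster-whisper for the first two passes; the confidence proxy, the threshold, and the DeepSeek stage are placeholders rather than the actual Trax implementation:
```python
import math

from faster_whisper import WhisperModel

CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff for "needs refinement"

def run_pass(model_name: str, audio_path: str) -> list[dict]:
    # int8 matches the 8-bit quantization goal in Phase 1
    model = WhisperModel(model_name, compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return [
        {
            "start": seg.start,
            "end": seg.end,
            "text": seg.text,
            # exp(avg_logprob) as a rough per-segment confidence proxy (Rule 4)
            "confidence": math.exp(seg.avg_logprob),
        }
        for seg in segments
    ]

def enhance_with_deepseek(segments: list[dict]) -> list[dict]:
    """Placeholder for the context-aware DeepSeek pass (Rule 3)."""
    return segments

def transcribe_multipass(audio_path: str) -> list[dict]:
    draft = run_pass("distil-small.en", audio_path)      # Rule 1: fast draft
    if any(s["confidence"] < CONFIDENCE_THRESHOLD for s in draft):
        draft = run_pass("distil-large-v3", audio_path)  # Rule 2: refinement
    return enhance_with_deepseek(draft)                  # Rule 3: enhancement
```
A production version would re-transcribe only the low-confidence spans and overlap independent stages (Rule 5) instead of re-running the whole file.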
#### Error Handling
- **Memory Pressure**: Automatically reduce batch size and retry
- **Model Loading Failure**: Fall back to v1 pipeline with warning
- **Processing Failure**: Save partial results, allow retry from last successful stage
### Feature 2: Speaker Diarization with Pyannote.audio
#### Purpose
Identify and label different speakers in multi-speaker content
#### User Stories
- **As a** researcher, **I want** speaker identification, **so that** I can follow conversations and discussions
- **As a** researcher, **I want** accurate speaker labels, **so that** I can attribute quotes and ideas correctly
#### Acceptance Criteria
- [ ] **Given** a multi-speaker file, **When** I run diarization, **Then** I get 90%+ speaker identification accuracy
- [ ] **Given** a diarized transcript, **When** I view it, **Then** speaker labels are clearly marked and consistent
- [ ] **Given** a diarization failure, **When** I retry, **Then** the system provides clear error guidance
#### Input Validation Rules
- **Audio quality**: Must have detectable speech - Error: "No speech detected"
- **Speaker count**: Must have ≥2 speakers for diarization - Error: "Single speaker detected"
- **Audio duration**: ≥30 seconds for reliable diarization - Error: "Audio too short for diarization"
#### Business Logic Rules
- **Rule 1**: Run diarization in parallel with transcription
- **Rule 2**: Use Pyannote.audio with optimized parameters for speed
- **Rule 3**: Cache speaker embedding model to avoid reloading
- **Rule 4**: Merge diarization results with transcript timestamps (sketched after this list)
- **Rule 5**: Provide speaker count estimation before processing
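A sketch of Rules 2 and 4, assuming pyannote.audio's pretrained pipeline; the checkpoint name and the longest-overlap merge heuristic are assumptions:
```python
from pyannote.audio import Pipeline

def diarize(audio_path: str, hf_token: str) -> list[dict]:
    # In Trax this pipeline object would be cached across files (Rule 3)
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    annotation = pipeline(audio_path)
    return [
        {"start": turn.start, "end": turn.end, "speaker": speaker}
        for turn, _track, speaker in annotation.itertracks(yield_label=True)
    ]

def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Rule 4: label each transcript segment with the speaker whose
    turn overlaps it the most."""
    for seg in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        seg["speaker"] = best
    return segments
```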
#### Error Handling
- **Diarization Failure**: Continue with transcription only, mark as single speaker
- **Memory Issues**: Reduce audio chunk size and retry
- **Model Loading**: Provide clear instructions for HuggingFace token setup
### Feature 3: Domain-Specific Model Adaptation (LoRA)
#### Purpose
Improve accuracy for specific content domains using lightweight model adaptation
#### User Stories
- **As a** researcher, **I want** domain-specific accuracy, **so that** technical terms and jargon are correctly transcribed
- **As a** researcher, **I want** flexible domain selection, **so that** I can optimize for different content types
#### Acceptance Criteria
- [ ] **Given** a technical podcast, **When** I use technical domain, **Then** technical terms are more accurately transcribed
- [ ] **Given** a medical lecture, **When** I use medical domain, **Then** medical terminology is correctly captured
- [ ] **Given** a domain model, **When** I switch domains, **Then** the system loads the appropriate LoRA adapter
#### Input Validation Rules
- **Domain selection**: Must be valid domain (technical, medical, academic, general) - Error: "Invalid domain"
- **LoRA availability**: Domain model must be available - Error: "Domain model not available"
#### Business Logic Rules
- **Rule 1**: Load base Whisper model once, swap LoRA adapters as needed
- **Rule 2**: Cache LoRA adapters in memory for fast switching (sketched below)
- **Rule 3**: Provide domain auto-detection based on content analysis
- **Rule 4**: Allow custom domain training with user-provided data
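A sketch of Rules 1-2 using Hugging Face transformers and peft; the base model ID, adapter paths, and the manager class itself are illustrative, not actual Trax APIs:
```python
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

class AdapterManager:
    """Keeps one resident base model and hot-swaps cached LoRA adapters."""

    def __init__(self, base_model_id: str = "openai/whisper-large-v3"):
        # Rule 1: load the base Whisper model exactly once
        self.base = WhisperForConditionalGeneration.from_pretrained(base_model_id)
        self.model: PeftModel | None = None
        self.loaded: set[str] = set()

    def use_domain(self, domain: str, adapter_path: str) -> PeftModel:
        if self.model is None:
            # The first adapter wraps the base model
            self.model = PeftModel.from_pretrained(
                self.base, adapter_path, adapter_name=domain
            )
        elif domain not in self.loaded:
            # Rule 2: further adapters stay cached in memory
            self.model.load_adapter(adapter_path, adapter_name=domain)
        self.loaded.add(domain)
        self.model.set_adapter(domain)  # switch without reloading the base
        return self.model
```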
#### Error Handling
- **LoRA Loading Failure**: Fall back to base model with warning
- **Domain Detection Failure**: Use general domain as default
- **Memory Issues**: Unload unused adapters automatically
### Feature 4: Enhanced CLI Interface
#### Purpose
Provide an enhanced command-line interface with improved batch processing and progress reporting
#### User Stories
- **As a** researcher, **I want** enhanced CLI progress reporting, **so that** I can monitor long-running jobs effectively
- **As a** researcher, **I want** improved batch processing, **so that** I can efficiently process multiple files
#### Acceptance Criteria
- [ ] **Given** a batch of files, **When** I run batch processing, **Then** I see real-time progress for each file
- [ ] **Given** a processing job, **When** I monitor progress, **Then** I see detailed stage information and performance metrics
- [ ] **Given** a completed transcript, **When** I view it, **Then** I can see speaker labels and confidence scores in the output
#### Input Validation Rules
- **File processing**: Max 500MB per file - Error: "File too large"
- **File types**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **Batch size**: Max 50 files per batch - Error: "Batch too large"
#### Business Logic Rules
- **Rule 1**: Real-time progress updates via CLI output
- **Rule 2**: Batch processing with configurable concurrency (sketched after this list)
- **Rule 3**: Detailed logging with configurable verbosity
- **Rule 4**: Processing jobs use same pipeline as single files
- **Rule 5**: Transcript output includes speaker diarization information
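A sketch of Rules 1-2 with Click (the CLI toolkit already used in v1) and an asyncio semaphore for configurable concurrency; `process_file` stands in for the real v2 pipeline entry point:
```python
import asyncio
from pathlib import Path

import click

async def process_file(path: Path) -> None:
    await asyncio.sleep(0)  # placeholder for the real v2 pipeline call

@click.command()
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
@click.option("--workers", default=4, show_default=True, help="Concurrent files")
def batch(directory: str, workers: int) -> None:
    """Process every media file in DIRECTORY with the v2 pipeline."""
    # A real version would match all supported formats, not just mp3
    files = sorted(Path(directory).glob("*.mp3"))
    sem = asyncio.Semaphore(workers)  # Rule 2: configurable concurrency

    async def run_one(path: Path) -> None:
        async with sem:
            click.echo(f"[start] {path.name}")
            await process_file(path)
            click.echo(f"[done]  {path.name}")  # Rule 1: real-time updates

    async def main() -> None:
        await asyncio.gather(*(run_one(f) for f in files))

    asyncio.run(main())
```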
#### Error Handling
- **Processing Failure**: Clear error message with retry guidance
- **Batch Failure**: Continue with remaining files, report failures
- **Memory Issues**: Automatic batch size reduction with warning
## 💻 CLI Interface Flows
### Flow 1: High-Performance Transcription
#### Command: Single File Processing
```bash
# Basic v2 transcription
trax transcribe --v2 audio.mp3
# With diarization
trax transcribe --v2 --diarize audio.mp3
# With domain-specific model
trax transcribe --v2 --domain technical audio.mp3
# With custom quality threshold
trax transcribe --v2 --accuracy 0.995 audio.mp3
```
#### Progress Reporting
- **Real-time progress**: Stage-by-stage progress with time estimates
- **Performance metrics**: CPU usage, memory usage, processing speed
- **Quality indicators**: Confidence scores, accuracy estimates
- **Error reporting**: Clear error messages with retry guidance
### Flow 2: Batch Processing with Diarization
#### Command: Batch Processing
```bash
# Process directory of files
trax batch --v2 --diarize /path/to/media/files/
# With parallel processing
trax batch --v2 --workers 4 --diarize /path/to/media/files/
# With domain detection
trax batch --v2 --auto-domain --diarize /path/to/media/files/
```
#### Batch Progress Reporting
- **Overall progress**: Total batch completion percentage
- **Current file**: Currently processing file with stage
- **Diarization status**: Speaker count, processing stage
- **Queue status**: Files remaining, completed, failed
- **Performance metrics**: Average processing time, accuracy
## 🔄 Data Flow & State Management
### Enhanced Data Models (PostgreSQL Schema)
#### Transcript (Enhanced for v2)
```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2, v2+)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "diarization_content": "JSONB (optional, Pyannote output)",
  "merged_content": "JSONB (required, final transcript with speakers)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "domain_used": "string (optional, technical, medical, etc.)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "confidence_scores": "JSONB (optional, per-segment confidence)",
  "speaker_count": "integer (optional, number of speakers detected)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "diarized_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```
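A partial SQLAlchemy 2.0 sketch of this model (a few representative columns only), assuming PostgreSQL UUID/JSONB column types; the table and foreign-key names are assumptions:
```python
import uuid
from datetime import datetime

from sqlalchemy import Float, ForeignKey, Integer, String, func
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Transcript(Base):
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    media_file_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("media_files.id")
    )
    pipeline_version: Mapped[str] = mapped_column(String(8))
    raw_content: Mapped[dict] = mapped_column(JSONB)
    merged_content: Mapped[dict] = mapped_column(JSONB)
    diarization_content: Mapped[dict | None] = mapped_column(JSONB)
    accuracy_estimate: Mapped[float | None] = mapped_column(Float)
    speaker_count: Mapped[int | None] = mapped_column(Integer)
    processing_time_ms: Mapped[int] = mapped_column(Integer)
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
```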
#### SpeakerProfile (New for v2)
```json
{
  "id": "UUID (required, primary key)",
  "transcript_id": "UUID (required, foreign key)",
  "speaker_id": "string (required, speaker label)",
  "embedding_vector": "JSONB (required, speaker embedding)",
  "speech_segments": "JSONB (required, time segments)",
  "total_duration": "float (required, seconds)",
  "word_count": "integer (required)",
  "confidence_score": "float (optional, 0.0-1.0)",
  "created_at": "timestamp (auto-generated)"
}
```
#### ProcessingJob (New for v2)
```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_config": "JSONB (required, processing parameters)",
  "status": "enum (queued, processing, completed, failed)",
  "current_stage": "string (optional, current pipeline stage)",
  "progress_percentage": "float (optional, 0.0-100.0)",
  "error_message": "text (optional)",
  "started_at": "timestamp (optional)",
  "completed_at": "timestamp (optional)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```
### Enhanced State Transitions
#### ProcessingJob State Machine
```
[queued] → [processing] → [transcribing] → [enhancing] → [diarizing] → [merging] → [completed]
[queued] → [processing] → [failed] → [retry] → [processing]
```
#### Transcript State Machine (Enhanced)
```
[processing] → [transcribed] → [enhanced] → [diarized] → [merged] → [completed]
[processing] → [transcribed] → [enhanced] → [completed] (no diarization)
[processing] → [failed] → [retry] → [processing]
```
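A minimal sketch of the ProcessingJob machine as an allow-list of transitions; the enum values mirror the diagram, the guard function is illustrative:
```python
from enum import Enum

class Stage(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    TRANSCRIBING = "transcribing"
    ENHANCING = "enhancing"
    DIARIZING = "diarizing"
    MERGING = "merging"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRY = "retry"

ALLOWED = {
    Stage.QUEUED: {Stage.PROCESSING},
    Stage.PROCESSING: {Stage.TRANSCRIBING, Stage.FAILED},
    Stage.TRANSCRIBING: {Stage.ENHANCING, Stage.FAILED},
    Stage.ENHANCING: {Stage.DIARIZING, Stage.FAILED},
    Stage.DIARIZING: {Stage.MERGING, Stage.FAILED},
    Stage.MERGING: {Stage.COMPLETED, Stage.FAILED},
    Stage.FAILED: {Stage.RETRY},
    Stage.RETRY: {Stage.PROCESSING},
}

def advance(current: Stage, nxt: Stage) -> Stage:
    """Reject any transition the diagram does not allow."""
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {nxt.value}")
    return nxt
```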
### Data Validation Rules
- **Rule 1**: Processing time must be >0 and <1800 seconds (30 minutes)
- **Rule 2**: Accuracy estimate must be between 0.0 and 1.0
- **Rule 3**: Speaker count must be ≥1 if diarization is enabled
- **Rule 4**: Confidence scores must be between 0.0 and 1.0
- **Rule 5**: Domain must be valid if specified (validation of these rules is sketched below)
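A sketch of these five rules as one validation helper; the thresholds come straight from the rules above, and the row layout (a flat list of confidence scores) is an assumption:
```python
VALID_DOMAINS = {"technical", "medical", "academic", "general"}

def validate_transcript(row: dict) -> list[str]:
    errors = []
    if not (0 < row["processing_time_ms"] <= 1_800_000):          # Rule 1
        errors.append("processing time must be >0 and <=30 minutes")
    acc = row.get("accuracy_estimate")
    if acc is not None and not (0.0 <= acc <= 1.0):               # Rule 2
        errors.append("accuracy_estimate must be in [0.0, 1.0]")
    if row.get("diarization_content") and (row.get("speaker_count") or 0) < 1:
        errors.append("speaker_count must be >=1 when diarized")  # Rule 3
    if any(not 0.0 <= s <= 1.0 for s in row.get("confidence_scores") or []):
        errors.append("confidence scores must be in [0.0, 1.0]")  # Rule 4
    domain = row.get("domain_used")
    if domain is not None and domain not in VALID_DOMAINS:        # Rule 5
        errors.append(f"invalid domain: {domain}")
    return errors
```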
## 🧪 Testing Requirements
### Unit Tests
- [ ] `test_multi_pass_pipeline`: Test all pipeline stages and transitions
- [ ] `test_diarization_service`: Test Pyannote.audio integration
- [ ] `test_lora_adapter_manager`: Test domain-specific model loading
- [ ] `test_confidence_scoring`: Test confidence calculation and thresholding (sketched after this list)
- [ ] `test_cli_interface`: Test CLI commands, options, and output formatting
- [ ] `test_parallel_processing`: Test concurrent pipeline execution
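A pytest sketch for `test_confidence_scoring`, assuming a hypothetical `score_segments` helper that maps Whisper's `avg_logprob` into [0, 1]:
```python
import math

REFINE_THRESHOLD = 0.85  # assumed, mirrors the pipeline sketch above

def score_segments(segments: list[dict]) -> list[dict]:
    """Hypothetical helper under test: attach exp(avg_logprob) confidence."""
    return [{**s, "confidence": math.exp(s["avg_logprob"])} for s in segments]

def test_confidence_scoring_flags_low_confidence_segments():
    segments = [
        {"text": "clear speech", "avg_logprob": -0.05},  # high confidence
        {"text": "mumbled part", "avg_logprob": -1.50},  # low confidence
    ]
    scored = score_segments(segments)
    assert all(0.0 <= s["confidence"] <= 1.0 for s in scored)
    flagged = [s for s in scored if s["confidence"] < REFINE_THRESHOLD]
    assert [s["text"] for s in flagged] == ["mumbled part"]
```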
### Integration Tests
- [ ] `test_pipeline_v2_complete`: End-to-end v2 transcription with diarization
- [ ] `test_domain_adaptation`: Test LoRA adapter switching and accuracy
- [ ] `test_batch_processing_v2`: Process 10 files with v2 pipeline
- [ ] `test_cli_batch_processing`: Test CLI batch processing with multiple files
- [ ] `test_performance_targets`: Verify <25 second processing time for 5-minute files
### Edge Cases
- [ ] Single speaker in multi-speaker file: Should handle gracefully
- [ ] Poor audio quality with diarization: Should provide clear warnings
- [ ] Memory pressure during processing: Should handle gracefully
- [ ] LoRA adapter loading failure: Should fall back to base model
- [ ] CLI progress reporting: Should show real-time updates
- [ ] Large files with diarization: Should chunk appropriately
## 🚀 Implementation Phases
### Phase 1: Multi-Pass Pipeline Foundation (Weeks 1-2)
**Goal**: Implement core multi-pass transcription pipeline
- [ ] Enhanced task system with pipeline stages - Support complex multi-stage workflows
- [ ] ModelManager singleton for model caching - Prevent memory duplication
- [ ] Multi-pass implementation (fast + refinement + enhancement) - Achieve 99.5%+ accuracy
- [ ] Confidence scoring system - Identify low-confidence segments
- [ ] Performance optimization (8-bit quantization) - Reduce memory usage by 50%
### Phase 2: Speaker Diarization Integration (Weeks 3-4)
**Goal**: Integrate Pyannote.audio for speaker identification
- [ ] Pyannote.audio integration - 90%+ speaker identification accuracy
- [ ] Parallel diarization and transcription - Minimize total processing time
- [ ] Speaker embedding caching - Avoid model reloading
- [ ] Diarization-transcript merging - Combine timestamps and speaker labels
- [ ] Speaker profile storage - Track speakers across multiple files
### Phase 3: Domain Adaptation and LoRA (Weeks 5-6)
**Goal**: Implement domain-specific model adaptation
- [ ] LoRA adapter system - Lightweight domain-specific models
- [ ] Domain auto-detection - Automatic content analysis
- [ ] Pre-trained domain models - Technical, medical, academic domains
- [ ] Custom domain training - User-provided data support
- [ ] Domain switching optimization - Fast adapter loading
### Phase 4: Enhanced CLI Interface (Weeks 7-8)
**Goal**: Develop enhanced CLI interface with improved batch processing
- [ ] Enhanced progress reporting - Real-time stage updates
- [ ] Batch processing improvements - Configurable concurrency
- [ ] Detailed logging system - Configurable verbosity levels
- [ ] Performance monitoring - CPU/memory usage display
- [ ] Error handling improvements - Clear retry guidance
### Phase 5: Performance Optimization and Polish (Weeks 9-10)
**Goal**: Achieve performance targets and final polish
- [ ] Performance benchmarking - Verify <25 second processing time
- [ ] Memory optimization - Stay under 8GB peak usage
- [ ] Error handling refinement - Comprehensive error recovery
- [ ] Documentation and user guides - Complete documentation
- [ ] Final testing and validation - End-to-end testing
## 🔒 Security & Constraints
### Security Requirements
- **API Key Management**: Secure storage of all API keys (Whisper, DeepSeek, HuggingFace)
- **Local Access Only**: CLI interface only, no network exposure
- **File Access**: Local file system access only
- **Data Protection**: Encrypted storage for sensitive transcripts
- **Input Sanitization**: Validate all file paths, URLs, and user inputs
### Performance Constraints
- **Response Time**: <25 seconds for 5-minute audio (v2)
- **Accuracy Target**: 99.5%+ transcription accuracy
- **Diarization Accuracy**: 90%+ speaker identification accuracy
- **Memory Usage**: <8GB peak memory usage (M3 MacBook 16GB)
- **Parallel Workers**: 8 workers for optimal M3 performance
- **Model Loading**: <5 seconds for model switching
### Technical Constraints
- **File Formats**: mp3, mp4, wav, m4a, webm only
- **File Size**: Maximum 500MB per file
- **Audio Duration**: Maximum 2 hours per file
- **Network**: Download-first, no streaming processing
- **Storage**: Local storage required, no cloud-only processing
- **Single Node**: No distributed processing, single-machine architecture
- **YouTube**: Curl-based metadata extraction only (sketched below)
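A sketch of the curl-based metadata path using YouTube's public oEmbed endpoint (no API key needed); the subprocess wrapper is illustrative:
```python
import json
import subprocess

def youtube_metadata(video_url: str) -> dict:
    """Fetch basic metadata (title, author_name, thumbnail_url) via curl."""
    raw = subprocess.run(
        ["curl", "-sfG", "https://www.youtube.com/oembed",
         "--data-urlencode", f"url={video_url}",
         "--data-urlencode", "format=json"],
        capture_output=True, check=True, text=True,
    ).stdout
    return json.loads(raw)

# Usage: youtube_metadata("https://www.youtube.com/watch?v=<VIDEO_ID>")
```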
## ✅ Definition of Done
### Feature Complete
- [ ] All acceptance criteria met with real test files
- [ ] 99.5%+ accuracy achieved on test dataset
- [ ] 90%+ speaker identification accuracy achieved
- [ ] <25 second processing time for 5-minute files
- [ ] Unit tests passing with >80% coverage
- [ ] Integration tests passing with actual services
- [ ] Code review completed
- [ ] Documentation updated in rule files
- [ ] Performance benchmarks met
### Ready for Deployment
- [ ] Performance targets achieved (speed, accuracy, memory)
- [ ] Security review completed
- [ ] Error handling tested with edge cases
- [ ] User acceptance testing with real files
- [ ] CLI interface tested and functional
- [ ] Rollback plan prepared for v2 deployment
- [ ] Monitoring and logging configured
### Trax v2-Specific Criteria
- [ ] Multi-pass pipeline delivers 99.5%+ accuracy
- [ ] Speaker diarization works reliably across content types
- [ ] Domain adaptation improves accuracy for specialized content
- [ ] CLI interface provides superior user experience
- [ ] Performance targets met without distributed architecture
- [ ] Memory usage optimized for single-node deployment
- [ ] Backward compatibility maintained with v1 features
---
*This PRD is specifically designed for Trax v2, focusing on high performance and speaker diarization as the core differentiators while maintaining the simplicity and determinism of the single-node architecture.*