# Trax v2.0 PRD: High-Performance Transcription with Speaker Diarization

## 🎯 Product Vision

*"We're building a high-performance personal transcription tool that delivers exceptional accuracy (99.5%+) and robust speaker diarization, enabling researchers to transform complex multi-speaker content into structured, searchable text with clear speaker attribution."*

## 🏗️ System Architecture Overview

### Core Components

- **Data Layer**: PostgreSQL with JSONB, SQLAlchemy registry pattern (inherited from v1)
- **Business Logic**: Protocol-based services, async/await throughout, enhanced with a multi-pass pipeline
- **Interface Layer**: CLI-first with Click, batch processing focus
- **Integration Layer**: Download-first architecture, curl-based YouTube metadata
- **AI Layer**: Multi-stage refinement pipeline, Pyannote.audio diarization, LoRA domain adaptation

### System Boundaries

- **What's In Scope**: Local media processing, high-accuracy transcription, speaker diarization, domain-specific models, CLI interface
- **What's Out of Scope**: Real-time streaming, cloud processing, multi-user support, distributed systems
- **Integration Points**: Whisper API, DeepSeek API, Pyannote.audio, FFmpeg, PostgreSQL, YouTube (curl)

## 👥 User Profile

### Primary User: Advanced Personal Researcher

- **Role**: Individual researcher processing complex educational content with multiple speakers
- **Content Types**: Tech podcasts, academic lectures, panel discussions, interviews, audiobooks
- **Workflow**: Batch URL collection → Download → High-accuracy transcription with diarization → Study
- **Goals**: 99.5%+ accuracy, speaker identification, fast processing, searchable content
- **Constraints**: Local storage, API costs, processing time, single-node architecture

## 🔧 Functional Requirements

### Feature 1: Multi-Pass Transcription Pipeline

#### Purpose

Achieve 99.5%+ accuracy through intelligent multi-stage processing.

#### User Stories

- **As a** researcher, **I want** ultra-high-accuracy transcripts, **so that** I can rely on the content for detailed analysis
- **As a** researcher, **I want** fast processing despite the accuracy target, **so that** I can process large batches efficiently

#### Acceptance Criteria

- [ ] **Given** a media file, **When** I run `trax transcribe --v2 <file>`, **Then** I get a 99.5%+ accuracy transcript in <25 seconds
- [ ] **Given** a multi-pass transcript, **When** I compare it to v1 output, **Then** accuracy improves by ≥4.5 percentage points
- [ ] **Given** a transcript with confidence scores, **When** I review it, **Then** I can identify low-confidence segments

#### Input Validation Rules

- **File format**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **File size**: ≤500MB - Error: "File too large, max 500MB"
- **Audio duration**: >0.1 seconds - Error: "File too short or silent"

#### Business Logic Rules

- **Rule 1**: First pass uses distil-small.en for speed (10-15 seconds)
- **Rule 2**: Second pass uses distil-large-v3 for accuracy refinement
- **Rule 3**: Third pass uses DeepSeek for context-aware enhancement
- **Rule 4**: Confidence scoring identifies segments needing refinement
- **Rule 5**: Independent pipeline stages run in parallel

#### Error Handling

- **Memory Pressure**: Automatically reduce batch size and retry
- **Model Loading Failure**: Fall back to the v1 pipeline with a warning
- **Processing Failure**: Save partial results, allow retry from the last successful stage

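The gating step implied by Rules 2-4 can be sketched as follows. Everything here is illustrative: the `Segment` shape, the threshold value, and the function name are assumptions, and in the real pipeline segments labeled "refine" would be re-decoded with distil-large-v3 and then enhanced by DeepSeek.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    confidence: float  # 0.0-1.0, emitted by the fast ASR pass

REFINE_THRESHOLD = 0.90  # illustrative cutoff; the real value is a tuning choice

def plan_refinement(segments: list[Segment]) -> list[str]:
    """Decide per segment whether the fast-pass result is kept or sent on
    to the refinement passes (Rules 2-4). Returns "keep"/"refine" labels."""
    return ["keep" if s.confidence >= REFINE_THRESHOLD else "refine"
            for s in segments]
```

Only the flagged segments pay the cost of the larger model, which is how the pipeline targets high accuracy without running every pass over every segment.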
### Feature 2: Speaker Diarization with Pyannote.audio

#### Purpose

Identify and label different speakers in multi-speaker content.

#### User Stories

- **As a** researcher, **I want** speaker identification, **so that** I can follow conversations and discussions
- **As a** researcher, **I want** accurate speaker labels, **so that** I can attribute quotes and ideas correctly

#### Acceptance Criteria

- [ ] **Given** a multi-speaker file, **When** I run diarization, **Then** I get 90%+ speaker identification accuracy
- [ ] **Given** a diarized transcript, **When** I view it, **Then** speaker labels are clearly marked and consistent
- [ ] **Given** a diarization failure, **When** I retry, **Then** the system provides clear error guidance

#### Input Validation Rules

- **Audio quality**: Must have detectable speech - Error: "No speech detected"
- **Speaker count**: Must have ≥2 speakers for diarization - Error: "Single speaker detected"
- **Audio duration**: ≥30 seconds for reliable diarization - Error: "Audio too short for diarization"

#### Business Logic Rules

- **Rule 1**: Run diarization in parallel with transcription
- **Rule 2**: Use Pyannote.audio with parameters tuned for speed
- **Rule 3**: Cache the speaker embedding model to avoid reloading
- **Rule 4**: Merge diarization results with transcript timestamps
- **Rule 5**: Provide a speaker count estimate before processing

#### Error Handling

- **Diarization Failure**: Continue with transcription only, mark as single speaker
- **Memory Issues**: Reduce audio chunk size and retry
- **Model Loading**: Provide clear instructions for HuggingFace token setup

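The merge in Rule 4 amounts to aligning two timelines: transcript segments and diarization turns. A minimal sketch, assuming segments arrive as `(start, end, text)` tuples and turns as `(start, end, speaker_label)` tuples (both shapes are assumptions, not the actual Trax data model); each segment gets the speaker whose turn overlaps it the longest.

```python
def assign_speakers(segments, turns):
    """Give each transcript segment the speaker whose diarization turn
    overlaps it the most; segments with no overlapping turn stay unknown."""
    merged = []
    for seg_start, seg_end, text in segments:
        best_label, best_overlap = "SPEAKER_UNKNOWN", 0.0
        for turn_start, turn_end, label in turns:
            # Overlap of [seg_start, seg_end] with [turn_start, turn_end]
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        merged.append({"start": seg_start, "end": seg_end,
                       "text": text, "speaker": best_label})
    return merged
```

Max-overlap assignment tolerates the small timestamp disagreements that are typical between ASR and diarization outputs, which is why it is a common merge heuristic.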
### Feature 3: Domain-Specific Model Adaptation (LoRA)

#### Purpose

Improve accuracy for specific content domains using lightweight model adaptation.

#### User Stories

- **As a** researcher, **I want** domain-specific accuracy, **so that** technical terms and jargon are correctly transcribed
- **As a** researcher, **I want** flexible domain selection, **so that** I can optimize for different content types

#### Acceptance Criteria

- [ ] **Given** a technical podcast, **When** I use the technical domain, **Then** technical terms are more accurately transcribed
- [ ] **Given** a medical lecture, **When** I use the medical domain, **Then** medical terminology is correctly captured
- [ ] **Given** a domain model, **When** I switch domains, **Then** the system loads the appropriate LoRA adapter

#### Input Validation Rules

- **Domain selection**: Must be a valid domain (technical, medical, academic, general) - Error: "Invalid domain"
- **LoRA availability**: Domain model must be available - Error: "Domain model not available"

#### Business Logic Rules

- **Rule 1**: Load the base Whisper model once, swap LoRA adapters as needed
- **Rule 2**: Cache LoRA adapters in memory for fast switching
- **Rule 3**: Provide domain auto-detection based on content analysis
- **Rule 4**: Allow custom domain training with user-provided data

#### Error Handling

- **LoRA Loading Failure**: Fall back to the base model with a warning
- **Domain Detection Failure**: Use the general domain as the default
- **Memory Issues**: Unload unused adapters automatically

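Rule 2 (in-memory adapter cache) and the "unload unused adapters" error-handling rule fit a small LRU cache. A sketch with stubbed loading: in practice `loader` would wrap something like `peft.PeftModel.from_pretrained` over the shared base Whisper model, but the class and its API here are hypothetical.

```python
class AdapterCache:
    """Keep at most `max_adapters` LoRA adapters resident; evict the least
    recently used one when a new domain is requested (loading is stubbed)."""

    def __init__(self, max_adapters: int = 2):
        self.max_adapters = max_adapters
        self._cache: dict[str, object] = {}
        self._order: list[str] = []  # least recently used first

    def get(self, domain: str, loader):
        if domain in self._cache:
            self._order.remove(domain)
            self._order.append(domain)  # mark as most recently used
            return self._cache[domain]
        if len(self._cache) >= self.max_adapters:
            evicted = self._order.pop(0)  # unload the LRU adapter
            del self._cache[evicted]
        adapter = loader(domain)  # expensive load happens at most once per stay
        self._cache[domain] = adapter
        self._order.append(domain)
        return adapter
```

Switching back to a recently used domain is then a dictionary lookup rather than a reload, which is what makes fast domain switching feasible.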
### Feature 4: Enhanced CLI Interface

#### Purpose

Provide an enhanced command-line interface with improved batch processing and progress reporting.

#### User Stories

- **As a** researcher, **I want** enhanced CLI progress reporting, **so that** I can monitor long-running jobs effectively
- **As a** researcher, **I want** improved batch processing, **so that** I can efficiently process multiple files

#### Acceptance Criteria

- [ ] **Given** a batch of files, **When** I run batch processing, **Then** I see real-time progress for each file
- [ ] **Given** a processing job, **When** I monitor progress, **Then** I see detailed stage information and performance metrics
- [ ] **Given** a completed transcript, **When** I view it, **Then** I can see speaker labels and confidence scores in the output

#### Input Validation Rules

- **File size**: Max 500MB per file - Error: "File too large"
- **File types**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **Batch size**: Max 50 files per batch - Error: "Batch too large"

#### Business Logic Rules

- **Rule 1**: Real-time progress updates via CLI output
- **Rule 2**: Batch processing with configurable concurrency
- **Rule 3**: Detailed logging with configurable verbosity
- **Rule 4**: Batch jobs use the same pipeline as single files
- **Rule 5**: Transcript output includes speaker diarization information

#### Error Handling

- **Processing Failure**: Clear error message with retry guidance
- **Batch Failure**: Continue with remaining files, report failures
- **Memory Issues**: Automatic batch size reduction with a warning

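Rule 2 (configurable concurrency) plus the "continue with remaining files" error-handling rule can be sketched with an `asyncio.Semaphore`, consistent with the async/await architecture stated earlier. The function name and the `process_one` callback are illustrative stand-ins, not the actual Trax API.

```python
import asyncio

async def process_batch(files, process_one, workers: int = 4):
    """Run `process_one` over `files` with at most `workers` in flight.
    One file failing does not stop the rest; failures are collected
    and reported at the end, per the batch error-handling rules."""
    sem = asyncio.Semaphore(workers)
    results, failures = {}, {}

    async def run(path):
        async with sem:  # bounded concurrency (Rule 2)
            try:
                results[path] = await process_one(path)
            except Exception as exc:
                failures[path] = str(exc)  # record and keep going

    await asyncio.gather(*(run(f) for f in files))
    return results, failures
```

A progress callback could be invoked inside `run` after each completion to drive the real-time CLI updates described in Rule 1.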
## 💻 CLI Interface Flows

### Flow 1: High-Performance Transcription

#### Command: Single File Processing

```bash
# Basic v2 transcription
trax transcribe --v2 audio.mp3

# With diarization
trax transcribe --v2 --diarize audio.mp3

# With domain-specific model
trax transcribe --v2 --domain technical audio.mp3

# With custom quality threshold
trax transcribe --v2 --accuracy 0.995 audio.mp3
```

#### Progress Reporting

- **Real-time progress**: Stage-by-stage progress with time estimates
- **Performance metrics**: CPU usage, memory usage, processing speed
- **Quality indicators**: Confidence scores, accuracy estimates
- **Error reporting**: Clear error messages with retry guidance

### Flow 2: Batch Processing with Diarization

#### Command: Batch Processing

```bash
# Process a directory of files
trax batch --v2 --diarize /path/to/media/files/

# With parallel processing
trax batch --v2 --workers 4 --diarize /path/to/media/files/

# With domain detection
trax batch --v2 --auto-domain --diarize /path/to/media/files/
```

#### Batch Progress Reporting

- **Overall progress**: Total batch completion percentage
- **Current file**: Currently processing file with stage
- **Diarization status**: Speaker count, processing stage
- **Queue status**: Files remaining, completed, failed
- **Performance metrics**: Average processing time, accuracy

## 🔄 Data Flow & State Management

### Enhanced Data Models (PostgreSQL Schema)

#### Transcript (Enhanced for v2)

```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2, v2+)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "diarization_content": "JSONB (optional, Pyannote output)",
  "merged_content": "JSONB (required, final transcript with speakers)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, Whisper model version)",
  "domain_used": "string (optional, technical, medical, etc.)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "confidence_scores": "JSONB (optional, per-segment confidence)",
  "speaker_count": "integer (optional, number of speakers detected)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "diarized_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```

#### SpeakerProfile (New for v2)

```json
{
  "id": "UUID (required, primary key)",
  "transcript_id": "UUID (required, foreign key)",
  "speaker_id": "string (required, speaker label)",
  "embedding_vector": "JSONB (required, speaker embedding)",
  "speech_segments": "JSONB (required, time segments)",
  "total_duration": "float (required, seconds)",
  "word_count": "integer (required)",
  "confidence_score": "float (optional, 0.0-1.0)",
  "created_at": "timestamp (auto-generated)"
}
```

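The derived `total_duration` field is just the sum of a speaker's segment lengths. A one-line sketch, assuming each segment is stored as a `{"start": ..., "end": ...}` object (the exact JSONB shape is an assumption):

```python
def total_speaking_time(speech_segments: list[dict]) -> float:
    """Sum of (end - start) over a speaker's segments, in seconds --
    the `total_duration` field above (segment shape is assumed)."""
    return round(sum(seg["end"] - seg["start"] for seg in speech_segments), 3)
```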
#### ProcessingJob (New for v2)

```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_config": "JSONB (required, processing parameters)",
  "status": "enum (queued, processing, completed, failed)",
  "current_stage": "string (optional, current pipeline stage)",
  "progress_percentage": "float (optional, 0.0-100.0)",
  "error_message": "text (optional)",
  "started_at": "timestamp (optional)",
  "completed_at": "timestamp (optional)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```

### Enhanced State Transitions

#### ProcessingJob State Machine

```
[queued] → [processing] → [transcribing] → [enhancing] → [diarizing] → [merging] → [completed]
[queued] → [processing] → [failed] → [retry] → [processing]
```

#### Transcript State Machine (Enhanced)

```
[processing] → [transcribed] → [enhanced] → [diarized] → [merged] → [completed]
[processing] → [transcribed] → [enhanced] → [completed] (no diarization)
[processing] → [failed] → [retry] → [processing]
```

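The ProcessingJob machine can be enforced with a plain transition table read off the diagram above. This is a sketch, not the Trax implementation: it encodes only the transitions the diagram shows (for instance, failure paths from the later stages are not drawn and so are not included here).

```python
# Legal ProcessingJob transitions, taken from the state machine above.
TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"transcribing", "failed"},
    "transcribing": {"enhancing"},
    "enhancing": {"diarizing"},
    "diarizing": {"merging"},
    "merging": {"completed"},
    "failed": {"retry"},
    "retry": {"processing"},
    "completed": set(),  # terminal state
}

def advance(current: str, target: str) -> str:
    """Guard helper: refuse any transition the diagram does not allow."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```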
### Data Validation Rules

- **Rule 1**: Processing time must be >0 and <1800 seconds (30 minutes)
- **Rule 2**: Accuracy estimate must be between 0.0 and 1.0
- **Rule 3**: Speaker count must be ≥1 if diarization is enabled
- **Rule 4**: Confidence scores must be between 0.0 and 1.0
- **Rule 5**: Domain must be valid if specified

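Rules 1-5 can be checked in one pass over a transcript record. A minimal sketch against the Transcript schema above: note Rule 1 is stated in seconds but the stored field is `processing_time_ms`, so the bound becomes 1,800,000 ms; the function name and error strings are illustrative.

```python
VALID_DOMAINS = {"technical", "medical", "academic", "general"}

def validate_transcript(rec: dict) -> list[str]:
    """Apply the data validation rules above; return a list of violations."""
    errors = []
    t = rec.get("processing_time_ms", 0)
    if not (0 < t < 1_800_000):  # Rule 1: >0 and under 30 minutes, in ms
        errors.append("processing_time_ms out of range")
    acc = rec.get("accuracy_estimate")
    if acc is not None and not (0.0 <= acc <= 1.0):  # Rule 2
        errors.append("accuracy_estimate out of range")
    if rec.get("diarization_content") is not None and rec.get("speaker_count", 0) < 1:
        errors.append("speaker_count must be >= 1 when diarized")  # Rule 3
    if any(not (0.0 <= c <= 1.0) for c in rec.get("confidence_scores", [])):
        errors.append("confidence score out of range")  # Rule 4
    domain = rec.get("domain_used")
    if domain is not None and domain not in VALID_DOMAINS:  # Rule 5
        errors.append("invalid domain")
    return errors
```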
## 🧪 Testing Requirements

### Unit Tests

- [ ] `test_multi_pass_pipeline`: Test all pipeline stages and transitions
- [ ] `test_diarization_service`: Test Pyannote.audio integration
- [ ] `test_lora_adapter_manager`: Test domain-specific model loading
- [ ] `test_confidence_scoring`: Test confidence calculation and thresholding
- [ ] `test_cli_interface`: Test Click commands, options, and progress output
- [ ] `test_parallel_processing`: Test concurrent pipeline execution

### Integration Tests

- [ ] `test_pipeline_v2_complete`: End-to-end v2 transcription with diarization
- [ ] `test_domain_adaptation`: Test LoRA adapter switching and accuracy
- [ ] `test_batch_processing_v2`: Process 10 files with the v2 pipeline
- [ ] `test_cli_batch_processing`: Test CLI batch processing with multiple files
- [ ] `test_performance_targets`: Verify <25 second processing time

### Edge Cases

- [ ] Single speaker in a file submitted for diarization: Should handle gracefully
- [ ] Poor audio quality with diarization: Should provide clear warnings
- [ ] Memory pressure during processing: Should reduce batch size and recover
- [ ] LoRA adapter loading failure: Should fall back to the base model
- [ ] CLI progress reporting: Should show real-time updates
- [ ] Large files with diarization: Should chunk appropriately

## 🚀 Implementation Phases

### Phase 1: Multi-Pass Pipeline Foundation (Weeks 1-2)

**Goal**: Implement core multi-pass transcription pipeline

- [ ] Enhanced task system with pipeline stages - Support complex multi-stage workflows
- [ ] ModelManager singleton for model caching - Prevent memory duplication
- [ ] Multi-pass implementation (fast + refinement + enhancement) - Achieve 99.5%+ accuracy
- [ ] Confidence scoring system - Identify low-confidence segments
- [ ] Performance optimization (8-bit quantization) - Reduce memory usage by 50%

### Phase 2: Speaker Diarization Integration (Weeks 3-4)

**Goal**: Integrate Pyannote.audio for speaker identification

- [ ] Pyannote.audio integration - 90%+ speaker identification accuracy
- [ ] Parallel diarization and transcription - Minimize total processing time
- [ ] Speaker embedding caching - Avoid model reloading
- [ ] Diarization-transcript merging - Combine timestamps and speaker labels
- [ ] Speaker profile storage - Track speakers across multiple files

### Phase 3: Domain Adaptation and LoRA (Weeks 5-6)

**Goal**: Implement domain-specific model adaptation

- [ ] LoRA adapter system - Lightweight domain-specific models
- [ ] Domain auto-detection - Automatic content analysis
- [ ] Pre-trained domain models - Technical, medical, academic domains
- [ ] Custom domain training - User-provided data support
- [ ] Domain switching optimization - Fast adapter loading

### Phase 4: Enhanced CLI Interface (Weeks 7-8)

**Goal**: Develop enhanced CLI interface with improved batch processing

- [ ] Enhanced progress reporting - Real-time stage updates
- [ ] Batch processing improvements - Configurable concurrency
- [ ] Detailed logging system - Configurable verbosity levels
- [ ] Performance monitoring - CPU/memory usage display
- [ ] Error handling improvements - Clear retry guidance

### Phase 5: Performance Optimization and Polish (Weeks 9-10)

**Goal**: Achieve performance targets and final polish

- [ ] Performance benchmarking - Verify <25 second processing time
- [ ] Memory optimization - Stay under 8GB peak usage
- [ ] Error handling refinement - Comprehensive error recovery
- [ ] Documentation and user guides - Complete documentation
- [ ] Final testing and validation - End-to-end testing

## 🔒 Security & Constraints

### Security Requirements

- **API Key Management**: Secure storage of all API keys (Whisper, DeepSeek, HuggingFace)
- **Local Access Only**: CLI interface only, no network-exposed services
- **File Access**: Local file system access only
- **Data Protection**: Encrypted storage for sensitive transcripts
- **Input Sanitization**: Validate all file paths, URLs, and user inputs

### Performance Constraints

- **Response Time**: <25 seconds for 5-minute audio (v2)
- **Accuracy Target**: 99.5%+ transcription accuracy
- **Diarization Accuracy**: 90%+ speaker identification accuracy
- **Memory Usage**: <8GB peak memory usage (M3 MacBook 16GB)
- **Parallel Workers**: 8 workers for optimal M3 performance
- **Model Loading**: <5 seconds for model switching

### Technical Constraints

- **File Formats**: mp3, mp4, wav, m4a, webm only
- **File Size**: Maximum 500MB per file
- **Audio Duration**: Maximum 2 hours per file
- **Network**: Download-first, no streaming processing
- **Storage**: Local storage required, no cloud-only processing
- **Single Node**: No distributed processing, single-machine architecture
- **YouTube**: Curl-based metadata extraction only

## ✅ Definition of Done

### Feature Complete

- [ ] All acceptance criteria met with real test files
- [ ] 99.5%+ accuracy achieved on test dataset
- [ ] 90%+ speaker identification accuracy achieved
- [ ] <25 second processing time for 5-minute files
- [ ] Unit tests passing with >80% coverage
- [ ] Integration tests passing with actual services
- [ ] Code review completed
- [ ] Documentation updated in rule files
- [ ] Performance benchmarks met

### Ready for Deployment

- [ ] Performance targets achieved (speed, accuracy, memory)
- [ ] Security review completed
- [ ] Error handling tested with edge cases
- [ ] User acceptance testing with real files
- [ ] CLI interface tested and functional
- [ ] Rollback plan prepared for v2 deployment
- [ ] Monitoring and logging configured

### Trax v2-Specific Criteria

- [ ] Multi-pass pipeline delivers 99.5%+ accuracy
- [ ] Speaker diarization works reliably across content types
- [ ] Domain adaptation improves accuracy for specialized content
- [ ] CLI interface provides superior user experience
- [ ] Performance targets met without distributed architecture
- [ ] Memory usage optimized for single-node deployment
- [ ] Backward compatibility maintained with v1 features

---

*This PRD is specifically designed for Trax v2, focusing on high performance and speaker diarization as the core differentiators while maintaining the simplicity and determinism of the single-node architecture.*