Trax v2.0 PRD: High-Performance Transcription with Speaker Diarization

🎯 Product Vision

"We're building a high-performance personal transcription tool that delivers exceptional accuracy (99.5%+) and robust speaker diarization, enabling researchers to transform complex multi-speaker content into structured, searchable text with speaker identification."

🏗️ System Architecture Overview

Core Components

  • Data Layer: PostgreSQL with JSONB, SQLAlchemy registry pattern (inherited from v1)
  • Business Logic: Protocol-based services, async/await throughout, enhanced with multi-pass pipeline
  • Interface Layer: CLI-first with Click, batch processing focus
  • Integration Layer: Download-first architecture, curl-based YouTube metadata
  • AI Layer: Multi-stage refinement pipeline, Pyannote.audio diarization, LoRA domain adaptation

System Boundaries

  • What's In Scope: Local media processing, high-accuracy transcription, speaker diarization, domain-specific models, CLI interface
  • What's Out of Scope: Real-time streaming, cloud processing, multi-user support, distributed systems
  • Integration Points: Whisper API, DeepSeek API, Pyannote.audio, FFmpeg, PostgreSQL, YouTube (curl)

👥 User Profile

Primary User: Advanced Personal Researcher

  • Role: Individual researcher processing complex educational content with multiple speakers
  • Content Types: Tech podcasts, academic lectures, panel discussions, interviews, audiobooks
  • Workflow: Batch URL collection → Download → High-accuracy transcription with diarization → Study
  • Goals: 99.5%+ accuracy, speaker identification, fast processing, searchable content
  • Constraints: Local storage, API costs, processing time, single-node architecture

🔧 Functional Requirements

Feature 1: Multi-Pass Transcription Pipeline

Purpose

Achieve 99.5%+ accuracy through intelligent multi-stage processing

User Stories

  • As a researcher, I want ultra-high accuracy transcripts, so that I can rely on the content for detailed analysis
  • As a researcher, I want fast processing despite high accuracy, so that I can process large batches efficiently

Acceptance Criteria

  • Given a media file, When I run trax transcribe --v2 <file>, Then I get a 99.5%+ accuracy transcript in <25 seconds (for a 5-minute file)
  • Given a multi-pass transcript, When I compare to v1, Then accuracy improves by ≥4.5 percentage points
  • Given a transcript with confidence scores, When I review, Then I can identify low-confidence segments

Input Validation Rules

  • File format: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
  • File size: ≤500MB - Error: "File too large, max 500MB"
  • Audio duration: >0.1 seconds - Error: "File too short or silent"

Business Logic Rules

  • Rule 1: First pass uses distil-small.en for speed (10-15 seconds)
  • Rule 2: Second pass uses distil-large-v3 for accuracy refinement
  • Rule 3: Third pass uses DeepSeek for context-aware enhancement
  • Rule 4: Confidence scoring identifies segments needing refinement
  • Rule 5: Parallel processing of independent pipeline stages (see the pipeline sketch after this list)
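
A minimal sketch of how the three passes could be chained, assuming faster-whisper for the local Whisper passes; enhance_with_deepseek, REFINE_THRESHOLD, and the exp(avg_logprob) confidence proxy are illustrative assumptions, not a finalized implementation:

import math

from faster_whisper import WhisperModel

REFINE_THRESHOLD = 0.85  # hypothetical cutoff: mean confidence below this triggers pass 2

def enhance_with_deepseek(text: str) -> str:
    # Placeholder for a DeepSeek API call doing context-aware cleanup (Rule 3).
    return text

def transcribe_multi_pass(audio_path: str) -> dict:
    # Pass 1: fast draft with distil-small.en (Rule 1); int8 keeps memory low.
    model = WhisperModel("distil-small.en", compute_type="int8")
    segments, _ = model.transcribe(audio_path, word_timestamps=True)
    draft = [{"start": s.start, "end": s.end, "text": s.text,
              "confidence": math.exp(s.avg_logprob)}  # rough 0-1 proxy (Rule 4)
             for s in segments]

    # Pass 2: refine with distil-large-v3 when the draft looks uncertain (Rule 2).
    mean_conf = sum(s["confidence"] for s in draft) / max(len(draft), 1)
    if mean_conf < REFINE_THRESHOLD:
        model = WhisperModel("distil-large-v3", compute_type="int8")
        segments, _ = model.transcribe(audio_path, word_timestamps=True)
        draft = [{"start": s.start, "end": s.end, "text": s.text,
                  "confidence": math.exp(s.avg_logprob)} for s in segments]

    # Pass 3: context-aware enhancement of the full text (Rule 3).
    full_text = " ".join(s["text"] for s in draft)
    return {"segments": draft, "text": enhance_with_deepseek(full_text)}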

Error Handling

  • Memory Pressure: Automatically reduce batch size and retry
  • Model Loading Failure: Fall back to v1 pipeline with warning
  • Processing Failure: Save partial results, allow retry from last successful stage

Feature 2: Speaker Diarization with Pyannote.audio

Purpose

Identify and label different speakers in multi-speaker content

User Stories

  • As a researcher, I want speaker identification, so that I can follow conversations and discussions
  • As a researcher, I want accurate speaker labels, so that I can attribute quotes and ideas correctly

Acceptance Criteria

  • Given a multi-speaker file, When I run diarization, Then I get 90%+ speaker identification accuracy
  • Given a diarized transcript, When I view it, Then speaker labels are clearly marked and consistent
  • Given a diarization failure, When I retry, Then the system provides clear error guidance

Input Validation Rules

  • Audio quality: Must have detectable speech - Error: "No speech detected"
  • Speaker count: Must have ≥2 speakers for diarization - Error: "Single speaker detected"
  • Audio duration: ≥30 seconds for reliable diarization - Error: "Audio too short for diarization"

Business Logic Rules

  • Rule 1: Run diarization in parallel with transcription
  • Rule 2: Use Pyannote.audio with optimized parameters for speed
  • Rule 3: Cache speaker embedding model to avoid reloading
  • Rule 4: Merge diarization results with transcript timestamps (sketched after this list)
  • Rule 5: Provide speaker count estimation before processing
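
A sketch of Rules 1 and 4, assuming pyannote.audio's published speaker-diarization pipeline and an HF_TOKEN environment variable; transcribe_multi_pass is the hypothetical single-file entry point from the Feature 1 sketch:

import os
from concurrent.futures import ThreadPoolExecutor

from pyannote.audio import Pipeline

def diarize(audio_path: str):
    # In practice this pipeline object would be cached by the ModelManager (Rule 3).
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=os.environ["HF_TOKEN"],
    )
    return pipeline(audio_path)

def speaker_for(segment: dict, diarization) -> str:
    # Rule 4: label each transcript segment with the speaker whose turns overlap it most.
    overlaps: dict[str, float] = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(segment["end"], turn.end) - max(segment["start"], turn.start)
        if overlap > 0:
            overlaps[speaker] = overlaps.get(speaker, 0.0) + overlap
    return max(overlaps, key=overlaps.get) if overlaps else "SPEAKER_UNKNOWN"

def transcribe_with_speakers(audio_path: str) -> list[dict]:
    # Rule 1: run transcription and diarization in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe_multi_pass, audio_path)
        diarization_future = pool.submit(diarize, audio_path)
        transcript, diarization = transcript_future.result(), diarization_future.result()
    for seg in transcript["segments"]:
        seg["speaker"] = speaker_for(seg, diarization)
    return transcript["segments"]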

Error Handling

  • Diarization Failure: Continue with transcription only, mark as single speaker
  • Memory Issues: Reduce audio chunk size and retry
  • Model Loading: Provide clear instructions for HuggingFace token setup

Feature 3: Domain-Specific Model Adaptation (LoRA)

Purpose

Improve accuracy for specific content domains using lightweight model adaptation

User Stories

  • As a researcher, I want domain-specific accuracy, so that technical terms and jargon are correctly transcribed
  • As a researcher, I want flexible domain selection, so that I can optimize for different content types

Acceptance Criteria

  • Given a technical podcast, When I use technical domain, Then technical terms are more accurately transcribed
  • Given a medical lecture, When I use medical domain, Then medical terminology is correctly captured
  • Given a domain model, When I switch domains, Then the system loads the appropriate LoRA adapter

Input Validation Rules

  • Domain selection: Must be valid domain (technical, medical, academic, general) - Error: "Invalid domain"
  • LoRA availability: Domain model must be available - Error: "Domain model not available"

Business Logic Rules

  • Rule 1: Load base Whisper model once, swap LoRA adapters as needed
  • Rule 2: Cache LoRA adapters in memory for fast switching (see the sketch after this list)
  • Rule 3: Provide domain auto-detection based on content analysis
  • Rule 4: Allow custom domain training with user-provided data
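
A sketch of Rules 1 and 2 using Hugging Face transformers plus peft; the base checkpoint and the adapters/whisper-lora-<domain> paths are illustrative assumptions:

from functools import lru_cache

from peft import PeftModel
from transformers import WhisperForConditionalGeneration

# Rule 1: one base model loaded once; adapters are swapped on top of it.
BASE = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v3")

@lru_cache(maxsize=4)  # Rule 2: keep recently used adapters in memory
def model_for_domain(domain: str):
    if domain == "general":
        return BASE  # no adapter needed for the general domain
    # A production version might instead hot-swap via PeftModel.load_adapter /
    # set_adapter on a single wrapper; this sketch wraps per domain.
    return PeftModel.from_pretrained(BASE, f"adapters/whisper-lora-{domain}")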

Error Handling

  • LoRA Loading Failure: Fall back to base model with warning
  • Domain Detection Failure: Use general domain as default
  • Memory Issues: Unload unused adapters automatically

Feature 4: Enhanced CLI Interface

Purpose

Provide an enhanced command-line interface with improved batch processing and progress reporting

User Stories

  • As a researcher, I want enhanced CLI progress reporting, so that I can monitor long-running jobs effectively
  • As a researcher, I want improved batch processing, so that I can efficiently process multiple files

Acceptance Criteria

  • Given a batch of files, When I run batch processing, Then I see real-time progress for each file
  • Given a processing job, When I monitor progress, Then I see detailed stage information and performance metrics
  • Given a completed transcript, When I view it, Then I can see speaker labels and confidence scores in the output

Input Validation Rules

  • File processing: Max 500MB per file - Error: "File too large"
  • File types: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
  • Batch size: Max 50 files per batch - Error: "Batch too large"

Business Logic Rules

  • Rule 1: Real-time progress updates via CLI output
  • Rule 2: Batch processing with configurable concurrency (sketched after this list)
  • Rule 3: Detailed logging with configurable verbosity
  • Rule 4: Processing jobs use same pipeline as single files
  • Rule 5: Transcript output includes speaker diarization information
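
A sketch of the batch command surface, assuming Click and asyncio; the flag names mirror the CLI flows later in this document, and process_file is a placeholder for the single-file v2 pipeline (Rule 4):

import asyncio
from pathlib import Path

import click

def process_file(path: Path, diarize: bool) -> None:
    # Placeholder for the per-file v2 pipeline (transcribe + optional diarization).
    ...

async def run_batch(files: list[Path], workers: int, diarize: bool) -> None:
    semaphore = asyncio.Semaphore(workers)  # Rule 2: configurable concurrency

    async def process(path: Path) -> None:
        async with semaphore:
            await asyncio.to_thread(process_file, path, diarize)
            click.echo(f"done: {path.name}")  # Rule 1: per-file progress line

    results = await asyncio.gather(*(process(f) for f in files), return_exceptions=True)
    failures = [r for r in results if isinstance(r, Exception)]
    click.echo(f"{len(files) - len(failures)} succeeded, {len(failures)} failed")

@click.command("batch")
@click.option("--v2", is_flag=True, help="Use the multi-pass v2 pipeline")
@click.option("--diarize", is_flag=True, help="Run speaker diarization")
@click.option("--workers", default=4, show_default=True, help="Concurrent files")
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
def batch(v2: bool, diarize: bool, workers: int, directory: str):
    files = sorted(Path(directory).glob("*.mp3"))  # plus mp4/wav/m4a/webm in practice
    if len(files) > 50:
        raise click.UsageError("Batch too large")  # max 50 files per batch
    asyncio.run(run_batch(files, workers, diarize))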

Error Handling

  • Processing Failure: Clear error message with retry guidance
  • Batch Failure: Continue with remaining files, report failures
  • Memory Issues: Automatic batch size reduction with warning

💻 CLI Interface Flows

Flow 1: High-Performance Transcription

Command: Single File Processing

# Basic v2 transcription
trax transcribe --v2 audio.mp3

# With diarization
trax transcribe --v2 --diarize audio.mp3

# With domain-specific model
trax transcribe --v2 --domain technical audio.mp3

# With custom quality threshold
trax transcribe --v2 --accuracy 0.995 audio.mp3

Progress Reporting

  • Real-time progress: Stage-by-stage progress with time estimates
  • Performance metrics: CPU usage, memory usage, processing speed
  • Quality indicators: Confidence scores, accuracy estimates
  • Error reporting: Clear error messages with retry guidance

Flow 2: Batch Processing with Diarization

Command: Batch Processing

# Process directory of files
trax batch --v2 --diarize /path/to/media/files/

# With parallel processing
trax batch --v2 --workers 4 --diarize /path/to/media/files/

# With domain detection
trax batch --v2 --auto-domain --diarize /path/to/media/files/

Batch Progress Reporting

  • Overall progress: Total batch completion percentage
  • Current file: Currently processing file with stage
  • Diarization status: Speaker count, processing stage
  • Queue status: Files remaining, completed, failed
  • Performance metrics: Average processing time, accuracy

🔄 Data Flow & State Management

Enhanced Data Models (PostgreSQL Schema)

Transcript (Enhanced for v2)

{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2, v2+)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "diarization_content": "JSONB (optional, Pyannote output)",
  "merged_content": "JSONB (required, final transcript with speakers)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "domain_used": "string (optional, technical, medical, etc.)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "confidence_scores": "JSONB (optional, per-segment confidence)",
  "speaker_count": "integer (optional, number of speakers detected)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "diarized_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
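
A sketch of how this schema could map onto the v1 SQLAlchemy conventions (column subset only; field names and types follow the schema above, everything else is an assumption):

import uuid
from datetime import datetime

from sqlalchemy import Float, ForeignKey, Integer, String, Text, func
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Transcript(Base):
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    media_file_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("media_files.id"))  # v1 table
    pipeline_version: Mapped[str] = mapped_column(String(8))   # v1, v2, v2+
    raw_content: Mapped[dict] = mapped_column(JSONB)           # Whisper output
    merged_content: Mapped[dict] = mapped_column(JSONB)        # final transcript with speakers
    text_content: Mapped[str] = mapped_column(Text)            # plain text for search
    processing_time_ms: Mapped[int] = mapped_column(Integer)
    accuracy_estimate: Mapped[float | None] = mapped_column(Float, nullable=True)
    speaker_count: Mapped[int | None] = mapped_column(Integer, nullable=True)
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())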

SpeakerProfile (New for v2)

{
  "id": "UUID (required, primary key)",
  "transcript_id": "UUID (required, foreign key)",
  "speaker_id": "string (required, speaker label)",
  "embedding_vector": "JSONB (required, speaker embedding)",
  "speech_segments": "JSONB (required, time segments)",
  "total_duration": "float (required, seconds)",
  "word_count": "integer (required)",
  "confidence_score": "float (optional, 0.0-1.0)",
  "created_at": "timestamp (auto-generated)"
}

ProcessingJob (New for v2)

{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_config": "JSONB (required, processing parameters)",
  "status": "enum (queued, processing, completed, failed)",
  "current_stage": "string (optional, current pipeline stage)",
  "progress_percentage": "float (optional, 0.0-100.0)",
  "error_message": "text (optional)",
  "started_at": "timestamp (optional)",
  "completed_at": "timestamp (optional)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}

Enhanced State Transitions

ProcessingJob State Machine

[queued] → [processing] → [transcribing] → [enhancing] → [diarizing] → [merging] → [completed]
[queued] → [processing] → [failed] → [retry] → [processing]

Transcript State Machine (Enhanced)

[processing] → [transcribed] → [enhanced] → [diarized] → [merged] → [completed]
[processing] → [transcribed] → [enhanced] → [completed] (no diarization)
[processing] → [failed] → [retry] → [processing]
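
A minimal sketch of the ProcessingJob transitions as a guard table, so illegal jumps (e.g. queued → completed) fail fast; the enhancing → merging edge is an assumption covering the no-diarization path shown for transcripts:

ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "queued": {"processing"},
    "processing": {"transcribing", "failed"},
    "transcribing": {"enhancing", "failed"},
    "enhancing": {"diarizing", "merging", "failed"},  # merging directly if no diarization
    "diarizing": {"merging", "failed"},
    "merging": {"completed", "failed"},
    "failed": {"retry"},
    "retry": {"processing"},
}

def advance(current: str, new: str) -> str:
    # Reject any transition not drawn in the state machines above.
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} → {new}")
    return new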

Data Validation Rules

  • Rule 1: Processing time must be >0 and <1800 seconds (30 minutes)
  • Rule 2: Accuracy estimate must be between 0.0 and 1.0
  • Rule 3: Speaker count must be ≥1 if diarization is enabled
  • Rule 4: Confidence scores must be between 0.0 and 1.0
  • Rule 5: Domain must be valid if specified (all five rules are sketched below)
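
The five rules, sketched as a single validation pass; field names follow the Transcript schema, while diarization_enabled and the flat list shape of confidence_scores are assumptions:

VALID_DOMAINS = {"technical", "medical", "academic", "general"}  # Feature 3's list

def validate_transcript_record(rec: dict) -> list[str]:
    errors: list[str] = []
    if not 0 < rec["processing_time_ms"] < 1_800_000:               # Rule 1 (30-minute cap)
        errors.append("processing_time_ms out of range")
    acc = rec.get("accuracy_estimate")
    if acc is not None and not 0.0 <= acc <= 1.0:                   # Rule 2
        errors.append("accuracy_estimate must be in [0.0, 1.0]")
    if rec.get("diarization_enabled") and rec.get("speaker_count", 0) < 1:  # Rule 3
        errors.append("speaker_count must be >= 1 when diarization is enabled")
    if any(not 0.0 <= c <= 1.0 for c in rec.get("confidence_scores", [])):  # Rule 4
        errors.append("confidence scores must be in [0.0, 1.0]")
    if rec.get("domain_used") and rec["domain_used"] not in VALID_DOMAINS:  # Rule 5
        errors.append("invalid domain")
    return errors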

🧪 Testing Requirements

Unit Tests

  • test_multi_pass_pipeline: Test all pipeline stages and transitions
  • test_diarization_service: Test Pyannote.audio integration
  • test_lora_adapter_manager: Test domain-specific model loading
  • test_confidence_scoring: Test confidence calculation and thresholding
  • test_cli_interface: Test Click command handlers and output formatting
  • test_parallel_processing: Test concurrent pipeline execution

Integration Tests

  • test_pipeline_v2_complete: End-to-end v2 transcription with diarization
  • test_domain_adaptation: Test LoRA adapter switching and accuracy
  • test_batch_processing_v2: Process 10 files with v2 pipeline
  • test_cli_batch_processing: Test CLI batch processing with multiple files
  • test_performance_targets: Verify <25 second processing time for 5-minute files

Edge Cases

  • Single speaker in multi-speaker file: Should handle gracefully
  • Poor audio quality with diarization: Should provide clear warnings
  • Memory pressure during processing: Should handle gracefully
  • LoRA adapter loading failure: Should fall back to base model
  • CLI progress reporting: Should show real-time updates
  • Large files with diarization: Should chunk appropriately

🚀 Implementation Phases

Phase 1: Multi-Pass Pipeline Foundation (Weeks 1-2)

Goal: Implement core multi-pass transcription pipeline

  • Enhanced task system with pipeline stages - Support complex multi-stage workflows
  • ModelManager singleton for model caching - Prevent memory duplication
  • Multi-pass implementation (fast + refinement + enhancement) - Achieve 99.5%+ accuracy
  • Confidence scoring system - Identify low-confidence segments
  • Performance optimization (8-bit quantization) - Reduce memory usage by 50% (see the sketch after this list)
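
A sketch covering the ModelManager singleton and the 8-bit target together, assuming faster-whisper/CTranslate2, whose int8 compute type roughly halves weight memory:

from faster_whisper import WhisperModel

class ModelManager:
    # Process-wide cache so each Whisper variant is loaded exactly once.
    _models: dict[str, WhisperModel] = {}

    @classmethod
    def get(cls, name: str) -> WhisperModel:
        if name not in cls._models:
            # int8 (8-bit) quantization keeps peak memory within the 8GB budget.
            cls._models[name] = WhisperModel(name, compute_type="int8")
        return cls._models[name]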

Phase 2: Speaker Diarization Integration (Weeks 3-4)

Goal: Integrate Pyannote.audio for speaker identification

  • Pyannote.audio integration - 90%+ speaker identification accuracy
  • Parallel diarization and transcription - Minimize total processing time
  • Speaker embedding caching - Avoid model reloading
  • Diarization-transcript merging - Combine timestamps and speaker labels
  • Speaker profile storage - Track speakers across multiple files

Phase 3: Domain Adaptation and LoRA (Weeks 5-6)

Goal: Implement domain-specific model adaptation

  • LoRA adapter system - Lightweight domain-specific models
  • Domain auto-detection - Automatic content analysis
  • Pre-trained domain models - Technical, medical, academic domains
  • Custom domain training - User-provided data support
  • Domain switching optimization - Fast adapter loading

Phase 4: Enhanced CLI Interface (Weeks 7-8)

Goal: Develop enhanced CLI interface with improved batch processing

  • Enhanced progress reporting - Real-time stage updates
  • Batch processing improvements - Configurable concurrency
  • Detailed logging system - Configurable verbosity levels
  • Performance monitoring - CPU/memory usage display
  • Error handling improvements - Clear retry guidance

Phase 5: Performance Optimization and Polish (Weeks 9-10)

Goal: Achieve performance targets and final polish

  • Performance benchmarking - Verify <25 second processing time
  • Memory optimization - Stay under 8GB peak usage
  • Error handling refinement - Comprehensive error recovery
  • Documentation and user guides - Complete documentation
  • Final testing and validation - End-to-end testing

🔒 Security & Constraints

Security Requirements

  • API Key Management: Secure storage of all API keys (Whisper, DeepSeek, HuggingFace)
  • Local Access Only: CLI interface only, no network exposure
  • File Access: Local file system access only
  • Data Protection: Encrypted storage for sensitive transcripts
  • Input Sanitization: Validate all file paths, URLs, and user inputs

Performance Constraints

  • Response Time: <25 seconds for 5-minute audio (v2)
  • Accuracy Target: 99.5%+ transcription accuracy
  • Diarization Accuracy: 90%+ speaker identification accuracy
  • Memory Usage: <8GB peak memory usage (M3 MacBook 16GB)
  • Parallel Workers: 8 workers for optimal M3 performance
  • Model Loading: <5 seconds for model switching

Technical Constraints

  • File Formats: mp3, mp4, wav, m4a, webm only
  • File Size: Maximum 500MB per file
  • Audio Duration: Maximum 2 hours per file
  • Network: Download-first, no streaming processing
  • Storage: Local storage required, no cloud-only processing
  • Single Node: No distributed processing, single-machine architecture
  • YouTube: Curl-based metadata extraction only

Definition of Done

Feature Complete

  • All acceptance criteria met with real test files
  • 99.5%+ accuracy achieved on test dataset
  • 90%+ speaker identification accuracy achieved
  • <25 second processing time for 5-minute files
  • Unit tests passing with >80% coverage
  • Integration tests passing with actual services
  • Code review completed
  • Documentation updated in rule files
  • Performance benchmarks met

Ready for Deployment

  • Performance targets achieved (speed, accuracy, memory)
  • Security review completed
  • Error handling tested with edge cases
  • User acceptance testing with real files
  • CLI interface tested and functional
  • Rollback plan prepared for v2 deployment
  • Monitoring and logging configured

Trax v2-Specific Criteria

  • Multi-pass pipeline delivers 99.5%+ accuracy
  • Speaker diarization works reliably across content types
  • Domain adaptation improves accuracy for specialized content
  • CLI interface provides superior user experience
  • Performance targets met without distributed architecture
  • Memory usage optimized for single-node deployment
  • Backward compatibility maintained with v1 features

This PRD is specifically designed for Trax v2, focusing on high performance and speaker diarization as the core differentiators while maintaining the simplicity and determinism of the single-node architecture.