Trax v2.0 PRD: High-Performance Transcription with Speaker Diarization
🎯 Product Vision
"We're building a high-performance personal transcription tool that delivers exceptional accuracy (99.5%+) and robust speaker diarization, enabling researchers to transform complex multi-speaker content into structured, searchable text with speaker identification."
🏗️ System Architecture Overview
Core Components
- Data Layer: PostgreSQL with JSONB, SQLAlchemy registry pattern (inherited from v1)
- Business Logic: Protocol-based services, async/await throughout, enhanced with multi-pass pipeline
- Interface Layer: CLI-first with Click, batch processing focus
- Integration Layer: Download-first architecture, curl-based YouTube metadata
- AI Layer: Multi-stage refinement pipeline, Pyannote.audio diarization, LoRA domain adaptation
System Boundaries
- What's In Scope: Local media processing, high-accuracy transcription, speaker diarization, domain-specific models, CLI interface
- What's Out of Scope: Real-time streaming, cloud processing, multi-user support, distributed systems
- Integration Points: Whisper API, DeepSeek API, Pyannote.audio, FFmpeg, PostgreSQL, YouTube (curl)
👥 User Profile
Primary User: Advanced Personal Researcher
- Role: Individual researcher processing complex educational content with multiple speakers
- Content Types: Tech podcasts, academic lectures, panel discussions, interviews, audiobooks
- Workflow: Batch URL collection → Download → High-accuracy transcription with diarization → Study
- Goals: 99.5%+ accuracy, speaker identification, fast processing, searchable content
- Constraints: Local storage, API costs, processing time, single-node architecture
🔧 Functional Requirements
Feature 1: Multi-Pass Transcription Pipeline
Purpose
Achieve 99.5%+ accuracy through intelligent multi-stage processing
User Stories
- As a researcher, I want ultra-high accuracy transcripts, so that I can rely on the content for detailed analysis
- As a researcher, I want fast processing despite high accuracy, so that I can process large batches efficiently
Acceptance Criteria
- Given a media file, When I run trax transcribe --v2 <file>, Then I get a 99.5%+ accuracy transcript in <25 seconds
- Given a multi-pass transcript, When I compare it to v1 output, Then accuracy improves by ≥4.5 percentage points
- Given a transcript with confidence scores, When I review, Then I can identify low-confidence segments
Input Validation Rules
- File format: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- File size: ≤500MB - Error: "File too large, max 500MB"
- Audio duration: >0.1 seconds - Error: "File too short or silent"
Business Logic Rules
- Rule 1: First pass uses distil-small.en for speed (10-15 seconds)
- Rule 2: Second pass uses distil-large-v3 for accuracy refinement
- Rule 3: Third pass uses DeepSeek for context-aware enhancement
- Rule 4: Confidence scoring identifies segments needing refinement
- Rule 5: Parallel processing of independent pipeline stages
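A minimal sketch of how Rules 1-4 might compose, assuming faster-whisper exposes the distilled checkpoints named above. The confidence floor and the decision to re-run the whole file (rather than refining individual segments) are illustrative simplifications, and the DeepSeek pass is stubbed out:

```python
# Sketch only: faster-whisper is assumed; CONFIDENCE_FLOOR is an illustrative
# threshold, and the DeepSeek enhancement pass is left as a comment stub.
from faster_whisper import WhisperModel

CONFIDENCE_FLOOR = -0.35  # avg_logprob below this flags a segment (Rule 4)

def to_segments(result) -> list[dict]:
    segments, _info = result  # transcribe() returns (segment generator, info)
    return [{"start": s.start, "end": s.end, "text": s.text,
             "confidence": s.avg_logprob} for s in segments]

def multi_pass_transcribe(path: str) -> list[dict]:
    # Pass 1: fast draft with the small distilled model (Rule 1).
    draft = to_segments(
        WhisperModel("distil-small.en", compute_type="int8").transcribe(path))
    # Pass 2: refine with the large model only when confidence demands it (Rule 2).
    if any(s["confidence"] < CONFIDENCE_FLOOR for s in draft):
        draft = to_segments(
            WhisperModel("distil-large-v3", compute_type="int8").transcribe(path))
    # Pass 3 (Rule 3): DeepSeek context-aware enhancement would rewrite the
    # low-confidence text here; omitted because the API call is project-specific.
    return draft
```

In the real pipeline these models would come from the cached ModelManager described in Phase 1 rather than being constructed per call.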
Error Handling
- Memory Pressure: Automatically reduce batch size and retry
- Model Loading Failure: Fall back to v1 pipeline with warning
- Processing Failure: Save partial results, allow retry from last successful stage
Feature 2: Speaker Diarization with Pyannote.audio
Purpose
Identify and label different speakers in multi-speaker content
User Stories
- As a researcher, I want speaker identification, so that I can follow conversations and discussions
- As a researcher, I want accurate speaker labels, so that I can attribute quotes and ideas correctly
Acceptance Criteria
- Given a multi-speaker file, When I run diarization, Then I get 90%+ speaker identification accuracy
- Given a diarized transcript, When I view it, Then speaker labels are clearly marked and consistent
- Given a diarization failure, When I retry, Then the system provides clear error guidance
Input Validation Rules
- Audio quality: Must have detectable speech - Error: "No speech detected"
- Speaker count: Must have ≥2 speakers for diarization - Error: "Single speaker detected"
- Audio duration: ≥30 seconds for reliable diarization - Error: "Audio too short for diarization"
Business Logic Rules
- Rule 1: Run diarization in parallel with transcription
- Rule 2: Use Pyannote.audio with optimized parameters for speed
- Rule 3: Cache speaker embedding model to avoid reloading
- Rule 4: Merge diarization results with transcript timestamps
- Rule 5: Provide speaker count estimation before processing
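A sketch of Rules 2 and 4, assuming pyannote.audio 3.x and a HuggingFace token in HF_TOKEN with access to the gated diarization pipeline; the overlap-voting merge is one simple way to implement Rule 4:

```python
# Sketch only: pyannote.audio 3.x assumed; HF_TOKEN must grant access to the
# gated pipeline (see Error Handling below for token setup guidance).
import os
from pyannote.audio import Pipeline

def diarize(path: str) -> list[tuple[float, float, str]]:
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=os.environ["HF_TOKEN"])
    annotation = pipeline(path)
    # Flatten the annotation into (start, end, speaker) turns.
    return [(turn.start, turn.end, speaker)
            for turn, _, speaker in annotation.itertracks(yield_label=True)]

def assign_speakers(segments: list[dict], turns) -> list[dict]:
    """Rule 4: label each transcript segment with the speaker whose
    turns overlap it the most."""
    for seg in segments:
        overlaps: dict[str, float] = {}
        for start, end, speaker in turns:
            dur = min(seg["end"], end) - max(seg["start"], start)
            if dur > 0:
                overlaps[speaker] = overlaps.get(speaker, 0.0) + dur
        seg["speaker"] = max(overlaps, key=overlaps.get) if overlaps else None
    return segments
```

In the full pipeline, diarize() would run concurrently with transcription (Rule 1) and the loaded pipeline would be cached between files (Rule 3).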
Error Handling
- Diarization Failure: Continue with transcription only, mark as single speaker
- Memory Issues: Reduce audio chunk size and retry
- Model Loading: Provide clear instructions for HuggingFace token setup
Feature 3: Domain-Specific Model Adaptation (LoRA)
Purpose
Improve accuracy for specific content domains using lightweight model adaptation
User Stories
- As a researcher, I want domain-specific accuracy, so that technical terms and jargon are correctly transcribed
- As a researcher, I want flexible domain selection, so that I can optimize for different content types
Acceptance Criteria
- Given a technical podcast, When I use technical domain, Then technical terms are more accurately transcribed
- Given a medical lecture, When I use medical domain, Then medical terminology is correctly captured
- Given a domain model, When I switch domains, Then the system loads the appropriate LoRA adapter
Input Validation Rules
- Domain selection: Must be valid domain (technical, medical, academic, general) - Error: "Invalid domain"
- LoRA availability: Domain model must be available - Error: "Domain model not available"
Business Logic Rules
- Rule 1: Load base Whisper model once, swap LoRA adapters as needed
- Rule 2: Cache LoRA adapters in memory for fast switching
- Rule 3: Provide domain auto-detection based on content analysis
- Rule 4: Allow custom domain training with user-provided data
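One way Rules 1-2 could look with HuggingFace PEFT, assuming the adapters were trained as LoRA modules and stored at the hypothetical local paths below:

```python
# Sketch only: transformers + peft assumed; ADAPTERS paths are hypothetical.
from transformers import WhisperForConditionalGeneration
from peft import PeftModel

ADAPTERS = {
    "technical": "adapters/technical",
    "medical": "adapters/medical",
    "academic": "adapters/academic",
}

class LoRAManager:
    def __init__(self, base_id: str = "openai/whisper-large-v3"):
        # Rule 1: load the base Whisper model exactly once.
        self._base = WhisperForConditionalGeneration.from_pretrained(base_id)
        self._model: PeftModel | None = None
        self._loaded: set[str] = set()

    def use_domain(self, domain: str):
        if domain != "general" and domain not in ADAPTERS:
            raise ValueError("Invalid domain")  # mirrors the validation rule
        if domain == "general":
            if self._model is not None:
                self._model.disable_adapter_layers()  # base behaviour
            return self._model or self._base
        if self._model is None:
            self._model = PeftModel.from_pretrained(
                self._base, ADAPTERS[domain], adapter_name=domain)
            self._loaded.add(domain)
        elif domain not in self._loaded:
            # Rule 2: keep previously loaded adapters cached in memory.
            self._model.load_adapter(ADAPTERS[domain], adapter_name=domain)
            self._loaded.add(domain)
        self._model.enable_adapter_layers()
        self._model.set_adapter(domain)
        return self._model
```

Evicting unused adapters under memory pressure (see Error Handling below) would hang off the cached _loaded set.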
Error Handling
- LoRA Loading Failure: Fall back to base model with warning
- Domain Detection Failure: Use general domain as default
- Memory Issues: Unload unused adapters automatically
Feature 4: Enhanced CLI Interface
Purpose
Provide an enhanced command-line interface with improved batch processing and progress reporting
User Stories
- As a researcher, I want enhanced CLI progress reporting, so that I can monitor long-running jobs effectively
- As a researcher, I want improved batch processing, so that I can efficiently process multiple files
Acceptance Criteria
- Given a batch of files, When I run batch processing, Then I see real-time progress for each file
- Given a processing job, When I monitor progress, Then I see detailed stage information and performance metrics
- Given a completed transcript, When I view it, Then I can see speaker labels and confidence scores in the output
Input Validation Rules
- File processing: Max 500MB per file - Error: "File too large"
- File types: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- Batch size: Max 50 files per batch - Error: "Batch too large"
Business Logic Rules
- Rule 1: Real-time progress updates via CLI output
- Rule 2: Batch processing with configurable concurrency
- Rule 3: Detailed logging with configurable verbosity
- Rule 4: Processing jobs use same pipeline as single files
- Rule 5: Transcript output includes speaker diarization information
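A minimal sketch of Rules 1-2, with process_file standing in for the project's per-file v2 pipeline:

```python
# Sketch only: process_file is a stub for the real transcribe/diarize pipeline.
import asyncio

async def process_file(path: str) -> None:
    await asyncio.sleep(0)  # stand-in for the actual v2 pipeline stages

async def run_batch(paths: list[str], workers: int = 4) -> list[str]:
    sem = asyncio.Semaphore(workers)  # configurable concurrency (Rule 2)
    failed: list[str] = []

    async def worker(path: str) -> None:
        async with sem:
            try:
                await process_file(path)
                print(f"done  {path}")    # Rule 1: real-time CLI progress
            except Exception as exc:      # batch continues past failures
                failed.append(path)
                print(f"FAIL  {path}: {exc}")

    await asyncio.gather(*(worker(p) for p in paths))
    return failed  # reported at the end, per the Error Handling rules
```

Failures are collected rather than aborting the batch, matching the batch failure rule below.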
Error Handling
- Processing Failure: Clear error message with retry guidance
- Batch Failure: Continue with remaining files, report failures
- Memory Issues: Automatic batch size reduction with warning
💻 CLI Interface Flows
Flow 1: High-Performance Transcription
Command: Single File Processing
```bash
# Basic v2 transcription
trax transcribe --v2 audio.mp3

# With diarization
trax transcribe --v2 --diarize audio.mp3

# With domain-specific model
trax transcribe --v2 --domain technical audio.mp3

# With custom quality threshold
trax transcribe --v2 --accuracy 0.995 audio.mp3
```
Progress Reporting
- Real-time progress: Stage-by-stage progress with time estimates
- Performance metrics: CPU usage, memory usage, processing speed
- Quality indicators: Confidence scores, accuracy estimates
- Error reporting: Clear error messages with retry guidance
Flow 2: Batch Processing with Diarization
Command: Batch Processing
```bash
# Process directory of files
trax batch --v2 --diarize /path/to/media/files/

# With parallel processing
trax batch --v2 --workers 4 --diarize /path/to/media/files/

# With domain detection
trax batch --v2 --auto-domain --diarize /path/to/media/files/
```
Batch Progress Reporting
- Overall progress: Total batch completion percentage
- Current file: Currently processing file with stage
- Diarization status: Speaker count, processing stage
- Queue status: Files remaining, completed, failed
- Performance metrics: Average processing time, accuracy
🔄 Data Flow & State Management
Enhanced Data Models (PostgreSQL Schema)
Transcript (Enhanced for v2)
```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2, v2+)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "diarization_content": "JSONB (optional, Pyannote output)",
  "merged_content": "JSONB (required, final transcript with speakers)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "domain_used": "string (optional, technical, medical, etc.)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "confidence_scores": "JSONB (optional, per-segment confidence)",
  "speaker_count": "integer (optional, number of speakers detected)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "diarized_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```
SpeakerProfile (New for v2)
```json
{
  "id": "UUID (required, primary key)",
  "transcript_id": "UUID (required, foreign key)",
  "speaker_id": "string (required, speaker label)",
  "embedding_vector": "JSONB (required, speaker embedding)",
  "speech_segments": "JSONB (required, time segments)",
  "total_duration": "float (required, seconds)",
  "word_count": "integer (required)",
  "confidence_score": "float (optional, 0.0-1.0)",
  "created_at": "timestamp (auto-generated)"
}
```
ProcessingJob (New for v2)
```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_config": "JSONB (required, processing parameters)",
  "status": "enum (queued, processing, completed, failed)",
  "current_stage": "string (optional, current pipeline stage)",
  "progress_percentage": "float (optional, 0.0-100.0)",
  "error_message": "text (optional)",
  "started_at": "timestamp (optional)",
  "completed_at": "timestamp (optional)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```
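As one example of how these schemas map onto the v1-inherited SQLAlchemy registry pattern, a trimmed sketch of the ProcessingJob model (timestamp columns beyond created_at omitted for brevity; table and column names are assumptions consistent with the schema above):

```python
# Sketch only: SQLAlchemy 2.x registry pattern assumed, as inherited from v1.
import enum
import uuid
from datetime import datetime
from sqlalchemy import Enum, Float, ForeignKey, String, Text, func
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column, registry

mapper_registry = registry()

class JobStatus(enum.Enum):
    queued = "queued"
    processing = "processing"
    completed = "completed"
    failed = "failed"

@mapper_registry.mapped
class ProcessingJob:
    __tablename__ = "processing_jobs"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    media_file_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("media_files.id"), nullable=False)
    pipeline_config: Mapped[dict] = mapped_column(JSONB, nullable=False)
    status: Mapped[JobStatus] = mapped_column(
        Enum(JobStatus), nullable=False, default=JobStatus.queued)
    current_stage: Mapped[str | None] = mapped_column(String)
    progress_percentage: Mapped[float | None] = mapped_column(Float)
    error_message: Mapped[str | None] = mapped_column(Text)
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
```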
Enhanced State Transitions
ProcessingJob State Machine
[queued] → [processing] → [transcribing] → [enhancing] → [diarizing] → [merging] → [completed]
[queued] → [processing] → [failed] → [retry] → [processing]
Transcript State Machine (Enhanced)
[processing] → [transcribed] → [enhanced] → [diarized] → [merged] → [completed]
[processing] → [transcribed] → [enhanced] → [completed] (no diarization)
[processing] → [failed] → [retry] → [processing]
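A dict-based transition guard is enough to enforce these machines on a single node. This hypothetical helper covers the job machine, with enhancing → merging added for the no-diarization path shown in the transcript machine:

```python
# Sketch only: stage names copied from the diagrams above; the
# enhancing -> merging edge covers the no-diarization path.
ALLOWED = {
    "queued": {"processing"},
    "processing": {"transcribing", "failed"},
    "transcribing": {"enhancing", "failed"},
    "enhancing": {"diarizing", "merging", "failed"},
    "diarizing": {"merging", "failed"},
    "merging": {"completed", "failed"},
    "failed": {"retry"},
    "retry": {"processing"},
}

def transition(current: str, new: str) -> str:
    # Reject any edge not present in the state machine.
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```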
Data Validation Rules
- Rule 1: Processing time must be >0 and <1800 seconds (30 minutes)
- Rule 2: Accuracy estimate must be between 0.0 and 1.0
- Rule 3: Speaker count must be ≥1 if diarization is enabled
- Rule 4: Confidence scores must be between 0.0 and 1.0
- Rule 5: Domain must be valid if specified
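A sketch of Rules 1-5 as a single validator (hypothetical helper; note that Rule 1's 30-minute bound applies to the processing_time_ms field, hence the millisecond constant):

```python
# Sketch only: VALID_DOMAINS mirrors the domain list in Feature 3.
VALID_DOMAINS = {"technical", "medical", "academic", "general"}

def validate_transcript(t: dict, diarization_enabled: bool) -> list[str]:
    errors: list[str] = []
    if not 0 < t["processing_time_ms"] < 1_800_000:            # Rule 1 (30 min)
        errors.append("processing_time_ms out of range")
    est = t.get("accuracy_estimate")
    if est is not None and not 0.0 <= est <= 1.0:              # Rule 2
        errors.append("accuracy_estimate out of range")
    if diarization_enabled and t.get("speaker_count", 0) < 1:  # Rule 3
        errors.append("speaker_count must be >= 1")
    for score in (t.get("confidence_scores") or {}).values():  # Rule 4
        if not 0.0 <= score <= 1.0:
            errors.append("confidence score out of range")
            break
    domain = t.get("domain_used")
    if domain is not None and domain not in VALID_DOMAINS:     # Rule 5
        errors.append("invalid domain")
    return errors
```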
🧪 Testing Requirements
Unit Tests
- test_multi_pass_pipeline: Test all pipeline stages and transitions
- test_diarization_service: Test Pyannote.audio integration
- test_lora_adapter_manager: Test domain-specific model loading
- test_confidence_scoring: Test confidence calculation and thresholding
- test_cli_interface: Test CLI commands and progress reporting
- test_parallel_processing: Test concurrent pipeline execution
Integration Tests
- test_pipeline_v2_complete: End-to-end v2 transcription with diarization
- test_domain_adaptation: Test LoRA adapter switching and accuracy
- test_batch_processing_v2: Process 10 files with v2 pipeline
- test_cli_batch_processing: Test CLI batch processing with multiple files
- test_performance_targets: Verify <25 second processing time
Edge Cases
- Single speaker in multi-speaker file: Should handle gracefully
- Poor audio quality with diarization: Should provide clear warnings
- Memory pressure during processing: Should handle gracefully
- LoRA adapter loading failure: Should fall back to base model
- CLI progress reporting: Should show real-time updates
- Large files with diarization: Should chunk appropriately
🚀 Implementation Phases
Phase 1: Multi-Pass Pipeline Foundation (Weeks 1-2)
Goal: Implement core multi-pass transcription pipeline
- Enhanced task system with pipeline stages - Support complex multi-stage workflows
- ModelManager singleton for model caching - Prevent memory duplication
- Multi-pass implementation (fast + refinement + enhancement) - Achieve 99.5%+ accuracy
- Confidence scoring system - Identify low-confidence segments
- Performance optimization (8-bit quantization) - Reduce memory usage by 50%
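A sketch of the ModelManager singleton from this phase, assuming faster-whisper's compute_type="int8" as the 8-bit quantization mechanism; the 50% memory figure is the target, not a guarantee of this snippet:

```python
# Sketch only: faster-whisper assumed; int8 compute type stands in for the
# 8-bit quantization deliverable above.
from faster_whisper import WhisperModel

class ModelManager:
    _instance = None

    def __new__(cls):
        # Classic singleton: one shared cache of loaded models per process.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get(self, name: str) -> WhisperModel:
        # Load each checkpoint once and reuse it (prevents memory duplication).
        if name not in self._models:
            self._models[name] = WhisperModel(name, compute_type="int8")
        return self._models[name]
```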
Phase 2: Speaker Diarization Integration (Weeks 3-4)
Goal: Integrate Pyannote.audio for speaker identification
- Pyannote.audio integration - 90%+ speaker identification accuracy
- Parallel diarization and transcription - Minimize total processing time
- Speaker embedding caching - Avoid model reloading
- Diarization-transcript merging - Combine timestamps and speaker labels
- Speaker profile storage - Track speakers across multiple files
Phase 3: Domain Adaptation and LoRA (Weeks 5-6)
Goal: Implement domain-specific model adaptation
- LoRA adapter system - Lightweight domain-specific models
- Domain auto-detection - Automatic content analysis
- Pre-trained domain models - Technical, medical, academic domains
- Custom domain training - User-provided data support
- Domain switching optimization - Fast adapter loading
Phase 4: Enhanced CLI Interface (Weeks 7-8)
Goal: Develop enhanced CLI interface with improved batch processing
- Enhanced progress reporting - Real-time stage updates
- Batch processing improvements - Configurable concurrency
- Detailed logging system - Configurable verbosity levels
- Performance monitoring - CPU/memory usage display
- Error handling improvements - Clear retry guidance
Phase 5: Performance Optimization and Polish (Weeks 9-10)
Goal: Achieve performance targets and final polish
- Performance benchmarking - Verify <25 second processing time
- Memory optimization - Stay under 8GB peak usage
- Error handling refinement - Comprehensive error recovery
- Documentation and user guides - Complete documentation
- Final testing and validation - End-to-end testing
🔒 Security & Constraints
Security Requirements
- API Key Management: Secure storage of all API keys (Whisper, DeepSeek, HuggingFace)
- Local Access Only: CLI interface only, no network exposure
- File Access: Local file system access only
- Data Protection: Encrypted storage for sensitive transcripts
- Input Sanitization: Validate all file paths, URLs, and user inputs
Performance Constraints
- Response Time: <25 seconds for 5-minute audio (v2)
- Accuracy Target: 99.5%+ transcription accuracy
- Diarization Accuracy: 90%+ speaker identification accuracy
- Memory Usage: <8GB peak memory usage (M3 MacBook 16GB)
- Parallel Workers: 8 workers for optimal M3 performance
- Model Loading: <5 seconds for model switching
Technical Constraints
- File Formats: mp3, mp4, wav, m4a, webm only
- File Size: Maximum 500MB per file
- Audio Duration: Maximum 2 hours per file
- Network: Download-first, no streaming processing
- Storage: Local storage required, no cloud-only processing
- Single Node: No distributed processing, single-machine architecture
- YouTube: Curl-based metadata extraction only
✅ Definition of Done
Feature Complete
- All acceptance criteria met with real test files
- 99.5%+ accuracy achieved on test dataset
- 90%+ speaker identification accuracy achieved
- <25 second processing time for 5-minute files
- Unit tests passing with >80% coverage
- Integration tests passing with actual services
- Code review completed
- Documentation updated in rule files
- Performance benchmarks met
Ready for Deployment
- Performance targets achieved (speed, accuracy, memory)
- Security review completed
- Error handling tested with edge cases
- User acceptance testing with real files
- CLI interface tested and functional
- Rollback plan prepared for v2 deployment
- Monitoring and logging configured
Trax v2-Specific Criteria
- Multi-pass pipeline delivers 99.5%+ accuracy
- Speaker diarization works reliably across content types
- Domain adaptation improves accuracy for specialized content
- CLI interface provides superior user experience
- Performance targets met without distributed architecture
- Memory usage optimized for single-node deployment
- Backward compatibility maintained with v1 features
This PRD is specifically designed for Trax v2, focusing on high performance and speaker diarization as the core differentiators while maintaining the simplicity and determinism of the single-node architecture.