# Trax v1-v2 PRD: Personal Research Transcription Tool
## 🎯 Product Vision
*"We're building a personal transcription tool that enables researchers to batch-process tech podcasts, academic lectures, and audiobooks by downloading media locally and running high-accuracy transcription, resulting in searchable, structured text content for study and research."*
## 🏗️ System Architecture Overview
### Core Components
- **Data Layer**: PostgreSQL with JSONB, SQLAlchemy registry pattern
- **Business Logic**: Protocol-based services, async/await throughout
- **Interface Layer**: CLI-first with Click, batch processing focus
- **Integration Layer**: Download-first architecture, curl-based YouTube metadata
### System Boundaries
- **What's In Scope**: Local media processing, YouTube metadata extraction, batch transcription, JSON/TXT export
- **What's Out of Scope**: Real-time streaming, web UI, multi-user support, cloud processing
- **Integration Points**: Whisper API, DeepSeek API, FFmpeg, PostgreSQL, YouTube (curl)
## 👥 User Profile
### Primary User: Personal Researcher
- **Role**: Individual researcher processing educational content
- **Content Types**: Tech podcasts, academic lectures, audiobooks
- **Workflow**: Batch URL collection → Download → Transcribe → Study
- **Goals**: High accuracy transcripts, fast processing, searchable content
- **Constraints**: Local storage, API costs, processing time
## 🔧 Functional Requirements
### Feature 1: YouTube URL Processing
#### Purpose
Extract metadata from YouTube URLs using curl to avoid API complexity
#### User Stories
- **As a** researcher, **I want** to provide YouTube URLs, **so that** I can get video metadata and download links
- **As a** researcher, **I want** to batch process multiple URLs, **so that** I can queue up content for transcription
#### Acceptance Criteria
- [ ] **Given** a YouTube URL, **When** I run `trax youtube <url>`, **Then** I get title, channel, description, and duration
- [ ] **Given** a list of URLs, **When** I run `trax batch-urls <file>`, **Then** all metadata is extracted and stored
- [ ] **Given** invalid URLs, **When** I process them, **Then** clear error messages are shown
#### Input Validation Rules
- **URL format**: Must be valid YouTube URL - Error: "Invalid YouTube URL"
- **URL accessibility**: Must be publicly accessible - Error: "Video not accessible"
- **Rate limiting**: Max 10 URLs per minute - Error: "Rate limit exceeded"
#### Business Logic Rules
- **Rule 1**: Fetch pages with curl using a browser-like user-agent to avoid blocking
- **Rule 2**: Extract metadata with regex patterns targeting the embedded ytInitialPlayerResponse and ytInitialData objects (see the sketch after this list)
- **Rule 3**: Store metadata in PostgreSQL for future reference
- **Rule 4**: Generate unique filenames based on video ID and title
- **Rule 5**: Unescape characters in titles and descriptions using Perl-compatible regex (PCRE) substitutions
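
A minimal sketch of Rules 1-2, assuming curl is shelled out via subprocess; the function name and regex are illustrative, not the shipped implementation, and a production extractor would need to tolerate nested braces inside the embedded JSON:

```python
import json
import re
import subprocess

# Browser-like user-agent per Rule 1; any current desktop UA string works.
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"

def fetch_video_metadata(url: str) -> dict:
    """Illustrative: fetch a watch page with curl and parse ytInitialPlayerResponse."""
    html = subprocess.run(
        ["curl", "-sL", "-A", USER_AGENT, url],
        capture_output=True, text=True, check=True,
    ).stdout
    # Naive non-greedy match; real pages may need a brace-balancing parser.
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\});", html, re.DOTALL)
    if not match:
        raise ValueError("Invalid YouTube URL")  # error message from Feature 1
    details = json.loads(match.group(1))["videoDetails"]
    return {
        "youtube_id": details["videoId"],
        "title": details["title"],
        "channel": details["author"],
        "description": details.get("shortDescription", ""),
        "duration_seconds": int(details["lengthSeconds"]),
    }
```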
#### Error Handling
- **Network Error**: Retry up to 3 times with exponential backoff (see the sketch after this list)
- **Invalid URL**: Skip and continue with remaining URLs
- **Rate Limited**: Wait 60 seconds before retrying
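
A sketch of the retry policy, assuming the async service layer named in the architecture section; `fetch` stands in for any network call:

```python
import asyncio
import random

async def with_retries(fetch, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a coroutine factory with exponential backoff (1s, 2s, 4s) plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error to the caller
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())
```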
### Feature 2: Local Media Transcription (v1)
#### Purpose
High-accuracy transcription of downloaded media files using Whisper
#### User Stories
- **As a** researcher, **I want** to transcribe downloaded media, **so that** I can study the content
- **As a** researcher, **I want** batch processing, **so that** I can process multiple files efficiently
#### Acceptance Criteria
- [ ] **Given** a downloaded media file, **When** I run `trax transcribe <file>`, **Then** I get 95%+ accuracy transcript in <30 seconds
- [ ] **Given** a folder of media files, **When** I run `trax batch <folder>`, **Then** all files are processed with progress tracking
- [ ] **Given** poor audio quality, **When** I transcribe, **Then** I get a quality warning with accuracy estimate
#### Input Validation Rules
- **File format**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **File size**: Must be ≤500MB - Error: "File too large, max 500MB"
- **Audio duration**: >0.1 seconds - Error: "File too short or silent"
#### Business Logic Rules
- **Rule 1**: Always download media before processing (no streaming)
- **Rule 2**: Convert audio to 16kHz mono WAV for Whisper (see the FFmpeg sketch after this list)
- **Rule 3**: Use distil-large-v3 model with M3 optimizations
- **Rule 4**: Store results in PostgreSQL with JSONB for transcripts
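
A sketch of Rule 2, assuming FFmpeg is available on PATH; the flags are standard FFmpeg options for 16kHz mono 16-bit PCM:

```python
import subprocess
from pathlib import Path

def to_whisper_wav(src: Path, dst: Path) -> Path:
    """Convert any supported media file to the 16kHz mono WAV Whisper expects."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ar", "16000",       # resample to 16 kHz
         "-ac", "1",           # downmix to mono
         "-c:a", "pcm_s16le",  # 16-bit PCM WAV
         str(dst)],
        check=True, capture_output=True,
    )
    return dst
```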
#### Error Handling
- **Whisper Memory Error**: Implement chunking for files >10 minutes (see the chunking sketch after this list)
- **Audio Quality**: Warn user if estimated accuracy <80%
- **Processing Failure**: Save partial results, allow retry from last successful stage
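
One possible chunking approach, using FFmpeg's segment muxer; the 10-minute default mirrors the threshold above, and the helper name is illustrative:

```python
import subprocess
from pathlib import Path

def chunk_audio(src: Path, out_dir: Path, chunk_seconds: int = 600) -> list[Path]:
    """Split long audio into ~10-minute chunks so Whisper never sees a huge file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-f", "segment", "-segment_time", str(chunk_seconds),
         "-c", "copy",  # no re-encode; input is already 16kHz mono WAV
         str(out_dir / "chunk_%03d.wav")],
        check=True, capture_output=True,
    )
    return sorted(out_dir.glob("chunk_*.wav"))
```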
### Feature 3: AI Enhancement (v2)
#### Purpose
Improve transcript accuracy and readability using DeepSeek
#### User Stories
- **As a** researcher, **I want** enhanced transcripts, **so that** technical terms and punctuation are correct
- **As a** researcher, **I want** to compare original vs enhanced, **so that** I can verify improvements
#### Acceptance Criteria
- [ ] **Given** a v1 transcript, **When** I run enhancement, **Then** accuracy improves to 99%
- [ ] **Given** an enhanced transcript, **When** I compare to original, **Then** no content is lost
- [ ] **Given** enhancement fails, **When** I retry, **Then** original transcript is preserved
#### Input Validation Rules
- **Transcript format**: Must be valid JSON with segments - Error: "Invalid transcript format"
- **Enhancement model**: Must be available (DeepSeek API key) - Error: "Enhancement service unavailable"
#### Business Logic Rules
- **Rule 1**: Preserve timestamps and speaker markers during enhancement
- **Rule 2**: Use structured enhancement prompts for technical content
- **Rule 3**: Cache enhancement results for 7 days to reduce costs
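
A sketch of the 7-day cache in Rule 3, assuming results are keyed by a content hash; the key format and helper names are illustrative:

```python
import hashlib
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=7)  # Rule 3: cache enhancements for 7 days

def enhancement_cache_key(transcript_text: str, prompt_version: str) -> str:
    """Same transcript + same prompt version -> same key, so retries hit the cache."""
    digest = hashlib.sha256(f"{prompt_version}:{transcript_text}".encode()).hexdigest()
    return f"enhance:{digest}"

def cache_entry_fresh(created_at: datetime) -> bool:
    return datetime.now(timezone.utc) - created_at < CACHE_TTL
```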
#### Error Handling
- **API Rate Limit**: Queue enhancement for later processing
- **Enhancement Failure**: Return original transcript with error flag
- **Content Loss**: Validate that enhancement preserves the original length within ±5%
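
The ±5% content-loss guard can be a simple length check; a sketch:

```python
def enhancement_preserves_content(original: str, enhanced: str,
                                  tolerance: float = 0.05) -> bool:
    """Reject enhancements whose length drifts more than ±5% from the original."""
    if not original:
        return False
    return abs(len(enhanced) - len(original)) / len(original) <= tolerance
```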
## 🖥️ User Interface Flows
### Flow 1: YouTube URL Processing
#### Screen 1: URL Input
- **Purpose**: User provides YouTube URLs for processing
- **Elements**:
  - URL input: Text input - Single URL or file path
  - Batch option: Flag - --batch for multiple URLs
  - Output format: Flag - --json, --txt for metadata export
- **Actions**:
  - Enter: Process URL → Metadata display
  - Ctrl+C: Cancel operation → Return to prompt
- **Validation**: URL format, accessibility, rate limits
- **Error States**: "Invalid URL", "Video not accessible", "Rate limit exceeded"
#### Screen 2: Metadata Display
- **Purpose**: Show extracted metadata and download options
- **Elements**:
  - Video info: Text display - Title, channel, duration
  - Download option: Flag - --download to save media
  - Queue option: Flag - --queue for batch processing
- **Actions**:
  - Download: Save media file → Download progress
  - Queue: Add to batch queue → Queue confirmation
  - Next: Process another URL → Return to input
### Flow 2: Batch Transcription
#### Screen 1: Batch Command Input
- **Purpose**: User initiates batch transcription
- **Elements**:
  - Directory path: Text input - Folder containing media files
  - Pipeline version: Flag - --v1, --v2 (default: v1)
  - Parallel workers: Flag - --workers (default: 8 for M3 MacBook)
  - Quality threshold: Flag - --min-accuracy (default: 80%)
- **Actions**:
  - Enter: Start batch processing → Progress tracking
  - Preview: Show file list → File list display
- **Validation**: Directory exists, contains supported files
#### Screen 2: Batch Progress
- **Purpose**: Show real-time processing status
- **Elements**:
  - Overall progress: Progress bar - Total batch completion
  - Current file: Text display - Currently processing file
  - Quality metrics: Text display - Accuracy estimates
  - Queue status: Text display - Files remaining, completed, failed
- **Actions**:
  - Continue: Automatic progression → Results summary
  - Pause: Suspend processing → Resume option
- **Validation**: Updates every 5 seconds, shows quality warnings
#### Screen 3: Results Summary
- **Purpose**: Show batch processing results
- **Elements**:
  - Success count: Text display - Files processed successfully
  - Failure count: Text display - Files that failed
  - Quality report: Text display - Average accuracy, warnings
  - Export options: Buttons - JSON, TXT, SRT formats
- **Actions**:
  - Export: Save all transcripts → Export confirmation
  - Retry failed: Re-process failed files → Retry progress
  - New batch: Start over → Return to input
## 🔄 Data Flow & State Management
### Data Models (PostgreSQL Schema)
#### YouTubeVideo
```json
{
  "id": "UUID (required, primary key)",
  "youtube_id": "string (required, unique)",
  "title": "string (required)",
  "channel": "string (required)",
  "description": "text (optional)",
  "duration_seconds": "integer (required)",
  "url": "string (required)",
  "metadata_extracted_at": "timestamp (auto-generated)",
  "created_at": "timestamp (auto-generated)"
}
```
#### MediaFile
```json
{
  "id": "UUID (required, primary key)",
  "youtube_video_id": "UUID (optional, foreign key)",
  "local_path": "string (required, file location)",
  "media_type": "string (required, mp3, mp4, wav, etc.)",
  "duration_seconds": "integer (optional)",
  "file_size_bytes": "bigint (required)",
  "download_status": "enum (pending, downloading, completed, failed)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```
#### Transcript
```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```
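
A minimal SQLAlchemy 2.0 sketch of the Transcript model above, using the registry pattern and JSONB columns named in the architecture section; only a column subset is shown, and names mirror the JSON schema rather than the actual codebase:

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy import DateTime, Float, String, Text
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column, registry

mapper_registry = registry()

@mapper_registry.mapped
class Transcript:
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    pipeline_version: Mapped[str] = mapped_column(String(8))
    raw_content: Mapped[dict] = mapped_column(JSONB)  # raw Whisper output
    enhanced_content: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
    text_content: Mapped[str] = mapped_column(Text)   # plain text for search
    accuracy_estimate: Mapped[float | None] = mapped_column(Float, nullable=True)
    created_at: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), default=lambda: datetime.now(timezone.utc)
    )
```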
### State Transitions
#### YouTubeVideo State Machine
```
[url_provided] → [metadata_extracting] → [metadata_complete]
[url_provided] → [metadata_extracting] → [metadata_failed]
```
#### MediaFile State Machine
```
[pending] → [downloading] → [completed]
[pending] → [downloading] → [failed] → [retry] → [downloading]
```
#### Transcript State Machine
```
[processing] → [completed]
[processing] → [failed] → [retry] → [processing]
[completed] → [enhancing] → [enhanced]
[enhanced] → [final]
```
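
These diagrams can be enforced with a simple transition table; an illustrative guard for the Transcript machine, with states taken directly from the diagram above:

```python
# Allowed next-states per the Transcript state machine diagram.
TRANSCRIPT_TRANSITIONS: dict[str, set[str]] = {
    "processing": {"completed", "failed"},
    "failed": {"retry"},
    "retry": {"processing"},
    "completed": {"enhancing"},
    "enhancing": {"enhanced"},
    "enhanced": {"final"},
}

def can_transition(current: str, nxt: str) -> bool:
    """Reject any state change the diagram does not allow."""
    return nxt in TRANSCRIPT_TRANSITIONS.get(current, set())
```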
### Data Validation Rules
- **Rule 1**: File size must be ≤500MB for processing (see the validation sketch after this list)
- **Rule 2**: Audio duration must be >0.1 seconds (not silent)
- **Rule 3**: Transcript must contain at least one segment
- **Rule 4**: Processing time must be >0 and <3600 seconds (1 hour)
- **Rule 5**: YouTube ID must be unique in database
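
A sketch of Rules 1-2 as a pre-flight check; error strings reuse the messages defined in Feature 2:

```python
MAX_FILE_BYTES = 500 * 1024 * 1024  # Rule 1: <=500MB
MIN_DURATION_SECONDS = 0.1          # Rule 2: longer than 0.1s

def validate_media(file_size_bytes: int, duration_seconds: float) -> list[str]:
    """Return all validation errors rather than failing on the first one."""
    errors = []
    if file_size_bytes > MAX_FILE_BYTES:
        errors.append("File too large, max 500MB")
    if duration_seconds <= MIN_DURATION_SECONDS:
        errors.append("File too short or silent")
    return errors
```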
## 🧪 Testing Requirements
### Unit Tests
- [ ] `test_youtube_metadata_extractor`: Extract metadata using curl and regex patterns
- [ ] `test_media_downloader`: Download from various sources
- [ ] `test_audio_preprocessor`: Convert to 16kHz mono WAV
- [ ] `test_whisper_service`: Basic transcription functionality
- [ ] `test_enhancement_service`: AI enhancement with DeepSeek
- [ ] `test_batch_processor`: Parallel file processing with error tracking
### Integration Tests
- [ ] `test_pipeline_v1`: End-to-end v1 transcription
- [ ] `test_pipeline_v2`: End-to-end v2 with enhancement
- [ ] `test_batch_processing`: Process 10 files in parallel
- [ ] `test_database_operations`: PostgreSQL CRUD operations
- [ ] `test_export_formats`: JSON and TXT export functionality
### Edge Cases
- [ ] Silent audio file: Should detect and report appropriately
- [ ] Corrupted media file: Should handle gracefully with clear error
- [ ] Network interruption during download: Should retry automatically
- [ ] Large file (>10 minutes): Should chunk automatically
- [ ] Memory pressure: Should handle gracefully with resource limits
- [ ] Poor audio quality: Should warn user about accuracy expectations
## 🚀 Implementation Phases
### Phase 1: Core Foundation (Weeks 1-2)
**Goal**: Basic transcription working with CLI
- [ ] PostgreSQL database setup with JSONB - Schema created and tested
- [ ] YouTube metadata extraction with curl - Extract title, channel, description, duration using regex patterns
- [ ] Basic Whisper integration (v1) - 95% accuracy on test files
- [ ] Batch processing system - Handle 10+ files in parallel with error tracking
- [ ] CLI implementation with Click - All commands functional (see the skeleton after this list)
- [ ] JSON/TXT export functionality - Both formats working
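
A minimal Click skeleton for the Phase 1 commands; command names come from the acceptance criteria (trax youtube, trax transcribe, trax batch), and the bodies are placeholders:

```python
import click

@click.group()
def cli():
    """Trax: personal research transcription tool."""

@cli.command()
@click.argument("url")
def youtube(url: str) -> None:
    """Extract YouTube metadata for URL (Feature 1)."""
    click.echo(f"Extracting metadata for {url} ...")

@cli.command()
@click.argument("file", type=click.Path(exists=True))
def transcribe(file: str) -> None:
    """Transcribe a downloaded media file (Feature 2)."""
    click.echo(f"Transcribing {file} ...")

@cli.command()
@click.argument("folder", type=click.Path(exists=True, file_okay=False))
@click.option("--workers", default=8, show_default=True, help="Parallel workers.")
def batch(folder: str, workers: int) -> None:
    """Batch-transcribe every supported file in FOLDER."""
    click.echo(f"Processing {folder} with {workers} workers ...")

if __name__ == "__main__":
    cli()
```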
### Phase 2: Enhancement (Week 3)
**Goal**: AI enhancement working reliably
- [ ] DeepSeek integration - API calls working with retry logic
- [ ] Enhancement templates - Structured prompts for technical content
- [ ] Progress tracking - Real-time updates in CLI
- [ ] Quality validation - Compare before/after accuracy
- [ ] Error recovery - Handle API failures gracefully
### Phase 3: Roadmap - Multi-Pass Accuracy (v3)
**Goal**: Multi-pass accuracy improvements
- [ ] Multi-pass implementation - 3 passes with different parameters
- [ ] Confidence scoring - Per-segment confidence metrics
- [ ] Segment merging - Best segment selection algorithm
- [ ] Performance optimization - 3x speed improvement over v1
- [ ] Memory management - Handle large files efficiently
### Phase 4: Roadmap - Speaker Diarization (v4)
**Goal**: Speaker diarization and scaling
- [ ] Speaker diarization - 90% speaker identification accuracy
- [ ] Voice embedding database - Speaker profile storage
- [ ] Caching layer - 50% cost reduction through caching
- [ ] API endpoints - REST API for integration
- [ ] Production deployment - Monitoring and logging
## 🔒 Security & Constraints
### Security Requirements
- **API Key Management**: Secure storage of Whisper and DeepSeek API keys
- **File Access**: Local file system access only
- **Data Protection**: Encrypted storage for sensitive transcripts
- **Input Sanitization**: Validate all file paths and URLs
### Performance Constraints
- **Response Time**: <30 seconds for 5-minute audio (v1)
- **Throughput**: Process 100+ files in batch
- **Memory Usage**: <8GB peak memory usage (M3 MacBook 16GB)
- **Database Queries**: <1 second for transcript retrieval
- **Parallel Workers**: 8 workers for optimal M3 performance
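
A sketch of the 8-worker constraint using an asyncio.Semaphore; `transcribe_file` is a hypothetical stand-in for the real pipeline entry point:

```python
import asyncio
from pathlib import Path

async def run_batch(files: list[Path], workers: int = 8) -> list:
    """Run transcriptions concurrently, never more than `workers` at once."""
    sem = asyncio.Semaphore(workers)

    async def guarded(path: Path):
        async with sem:
            return await transcribe_file(path)  # hypothetical pipeline call

    # return_exceptions=True keeps one failed file from aborting the batch
    return await asyncio.gather(*(guarded(f) for f in files), return_exceptions=True)
```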
### Technical Constraints
- **File Formats**: mp3, mp4, wav, m4a, webm only
- **File Size**: Maximum 500MB per file
- **Audio Duration**: Maximum 2 hours per file
- **Network**: Download-first, no streaming processing
- **Storage**: Local storage required, no cloud-only processing
- **YouTube**: Curl-based metadata extraction only
## ✅ Definition of Done
### Feature Complete
- [ ] All acceptance criteria met with real test files
- [ ] Unit tests passing with >80% coverage
- [ ] Integration tests passing with actual services
- [ ] Code review completed
- [ ] Documentation updated in rule files
- [ ] Performance benchmarks met
### Ready for Deployment
- [ ] Performance targets achieved (speed, accuracy, memory)
- [ ] Security review completed
- [ ] Error handling tested with edge cases
- [ ] User acceptance testing with real files
- [ ] Rollback plan prepared for each version
- [ ] Monitoring and logging configured
### Trax-Specific Criteria
- [ ] Follows protocol-based architecture
- [ ] Uses download-first approach (no streaming)
- [ ] Implements proper error handling with actionable messages
- [ ] Maintains backward compatibility across versions
- [ ] Uses real files in tests (no mocks)
- [ ] Follows established rule files and patterns
- [ ] Handles tech podcast and academic lecture content effectively
---
*This PRD is specifically designed for a personal research tool focused on tech podcasts, academic lectures, and audiobooks, with clear v1-v2 implementation and v3-v4 roadmap.*