# Trax v1-v2 PRD: Personal Research Transcription Tool
## 🎯 Product Vision
*"We're building a personal transcription tool that enables researchers to batch-process tech podcasts, academic lectures, and audiobooks by downloading media locally and running high-accuracy transcription, resulting in searchable, structured text content for study and research."*
## 🏗️ System Architecture Overview
### Core Components
- **Data Layer**: PostgreSQL with JSONB, SQLAlchemy registry pattern
- **Business Logic**: Protocol-based services, async/await throughout
- **Interface Layer**: CLI-first with Click, batch processing focus
- **Integration Layer**: Download-first architecture, curl-based YouTube metadata
### System Boundaries
- **What's In Scope**: Local media processing, YouTube metadata extraction, batch transcription, JSON/TXT export
- **What's Out of Scope**: Real-time streaming, web UI, multi-user support, cloud processing
- **Integration Points**: Whisper API, DeepSeek API, FFmpeg, PostgreSQL, YouTube (curl)
## 👥 User Profile
### Primary User: Personal Researcher
- **Role**: Individual researcher processing educational content
- **Content Types**: Tech podcasts, academic lectures, audiobooks
- **Workflow**: Batch URL collection → Download → Transcribe → Study
- **Goals**: High accuracy transcripts, fast processing, searchable content
- **Constraints**: Local storage, API costs, processing time
## 🔧 Functional Requirements
### Feature 1: YouTube URL Processing
#### Purpose
Extract metadata from YouTube URLs using curl to avoid API complexity
#### User Stories
- **As a** researcher, **I want** to provide YouTube URLs, **so that** I can get video metadata and download links
- **As a** researcher, **I want** to batch process multiple URLs, **so that** I can queue up content for transcription
#### Acceptance Criteria
- [ ] **Given** a YouTube URL, **When** I run `trax youtube <url>`, **Then** I get title, channel, description, and duration
- [ ] **Given** a list of URLs, **When** I run `trax batch-urls <file>`, **Then** all metadata is extracted and stored
- [ ] **Given** invalid URLs, **When** I process them, **Then** clear error messages are shown
#### Input Validation Rules
- **URL format**: Must be valid YouTube URL - Error: "Invalid YouTube URL"
- **URL accessibility**: Must be publicly accessible - Error: "Video not accessible"
- **Rate limiting**: Max 10 URLs per minute - Error: "Rate limit exceeded"
#### Business Logic Rules
- **Rule 1**: Fetch pages with curl using a browser-like user-agent to avoid blocking
- **Rule 2**: Extract metadata with regex patterns targeting the embedded ytInitialPlayerResponse and ytInitialData objects (see the sketch after this list)
- **Rule 3**: Store metadata in PostgreSQL for future reference
- **Rule 4**: Generate unique filenames based on video ID and title
- **Rule 5**: Unescape characters in titles and descriptions using Perl-compatible regex (PCRE) substitutions
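
A minimal sketch of Rules 1-2, assuming curl is shelled out via subprocess; the function name and regex are illustrative, not the shipped implementation, and a production extractor would need to tolerate nested braces inside the embedded JSON:

```python
import json
import re
import subprocess

# Browser-like user-agent per Rule 1; any current desktop UA string works.
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"

def fetch_video_metadata(url: str) -> dict:
    """Illustrative: fetch a watch page with curl and parse ytInitialPlayerResponse."""
    html = subprocess.run(
        ["curl", "-sL", "-A", USER_AGENT, url],
        capture_output=True, text=True, check=True,
    ).stdout
    # Naive non-greedy match; real pages may need a brace-balancing parser.
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\});", html, re.DOTALL)
    if not match:
        raise ValueError("Invalid YouTube URL")  # error message from Feature 1
    details = json.loads(match.group(1))["videoDetails"]
    return {
        "youtube_id": details["videoId"],
        "title": details["title"],
        "channel": details["author"],
        "description": details.get("shortDescription", ""),
        "duration_seconds": int(details["lengthSeconds"]),
    }
```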
#### Error Handling
- **Network Error**: Retry up to 3 times with exponential backoff (see the sketch after this list)
- **Invalid URL**: Skip and continue with remaining URLs
- **Rate Limited**: Wait 60 seconds before retrying
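
A sketch of the retry policy, assuming the async service layer named in the architecture section; `fetch` stands in for any network call:

```python
import asyncio
import random

async def with_retries(fetch, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a coroutine factory with exponential backoff (1s, 2s, 4s) plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error to the caller
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())
```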
### Feature 2: Local Media Transcription (v1)
#### Purpose
High-accuracy transcription of downloaded media files using Whisper
#### User Stories
- **As a** researcher, **I want** to transcribe downloaded media, **so that** I can study the content
- **As a** researcher, **I want** batch processing, **so that** I can process multiple files efficiently
#### Acceptance Criteria
- [ ] **Given** a downloaded media file, **When** I run `trax transcribe <file>`, **Then** I get 95%+ accuracy transcript in <30 seconds
- [ ] **Given** a folder of media files, **When** I run `trax batch <folder>`, **Then** all files are processed with progress tracking
- [ ] **Given** poor audio quality, **When** I transcribe, **Then** I get a quality warning with accuracy estimate
#### Input Validation Rules
- **File format**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **File size**: Must be ≤500MB - Error: "File too large, max 500MB"
- **Audio duration**: >0.1 seconds - Error: "File too short or silent"
#### Business Logic Rules
- **Rule 1**: Always download media before processing (no streaming)
- **Rule 2**: Convert audio to 16kHz mono WAV for Whisper (see the FFmpeg sketch after this list)
- **Rule 3**: Use distil-large-v3 model with M3 optimizations
- **Rule 4**: Store results in PostgreSQL with JSONB for transcripts
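
A sketch of Rule 2, assuming FFmpeg is available on PATH; the flags are standard FFmpeg options for 16kHz mono 16-bit PCM:

```python
import subprocess
from pathlib import Path

def to_whisper_wav(src: Path, dst: Path) -> Path:
    """Convert any supported media file to the 16kHz mono WAV Whisper expects."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ar", "16000",       # resample to 16 kHz
         "-ac", "1",           # downmix to mono
         "-c:a", "pcm_s16le",  # 16-bit PCM WAV
         str(dst)],
        check=True, capture_output=True,
    )
    return dst
```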
#### Error Handling
- **Whisper Memory Error**: Implement chunking for files >10 minutes (see the chunking sketch after this list)
- **Audio Quality**: Warn user if estimated accuracy <80%
- **Processing Failure**: Save partial results, allow retry from last successful stage
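
One possible chunking approach, using FFmpeg's segment muxer; the 10-minute default mirrors the threshold above, and the helper name is illustrative:

```python
import subprocess
from pathlib import Path

def chunk_audio(src: Path, out_dir: Path, chunk_seconds: int = 600) -> list[Path]:
    """Split long audio into ~10-minute chunks so Whisper never sees a huge file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-f", "segment", "-segment_time", str(chunk_seconds),
         "-c", "copy",  # no re-encode; input is already 16kHz mono WAV
         str(out_dir / "chunk_%03d.wav")],
        check=True, capture_output=True,
    )
    return sorted(out_dir.glob("chunk_*.wav"))
```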
### Feature 3: AI Enhancement (v2)
#### Purpose
Improve transcript accuracy and readability using DeepSeek
#### User Stories
- **As a** researcher, **I want** enhanced transcripts, **so that** technical terms and punctuation are correct
- **As a** researcher, **I want** to compare original vs enhanced, **so that** I can verify improvements
#### Acceptance Criteria
- [ ] **Given** a v1 transcript, **When** I run enhancement, **Then** accuracy improves to 99%
- [ ] **Given** an enhanced transcript, **When** I compare to original, **Then** no content is lost
- [ ] **Given** enhancement fails, **When** I retry, **Then** original transcript is preserved
#### Input Validation Rules
- **Transcript format**: Must be valid JSON with segments - Error: "Invalid transcript format"
- **Enhancement model**: Must be available (DeepSeek API key) - Error: "Enhancement service unavailable"
#### Business Logic Rules
- **Rule 1**: Preserve timestamps and speaker markers during enhancement
- **Rule 2**: Use structured enhancement prompts for technical content
- **Rule 3**: Cache enhancement results for 7 days to reduce costs
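
A sketch of the 7-day cache in Rule 3, assuming results are keyed by a content hash; the key format and helper names are illustrative:

```python
import hashlib
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=7)  # Rule 3: cache enhancements for 7 days

def enhancement_cache_key(transcript_text: str, prompt_version: str) -> str:
    """Same transcript + same prompt version -> same key, so retries hit the cache."""
    digest = hashlib.sha256(f"{prompt_version}:{transcript_text}".encode()).hexdigest()
    return f"enhance:{digest}"

def cache_entry_fresh(created_at: datetime) -> bool:
    return datetime.now(timezone.utc) - created_at < CACHE_TTL
```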
#### Error Handling
- **API Rate Limit**: Queue enhancement for later processing
- **Enhancement Failure**: Return original transcript with error flag
- **Content Loss**: Validate that enhancement preserves the original length within ±5%
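
The ±5% content-loss guard can be a simple length check; a sketch:

```python
def enhancement_preserves_content(original: str, enhanced: str,
                                  tolerance: float = 0.05) -> bool:
    """Reject enhancements whose length drifts more than ±5% from the original."""
    if not original:
        return False
    return abs(len(enhanced) - len(original)) / len(original) <= tolerance
```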
## 🖥️ User Interface Flows
### Flow 1: YouTube URL Processing
#### Screen 1: URL Input
- **Purpose**: User provides YouTube URLs for processing
- **Elements**:
  - URL input: Text input - Single URL or file path
  - Batch option: Flag - --batch for multiple URLs
  - Output format: Flag - --json, --txt for metadata export
- **Actions**:
  - Enter: Process URL → Metadata display
  - Ctrl+C: Cancel operation → Return to prompt
- **Validation**: URL format, accessibility, rate limits
- **Error States**: "Invalid URL", "Video not accessible", "Rate limit exceeded"
#### Screen 2: Metadata Display
- **Purpose**: Show extracted metadata and download options
- **Elements**:
  - Video info: Text display - Title, channel, duration
  - Download option: Flag - --download to save media
  - Queue option: Flag - --queue for batch processing
- **Actions**:
  - Download: Save media file → Download progress
  - Queue: Add to batch queue → Queue confirmation
  - Next: Process another URL → Return to input
### Flow 2: Batch Transcription
#### Screen 1: Batch Command Input
- **Purpose**: User initiates batch transcription
- **Elements**:
  - Directory path: Text input - Folder containing media files
  - Pipeline version: Flag - --v1, --v2 (default: v1)
  - Parallel workers: Flag - --workers (default: 8 for M3 MacBook)
  - Quality threshold: Flag - --min-accuracy (default: 80%)
- **Actions**:
  - Enter: Start batch processing → Progress tracking
  - Preview: Show file list → File list display
- **Validation**: Directory exists, contains supported files
#### Screen 2: Batch Progress
- **Purpose**: Show real-time processing status
- **Elements**:
  - Overall progress: Progress bar - Total batch completion
  - Current file: Text display - Currently processing file
  - Quality metrics: Text display - Accuracy estimates
  - Queue status: Text display - Files remaining, completed, failed
- **Actions**:
  - Continue: Automatic progression → Results summary
  - Pause: Suspend processing → Resume option
- **Validation**: Updates every 5 seconds, shows quality warnings
#### Screen 3: Results Summary
- **Purpose**: Show batch processing results
- **Elements**:
  - Success count: Text display - Files processed successfully
  - Failure count: Text display - Files that failed
  - Quality report: Text display - Average accuracy, warnings
  - Export options: Buttons - JSON, TXT, SRT formats
- **Actions**:
  - Export: Save all transcripts → Export confirmation
  - Retry failed: Re-process failed files → Retry progress
  - New batch: Start over → Return to input
## 🔄 Data Flow & State Management
### Data Models (PostgreSQL Schema)
#### YouTubeVideo
```json
{
  "id": "UUID (required, primary key)",
  "youtube_id": "string (required, unique)",
  "title": "string (required)",
  "channel": "string (required)",
  "description": "text (optional)",
  "duration_seconds": "integer (required)",
  "url": "string (required)",
  "metadata_extracted_at": "timestamp (auto-generated)",
  "created_at": "timestamp (auto-generated)"
}
```
#### MediaFile
```json
{
  "id": "UUID (required, primary key)",
  "youtube_video_id": "UUID (optional, foreign key)",
  "local_path": "string (required, file location)",
  "media_type": "string (required, mp3, mp4, wav, etc.)",
  "duration_seconds": "integer (optional)",
  "file_size_bytes": "bigint (required)",
  "download_status": "enum (pending, downloading, completed, failed)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```
#### Transcript
```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```
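
A minimal SQLAlchemy 2.0 sketch of the Transcript model above, using the registry pattern and JSONB columns named in the architecture section; only a column subset is shown, and names mirror the JSON schema rather than the actual codebase:

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy import DateTime, Float, String, Text
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column, registry

mapper_registry = registry()

@mapper_registry.mapped
class Transcript:
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    pipeline_version: Mapped[str] = mapped_column(String(8))
    raw_content: Mapped[dict] = mapped_column(JSONB)  # raw Whisper output
    enhanced_content: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
    text_content: Mapped[str] = mapped_column(Text)   # plain text for search
    accuracy_estimate: Mapped[float | None] = mapped_column(Float, nullable=True)
    created_at: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), default=lambda: datetime.now(timezone.utc)
    )
```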
### State Transitions
#### YouTubeVideo State Machine
```
[url_provided] → [metadata_extracting] → [metadata_complete]
[url_provided] → [metadata_extracting] → [metadata_failed]
```
#### MediaFile State Machine
```
[pending] → [downloading] → [completed]
[pending] → [downloading] → [failed] → [retry] → [downloading]
```
#### Transcript State Machine
```
[processing] → [completed]
[processing] → [failed] → [retry] → [processing]
[completed] → [enhancing] → [enhanced]
[enhanced] → [final]
```
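
These diagrams can be enforced with a simple transition table; an illustrative guard for the Transcript machine, with states taken directly from the diagram above:

```python
# Allowed next-states per the Transcript state machine diagram.
TRANSCRIPT_TRANSITIONS: dict[str, set[str]] = {
    "processing": {"completed", "failed"},
    "failed": {"retry"},
    "retry": {"processing"},
    "completed": {"enhancing"},
    "enhancing": {"enhanced"},
    "enhanced": {"final"},
}

def can_transition(current: str, nxt: str) -> bool:
    """Reject any state change the diagram does not allow."""
    return nxt in TRANSCRIPT_TRANSITIONS.get(current, set())
```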
### Data Validation Rules
- **Rule 1**: File size must be ≤500MB for processing (see the validation sketch after this list)
- **Rule 2**: Audio duration must be >0.1 seconds (not silent)
- **Rule 3**: Transcript must contain at least one segment
- **Rule 4**: Processing time must be >0 and <3600 seconds (1 hour)
- **Rule 5**: YouTube ID must be unique in database
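
A sketch of Rules 1-2 as a pre-flight check; error strings reuse the messages defined in Feature 2:

```python
MAX_FILE_BYTES = 500 * 1024 * 1024  # Rule 1: <=500MB
MIN_DURATION_SECONDS = 0.1          # Rule 2: longer than 0.1s

def validate_media(file_size_bytes: int, duration_seconds: float) -> list[str]:
    """Return all validation errors rather than failing on the first one."""
    errors = []
    if file_size_bytes > MAX_FILE_BYTES:
        errors.append("File too large, max 500MB")
    if duration_seconds <= MIN_DURATION_SECONDS:
        errors.append("File too short or silent")
    return errors
```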
## 🧪 Testing Requirements
### Unit Tests
- [ ] `test_youtube_metadata_extractor`: Extract metadata using curl and regex patterns
- [ ] `test_media_downloader`: Download from various sources
- [ ] `test_audio_preprocessor`: Convert to 16kHz mono WAV
- [ ] `test_whisper_service`: Basic transcription functionality
- [ ] `test_enhancement_service`: AI enhancement with DeepSeek
- [ ] `test_batch_processor`: Parallel file processing with error tracking
### Integration Tests
- [ ] `test_pipeline_v1`: End-to-end v1 transcription
- [ ] `test_pipeline_v2`: End-to-end v2 with enhancement
- [ ] `test_batch_processing`: Process 10 files in parallel
- [ ] `test_database_operations`: PostgreSQL CRUD operations
- [ ] `test_export_formats`: JSON and TXT export functionality
### Edge Cases
- [ ] Silent audio file: Should detect and report appropriately
- [ ] Corrupted media file: Should handle gracefully with clear error
- [ ] Network interruption during download: Should retry automatically
- [ ] Large file (>10 minutes): Should chunk automatically
- [ ] Memory pressure: Should handle gracefully with resource limits
- [ ] Poor audio quality: Should warn user about accuracy expectations
## 🚀 Implementation Phases
### Phase 1: Core Foundation (Weeks 1-2)
**Goal**: Basic transcription working with CLI
- [ ] PostgreSQL database setup with JSONB - Schema created and tested
- [ ] YouTube metadata extraction with curl - Extract title, channel, description, duration using regex patterns
- [ ] Basic Whisper integration (v1) - 95% accuracy on test files
- [ ] Batch processing system - Handle 10+ files in parallel with error tracking
- [ ] CLI implementation with Click - All commands functional (see the skeleton after this list)
- [ ] JSON/TXT export functionality - Both formats working
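
A minimal Click skeleton for the Phase 1 commands; command names come from the acceptance criteria (trax youtube, trax transcribe, trax batch), and the bodies are placeholders:

```python
import click

@click.group()
def cli():
    """Trax: personal research transcription tool."""

@cli.command()
@click.argument("url")
def youtube(url: str) -> None:
    """Extract YouTube metadata for URL (Feature 1)."""
    click.echo(f"Extracting metadata for {url} ...")

@cli.command()
@click.argument("file", type=click.Path(exists=True))
def transcribe(file: str) -> None:
    """Transcribe a downloaded media file (Feature 2)."""
    click.echo(f"Transcribing {file} ...")

@cli.command()
@click.argument("folder", type=click.Path(exists=True, file_okay=False))
@click.option("--workers", default=8, show_default=True, help="Parallel workers.")
def batch(folder: str, workers: int) -> None:
    """Batch-transcribe every supported file in FOLDER."""
    click.echo(f"Processing {folder} with {workers} workers ...")

if __name__ == "__main__":
    cli()
```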
### Phase 2: Enhancement (Week 3)
**Goal**: AI enhancement working reliably
- [ ] DeepSeek integration - API calls working with retry logic
- [ ] Enhancement templates - Structured prompts for technical content
- [ ] Progress tracking - Real-time updates in CLI
- [ ] Quality validation - Compare before/after accuracy
- [ ] Error recovery - Handle API failures gracefully
### Phase 3: Roadmap - Multi-Pass Accuracy (v3)
**Goal**: Multi-pass accuracy improvements
- [ ] Multi-pass implementation - 3 passes with different parameters
- [ ] Confidence scoring - Per-segment confidence metrics
- [ ] Segment merging - Best segment selection algorithm
- [ ] Performance optimization - 3x speed improvement over v1
- [ ] Memory management - Handle large files efficiently
### Phase 4: Roadmap - Speaker Diarization (v4)
**Goal**: Speaker diarization and scaling
- [ ] Speaker diarization - 90% speaker identification accuracy
- [ ] Voice embedding database - Speaker profile storage
- [ ] Caching layer - 50% cost reduction through caching
- [ ] API endpoints - REST API for integration
- [ ] Production deployment - Monitoring and logging
## 🔒 Security & Constraints
### Security Requirements
- **API Key Management**: Secure storage of Whisper and DeepSeek API keys
- **File Access**: Local file system access only
- **Data Protection**: Encrypted storage for sensitive transcripts
- **Input Sanitization**: Validate all file paths and URLs
### Performance Constraints
- **Response Time**: <30 seconds for 5-minute audio (v1)
- **Throughput**: Process 100+ files in batch
- **Memory Usage**: <8GB peak memory usage (M3 MacBook 16GB)
- **Database Queries**: <1 second for transcript retrieval
- **Parallel Workers**: 8 workers for optimal M3 performance
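
A sketch of the 8-worker constraint using an asyncio.Semaphore; `transcribe_file` is a hypothetical stand-in for the real pipeline entry point:

```python
import asyncio
from pathlib import Path

async def run_batch(files: list[Path], workers: int = 8) -> list:
    """Run transcriptions concurrently, never more than `workers` at once."""
    sem = asyncio.Semaphore(workers)

    async def guarded(path: Path):
        async with sem:
            return await transcribe_file(path)  # hypothetical pipeline call

    # return_exceptions=True keeps one failed file from aborting the batch
    return await asyncio.gather(*(guarded(f) for f in files), return_exceptions=True)
```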
### Technical Constraints
- **File Formats**: mp3, mp4, wav, m4a, webm only
- **File Size**: Maximum 500MB per file
- **Audio Duration**: Maximum 2 hours per file
- **Network**: Download-first, no streaming processing
- **Storage**: Local storage required, no cloud-only processing
- **YouTube**: Curl-based metadata extraction only
## ✅ Definition of Done
### Feature Complete
- [ ] All acceptance criteria met with real test files
- [ ] Unit tests passing with >80% coverage
- [ ] Integration tests passing with actual services
- [ ] Code review completed
- [ ] Documentation updated in rule files
- [ ] Performance benchmarks met
### Ready for Deployment
- [ ] Performance targets achieved (speed, accuracy, memory)
- [ ] Security review completed
- [ ] Error handling tested with edge cases
- [ ] User acceptance testing with real files
- [ ] Rollback plan prepared for each version
- [ ] Monitoring and logging configured
### Trax-Specific Criteria
- [ ] Follows protocol-based architecture
- [ ] Uses download-first approach (no streaming)
- [ ] Implements proper error handling with actionable messages
- [ ] Maintains backward compatibility across versions
- [ ] Uses real files in tests (no mocks)
- [ ] Follows established rule files and patterns
- [ ] Handles tech podcast and academic lecture content effectively
---
*This PRD is specifically designed for a personal research tool focused on tech podcasts, academic lectures, and audiobooks, with clear v1-v2 implementation and v3-v4 roadmap.*