Trax v1-v2 PRD: Personal Research Transcription Tool
🎯 Product Vision
"We're building a personal transcription tool that enables researchers to batch-process tech podcasts, academic lectures, and audiobooks by downloading media locally and running high-accuracy transcription, resulting in searchable, structured text content for study and research."
🏗️ System Architecture Overview
Core Components
- Data Layer: PostgreSQL with JSONB, SQLAlchemy registry pattern
- Business Logic: Protocol-based services, async/await throughout
- Interface Layer: CLI-first with Click, batch processing focus
- Integration Layer: Download-first architecture, curl-based YouTube metadata
System Boundaries
- What's In Scope: Local media processing, YouTube metadata extraction, batch transcription, JSON/TXT export
- What's Out of Scope: Real-time streaming, web UI, multi-user support, cloud processing
- Integration Points: Whisper API, DeepSeek API, FFmpeg, PostgreSQL, YouTube (curl)
👥 User Profile
Primary User: Personal Researcher
- Role: Individual researcher processing educational content
- Content Types: Tech podcasts, academic lectures, audiobooks
- Workflow: Batch URL collection → Download → Transcribe → Study
- Goals: High accuracy transcripts, fast processing, searchable content
- Constraints: Local storage, API costs, processing time
🔧 Functional Requirements
Feature 1: YouTube URL Processing
Purpose
Extract metadata from YouTube URLs using curl to avoid API complexity
User Stories
- As a researcher, I want to provide YouTube URLs, so that I can get video metadata and download links
- As a researcher, I want to batch process multiple URLs, so that I can queue up content for transcription
Acceptance Criteria
- Given a YouTube URL, When I run trax youtube <url>, Then I get title, channel, description, and duration
- Given a list of URLs, When I run trax batch-urls <file>, Then all metadata is extracted and stored
- Given invalid URLs, When I process them, Then clear error messages are shown
Input Validation Rules
- URL format: Must be valid YouTube URL - Error: "Invalid YouTube URL"
- URL accessibility: Must be publicly accessible - Error: "Video not accessible"
- Rate limiting: Max 10 URLs per minute - Error: "Rate limit exceeded"
Business Logic Rules
- Rule 1: Use curl with user-agent to avoid blocking
- Rule 2: Extract metadata using regex patterns targeting ytInitialPlayerResponse and ytInitialData objects
- Rule 3: Store metadata in PostgreSQL for future reference
- Rule 4: Generate unique filenames based on video ID and title
- Rule 5: Handle escaped characters in titles and descriptions using Perl regex patterns
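A minimal sketch of Rules 1 and 2, assuming the watch page is fetched by shelling out to curl from Python; the regex, field names, and helpers are illustrative rather than the final implementation:

```python
# Illustrative only: fetch the watch page with curl (Rule 1) and pull the fields
# the PRD stores out of the embedded ytInitialPlayerResponse object (Rule 2).
import json
import re
import subprocess

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"  # Rule 1: realistic UA

def fetch_watch_page(url: str) -> str:
    """Download the raw HTML of a YouTube watch page via curl."""
    result = subprocess.run(
        ["curl", "-sL", "-A", USER_AGENT, url],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout

def extract_metadata(html: str) -> dict:
    """Read title, channel, description, and duration from ytInitialPlayerResponse."""
    # Simplified pattern; the real extractor must tolerate escaped characters (Rule 5).
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;", html, re.DOTALL)
    if not match:
        raise ValueError("ytInitialPlayerResponse not found in page")
    details = json.loads(match.group(1)).get("videoDetails", {})
    return {
        "youtube_id": details.get("videoId"),
        "title": details.get("title"),
        "channel": details.get("author"),
        "description": details.get("shortDescription", ""),
        "duration_seconds": int(details.get("lengthSeconds", 0)),
    }
```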
Error Handling
- Network Error: Retry up to 3 times with exponential backoff
- Invalid URL: Skip and continue with remaining URLs
- Rate Limited: Wait 60 seconds before retrying
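The retry policy above can be a small helper shared by every network call; the delay schedule (1s, 2s, 4s) is an assumption within the stated three-attempt limit:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure, retry with exponentially growing delays (1s, 2s, 4s)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))
```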
Feature 2: Local Media Transcription (v1)
Purpose
High-accuracy transcription of downloaded media files using Whisper
User Stories
- As a researcher, I want to transcribe downloaded media, so that I can study the content
- As a researcher, I want batch processing, so that I can process multiple files efficiently
Acceptance Criteria
- Given a downloaded media file, When I run trax transcribe <file>, Then I get a 95%+ accuracy transcript in <30 seconds
- Given a folder of media files, When I run trax batch <folder>, Then all files are processed with progress tracking
- Given poor audio quality, When I transcribe, Then I get a quality warning with an accuracy estimate
Input Validation Rules
- File format: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- File size: ≤500MB - Error: "File too large, max 500MB"
- Audio duration: >0.1 seconds - Error: "File too short or silent"
Business Logic Rules
- Rule 1: Always download media before processing (no streaming)
- Rule 2: Convert audio to 16kHz mono WAV for Whisper
- Rule 3: Use distil-large-v3 model with M3 optimizations
- Rule 4: Store results in PostgreSQL with JSONB for transcripts
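A sketch of Rules 2 and 3, assuming FFmpeg is on PATH and the model runs through the faster-whisper package; the int8 compute type for Apple Silicon is an assumption:

```python
import subprocess
from faster_whisper import WhisperModel  # assumption: faster-whisper backend

def to_whisper_wav(src: str, dst: str) -> str:
    """Rule 2: convert any supported input to 16 kHz mono PCM WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst],
        check=True, capture_output=True,
    )
    return dst

def transcribe(wav_path: str) -> list[dict]:
    """Rule 3: run distil-large-v3; int8 keeps peak memory well under the 8 GB budget."""
    model = WhisperModel("distil-large-v3", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(wav_path)
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
```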
Error Handling
- Whisper Memory Error: Implement chunking for files >10 minutes
- Audio Quality: Warn user if estimated accuracy <80%
- Processing Failure: Save partial results, allow retry from last successful stage
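One way to implement the chunking fallback for long files is FFmpeg's segment muxer; the 10-minute window comes from the rule above, the rest is illustrative:

```python
import glob
import os
import subprocess

def split_into_chunks(wav_path: str, out_dir: str, chunk_seconds: int = 600) -> list[str]:
    """Split a long WAV into ~10-minute pieces so Whisper never holds the whole file."""
    os.makedirs(out_dir, exist_ok=True)
    pattern = os.path.join(out_dir, "chunk_%03d.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_path, "-f", "segment",
         "-segment_time", str(chunk_seconds), "-c", "copy", pattern],
        check=True, capture_output=True,
    )
    return sorted(glob.glob(os.path.join(out_dir, "chunk_*.wav")))
```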
Feature 3: AI Enhancement (v2)
Purpose
Improve transcript accuracy and readability using DeepSeek
User Stories
- As a researcher, I want enhanced transcripts, so that technical terms and punctuation are correct
- As a researcher, I want to compare original vs enhanced, so that I can verify improvements
Acceptance Criteria
- Given a v1 transcript, When I run enhancement, Then accuracy improves to ≥99%
- Given an enhanced transcript, When I compare to original, Then no content is lost
- Given enhancement fails, When I retry, Then original transcript is preserved
Input Validation Rules
- Transcript format: Must be valid JSON with segments - Error: "Invalid transcript format"
- Enhancement model: Must be available (DeepSeek API key) - Error: "Enhancement service unavailable"
Business Logic Rules
- Rule 1: Preserve timestamps and speaker markers during enhancement
- Rule 2: Use structured enhancement prompts for technical content
- Rule 3: Cache enhancement results for 7 days to reduce costs
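A sketch of Rules 2 and 3: a structured prompt that forbids content changes, plus a 7-day on-disk cache keyed by a hash of the transcript. The DeepSeek call itself is omitted; the prompt wording and cache layout are assumptions:

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path(".trax_cache/enhancements")
CACHE_TTL_SECONDS = 7 * 24 * 3600  # Rule 3: reuse results for 7 days

ENHANCEMENT_PROMPT = (
    "Correct punctuation, casing, and technical terminology in the transcript "
    "segments below. Do not add, remove, or reorder segments, and keep every "
    "timestamp and speaker marker exactly as given."  # Rules 1 and 2
)

def cached_enhancement(transcript_json: str) -> str | None:
    """Return a cached enhancement if one exists and is younger than the TTL."""
    key = hashlib.sha256(transcript_json.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists() and time.time() - path.stat().st_mtime < CACHE_TTL_SECONDS:
        return path.read_text()
    return None

def store_enhancement(transcript_json: str, enhanced_json: str) -> None:
    """Persist an enhancement so identical transcripts are not re-billed."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(transcript_json.encode()).hexdigest()
    (CACHE_DIR / f"{key}.json").write_text(enhanced_json)
```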
Error Handling
- API Rate Limit: Queue enhancement for later processing
- Enhancement Failure: Return original transcript with error flag
- Content Loss: Validate enhancement preserves original length ±5%
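The content-loss check reduces to a single comparison; the ±5% bound comes from the rule above, the word-count basis is an assumption:

```python
def preserves_content(original_text: str, enhanced_text: str, tolerance: float = 0.05) -> bool:
    """Flag enhancements whose word count drifts more than ±5% from the original."""
    original_words = len(original_text.split())
    enhanced_words = len(enhanced_text.split())
    if original_words == 0:
        return enhanced_words == 0
    return abs(enhanced_words - original_words) / original_words <= tolerance
```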
🖥️ User Interface Flows
Flow 1: YouTube URL Processing
Screen 1: URL Input
- Purpose: User provides YouTube URLs for processing
- Elements:
- URL input: Text input - Single URL or file path
- Batch option: Flag - --batch for multiple URLs
- Output format: Flag - --json, --txt for metadata export
- Actions:
- Enter: Process URL → Metadata display
- Ctrl+C: Cancel operation → Return to prompt
- Validation: URL format, accessibility, rate limits
- Error States: "Invalid URL", "Video not accessible", "Rate limit exceeded"
Screen 2: Metadata Display
- Purpose: Show extracted metadata and download options
- Elements:
- Video info: Text display - Title, channel, duration
- Download option: Flag - --download to save media
- Queue option: Flag - --queue for batch processing
- Actions:
- Download: Save media file → Download progress
- Queue: Add to batch queue → Queue confirmation
- Next: Process another URL → Return to input
Flow 2: Batch Transcription
Screen 1: Batch Command Input
- Purpose: User initiates batch transcription
- Elements:
- Directory path: Text input - Folder containing media files
- Pipeline version: Flag - --v1, --v2 (default: v1)
- Parallel workers: Flag - --workers (default: 8 for M3 MacBook)
- Quality threshold: Flag - --min-accuracy (default: 80%)
- Actions:
- Enter: Start batch processing → Progress tracking
- Preview: Show file list → File list display
- Validation: Directory exists, contains supported files
Screen 2: Batch Progress
- Purpose: Show real-time processing status
- Elements:
- Overall progress: Progress bar - Total batch completion
- Current file: Text display - Currently processing file
- Quality metrics: Text display - Accuracy estimates
- Queue status: Text display - Files remaining, completed, failed
- Actions:
- Continue: Automatic progression → Results summary
- Pause: Suspend processing → Resume option
- Validation: Updates every 5 seconds, shows quality warnings
Screen 3: Results Summary
- Purpose: Show batch processing results
- Elements:
- Success count: Text display - Files processed successfully
- Failure count: Text display - Files that failed
- Quality report: Text display - Average accuracy, warnings
- Export options: Buttons - JSON, TXT, SRT formats
- Actions:
- Export: Save all transcripts → Export confirmation
- Retry failed: Re-process failed files → Retry progress
- New batch: Start over → Return to input
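A hedged sketch of how Flow 2's entry command could map onto Click; the option names mirror the flags listed above, but the command surface is not final:

```python
import click

@click.group()
def trax() -> None:
    """Trax command-line entry point."""

@trax.command()
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
@click.option("--v1", "pipeline", flag_value="v1", default=True, help="Base pipeline (default).")
@click.option("--v2", "pipeline", flag_value="v2", help="Pipeline with AI enhancement.")
@click.option("--workers", default=8, show_default=True, help="Parallel workers (tuned for M3).")
@click.option("--min-accuracy", default=0.8, show_default=True, help="Warn below this accuracy estimate.")
def batch(directory: str, pipeline: str, workers: int, min_accuracy: float) -> None:
    """Transcribe every supported media file in DIRECTORY."""
    click.echo(f"Processing {directory} with pipeline {pipeline} using {workers} workers")

if __name__ == "__main__":
    trax()
```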
🔄 Data Flow & State Management
Data Models (PostgreSQL Schema)
YouTubeVideo
{
"id": "UUID (required, primary key)",
"youtube_id": "string (required, unique)",
"title": "string (required)",
"channel": "string (required)",
"description": "text (optional)",
"duration_seconds": "integer (required)",
"url": "string (required)",
"metadata_extracted_at": "timestamp (auto-generated)",
"created_at": "timestamp (auto-generated)"
}
MediaFile
{
"id": "UUID (required, primary key)",
"youtube_video_id": "UUID (optional, foreign key)",
"local_path": "string (required, file location)",
"media_type": "string (required, mp3, mp4, wav, etc.)",
"duration_seconds": "integer (optional)",
"file_size_bytes": "bigint (required)",
"download_status": "enum (pending, downloading, completed, failed)",
"created_at": "timestamp (auto-generated)",
"updated_at": "timestamp (auto-updated)"
}
Transcript
{
"id": "UUID (required, primary key)",
"media_file_id": "UUID (required, foreign key)",
"pipeline_version": "string (required, v1, v2)",
"raw_content": "JSONB (required, Whisper output)",
"enhanced_content": "JSONB (optional, AI enhanced)",
"text_content": "text (required, plain text for search)",
"model_used": "string (required, whisper model version)",
"processing_time_ms": "integer (required)",
"word_count": "integer (required)",
"accuracy_estimate": "float (optional, 0.0-1.0)",
"quality_warnings": "string array (optional)",
"processing_metadata": "JSONB (optional, version-specific data)",
"created_at": "timestamp (auto-generated)",
"enhanced_at": "timestamp (optional)",
"updated_at": "timestamp (auto-updated)"
}
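A sketch of how the Transcript schema above might be declared with SQLAlchemy 2.0 typed mappings and the PostgreSQL JSONB type; the column subset, table names, and use of DeclarativeBase (rather than the project's registry wiring) are illustrative:

```python
import uuid
from datetime import datetime

from sqlalchemy import ForeignKey, func
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Transcript(Base):
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    media_file_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("media_files.id"))
    pipeline_version: Mapped[str]                                  # "v1" or "v2"
    raw_content: Mapped[dict] = mapped_column(JSONB)               # Whisper output
    enhanced_content: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
    text_content: Mapped[str]                                      # plain text for search
    accuracy_estimate: Mapped[float | None]
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
```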
State Transitions
YouTubeVideo State Machine
[url_provided] → [metadata_extracting] → [metadata_complete]
[url_provided] → [metadata_extracting] → [metadata_failed]
MediaFile State Machine
[pending] → [downloading] → [completed]
[pending] → [downloading] → [failed] → [retry] → [downloading]
Transcript State Machine
[processing] → [completed]
[processing] → [failed] → [retry] → [processing]
[completed] → [enhancing] → [enhanced]
[enhanced] → [final]
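The transcript state machine above can be enforced with a small transition table; state names match the diagram, the guard helper is an assumption:

```python
TRANSCRIPT_TRANSITIONS: dict[str, set[str]] = {
    "processing": {"completed", "failed"},
    "failed": {"retry"},
    "retry": {"processing"},
    "completed": {"enhancing"},
    "enhancing": {"enhanced"},
    "enhanced": {"final"},
    "final": set(),
}

def advance(current: str, target: str) -> str:
    """Move a transcript to a new state, rejecting transitions not in the diagram."""
    if target not in TRANSCRIPT_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target
```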
Data Validation Rules
- Rule 1: File size must be ≤500MB for processing
- Rule 2: Audio duration must be >0.1 seconds (not silent)
- Rule 3: Transcript must contain at least one segment
- Rule 4: Processing time must be >0 and <3600 seconds (1 hour)
- Rule 5: YouTube ID must be unique in database
🧪 Testing Requirements
Unit Tests
- test_youtube_metadata_extractor: Extract metadata using curl and regex patterns
- test_media_downloader: Download from various sources
- test_audio_preprocessor: Convert to 16kHz mono WAV
- test_whisper_service: Basic transcription functionality
- test_enhancement_service: AI enhancement with DeepSeek
- test_batch_processor: Parallel file processing with error tracking
Integration Tests
- test_pipeline_v1: End-to-end v1 transcription
- test_pipeline_v2: End-to-end v2 with enhancement
- test_batch_processing: Process 10 files in parallel
- test_database_operations: PostgreSQL CRUD operations
- test_export_formats: JSON and TXT export functionality
Edge Cases
- Silent audio file: Should detect and report appropriately
- Corrupted media file: Should handle gracefully with clear error
- Network interruption during download: Should retry automatically
- Large file (>10 minutes): Should chunk automatically
- Memory pressure: Should handle gracefully with resource limits
- Poor audio quality: Should warn user about accuracy expectations
🚀 Implementation Phases
Phase 1: Core Foundation (Weeks 1-2)
Goal: Basic transcription working with CLI
- PostgreSQL database setup with JSONB - Schema created and tested
- YouTube metadata extraction with curl - Extract title, channel, description, duration using regex patterns
- Basic Whisper integration (v1) - 95% accuracy on test files
- Batch processing system - Handle 10+ files in parallel with error tracking
- CLI implementation with Click - All commands functional
- JSON/TXT export functionality - Both formats working
Phase 2: Enhancement (Week 3)
Goal: AI enhancement working reliably
- DeepSeek integration - API calls working with retry logic
- Enhancement templates - Structured prompts for technical content
- Progress tracking - Real-time updates in CLI
- Quality validation - Compare before/after accuracy
- Error recovery - Handle API failures gracefully
Phase 3: Roadmap - Multi-Pass Accuracy (v3)
Goal: Multi-pass accuracy improvements
- Multi-pass implementation - 3 passes with different parameters
- Confidence scoring - Per-segment confidence metrics
- Segment merging - Best segment selection algorithm
- Performance optimization - 3x speed improvement over v1
- Memory management - Handle large files efficiently
Phase 4: Roadmap - Speaker Diarization (v4)
Goal: Speaker diarization and scaling
- Speaker diarization - 90% speaker identification accuracy
- Voice embedding database - Speaker profile storage
- Caching layer - 50% cost reduction through caching
- API endpoints - REST API for integration
- Production deployment - Monitoring and logging
🔒 Security & Constraints
Security Requirements
- API Key Management: Secure storage of Whisper and DeepSeek API keys
- File Access: Local file system access only
- Data Protection: Encrypted storage for sensitive transcripts
- Input Sanitization: Validate all file paths and URLs
Performance Constraints
- Response Time: <30 seconds for 5-minute audio (v1)
- Throughput: Process 100+ files in batch
- Memory Usage: <8GB peak memory usage (M3 MacBook 16GB)
- Database Queries: <1 second for transcript retrieval
- Parallel Workers: 8 workers for optimal M3 performance
Technical Constraints
- File Formats: mp3, mp4, wav, m4a, webm only
- File Size: Maximum 500MB per file
- Audio Duration: Maximum 2 hours per file
- Network: Download-first, no streaming processing
- Storage: Local storage required, no cloud-only processing
- YouTube: Curl-based metadata extraction only
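These limits could be collected into one frozen configuration object that every service imports; the values are copied from this section, the structure is an assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TraxLimits:
    """Hard limits from the Performance and Technical Constraints sections."""
    supported_formats: tuple[str, ...] = ("mp3", "mp4", "wav", "m4a", "webm")
    max_file_size_bytes: int = 500 * 1024 * 1024   # 500 MB per file
    max_duration_seconds: int = 2 * 60 * 60        # 2 hours per file
    max_memory_bytes: int = 8 * 1024 ** 3          # 8 GB peak on a 16 GB M3
    parallel_workers: int = 8                      # optimal for M3
    v1_target_seconds: int = 30                    # budget for 5-minute audio

LIMITS = TraxLimits()
```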
✅ Definition of Done
Feature Complete
- All acceptance criteria met with real test files
- Unit tests passing with >80% coverage
- Integration tests passing with actual services
- Code review completed
- Documentation updated in rule files
- Performance benchmarks met
Ready for Deployment
- Performance targets achieved (speed, accuracy, memory)
- Security review completed
- Error handling tested with edge cases
- User acceptance testing with real files
- Rollback plan prepared for each version
- Monitoring and logging configured
Trax-Specific Criteria
- Follows protocol-based architecture
- Uses download-first approach (no streaming)
- Implements proper error handling with actionable messages
- Maintains backward compatibility across versions
- Uses real files in tests (no mocks)
- Follows established rule files and patterns
- Handles tech podcast and academic lecture content effectively
This PRD is specifically designed for a personal research tool focused on tech podcasts, academic lectures, and audiobooks, with clear v1-v2 implementation and v3-v4 roadmap.