# Trax v1-v2 PRD: Personal Research Transcription Tool

## 🎯 Product Vision

*"We're building a personal transcription tool that enables researchers to batch-process tech podcasts, academic lectures, and audiobooks by downloading media locally and running high-accuracy transcription, resulting in searchable, structured text content for study and research."*

## 🏗️ System Architecture Overview

### Core Components

- **Data Layer**: PostgreSQL with JSONB, SQLAlchemy registry pattern
- **Business Logic**: Protocol-based services, async/await throughout
- **Interface Layer**: CLI-first with Click, batch processing focus
- **Integration Layer**: Download-first architecture, curl-based YouTube metadata

### System Boundaries

- **What's In Scope**: Local media processing, YouTube metadata extraction, batch transcription, JSON/TXT export
- **What's Out of Scope**: Real-time streaming, web UI, multi-user support, cloud processing
- **Integration Points**: Whisper API, DeepSeek API, FFmpeg, PostgreSQL, YouTube (curl)

## 👥 User Profile

### Primary User: Personal Researcher

- **Role**: Individual researcher processing educational content
- **Content Types**: Tech podcasts, academic lectures, audiobooks
- **Workflow**: Batch URL collection → Download → Transcribe → Study
- **Goals**: High-accuracy transcripts, fast processing, searchable content
- **Constraints**: Local storage, API costs, processing time

## 🔧 Functional Requirements

### Feature 1: YouTube URL Processing

#### Purpose

Extract metadata from YouTube URLs using curl to avoid API complexity.

#### User Stories

- **As a** researcher, **I want** to provide YouTube URLs, **so that** I can get video metadata and download links
- **As a** researcher, **I want** to batch-process multiple URLs, **so that** I can queue up content for transcription

#### Acceptance Criteria

- [ ] **Given** a YouTube URL, **When** I run `trax youtube <url>`, **Then** I get title, channel, description, and duration
- [ ] **Given** a list of URLs, **When** I run `trax batch-urls <file>`, **Then** all metadata is extracted and stored
- [ ] **Given** invalid URLs, **When** I process them, **Then** clear error messages are shown

#### Input Validation Rules

- **URL format**: Must be a valid YouTube URL - Error: "Invalid YouTube URL"
- **URL accessibility**: Must be publicly accessible - Error: "Video not accessible"
- **Rate limiting**: Max 10 URLs per minute - Error: "Rate limit exceeded"

#### Business Logic Rules

- **Rule 1**: Use curl with an explicit user-agent header to avoid blocking
- **Rule 2**: Extract metadata with regex patterns targeting the embedded `ytInitialPlayerResponse` and `ytInitialData` objects (see the sketch below)
- **Rule 3**: Store metadata in PostgreSQL for future reference
- **Rule 4**: Generate unique filenames from the video ID and title
- **Rule 5**: Handle escaped characters in titles and descriptions using Perl regex patterns
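
Below is a minimal sketch of Rules 1-2, assuming the metadata lives in the `ytInitialPlayerResponse` object embedded in the watch page; the regex and the `videoDetails` field names are illustrative and fragile by nature, not the final parser.

```python
import json
import re
import subprocess

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"  # Rule 1

def fetch_watch_page(url: str) -> str:
    # curl with an explicit user-agent to avoid blocking (Rule 1).
    result = subprocess.run(
        ["curl", "-sL", "-A", USER_AGENT, url],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout

def extract_metadata(html: str) -> dict:
    # Target the embedded ytInitialPlayerResponse object (Rule 2).
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;", html, re.DOTALL)
    if not match:
        raise ValueError("ytInitialPlayerResponse not found")
    details = json.loads(match.group(1)).get("videoDetails", {})
    return {
        "youtube_id": details.get("videoId"),
        "title": details.get("title"),
        "channel": details.get("author"),
        "description": details.get("shortDescription"),
        "duration_seconds": int(details.get("lengthSeconds", 0)),
    }
```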

#### Error Handling

- **Network Error**: Retry up to 3 times with exponential backoff
- **Invalid URL**: Skip and continue with remaining URLs
- **Rate Limited**: Wait 60 seconds before retrying
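
As a sketch of the network retry policy: the PRD fixes the attempt count at three, while the 1s/2s/4s delays here are an assumption.

```python
import asyncio

async def with_retries(op, *, attempts: int = 3, base_delay: float = 1.0):
    """Retry a network operation up to 3 times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # all retries exhausted; surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)
```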

### Feature 2: Local Media Transcription (v1)

#### Purpose

High-accuracy transcription of downloaded media files using Whisper.

#### User Stories

- **As a** researcher, **I want** to transcribe downloaded media, **so that** I can study the content
- **As a** researcher, **I want** batch processing, **so that** I can process multiple files efficiently

#### Acceptance Criteria

- [ ] **Given** a downloaded media file, **When** I run `trax transcribe <file>`, **Then** I get a 95%+ accuracy transcript in <30 seconds
- [ ] **Given** a folder of media files, **When** I run `trax batch <folder>`, **Then** all files are processed with progress tracking
- [ ] **Given** poor audio quality, **When** I transcribe, **Then** I get a quality warning with an accuracy estimate

#### Input Validation Rules

- **File format**: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
- **File size**: ≤500MB - Error: "File too large, max 500MB"
- **Audio duration**: >0.1 seconds - Error: "File too short or silent"

#### Business Logic Rules

- **Rule 1**: Always download media before processing (no streaming)
- **Rule 2**: Convert audio to 16kHz mono WAV for Whisper
- **Rule 3**: Use the distil-large-v3 model with M3 optimizations (see the sketch below)
- **Rule 4**: Store results in PostgreSQL with JSONB for transcripts
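
A minimal sketch of Rules 2-3, assuming FFmpeg on PATH and the `faster-whisper` package as the backend; the `int8` compute type is an assumption about what "M3 optimizations" means, not a confirmed setting.

```python
import subprocess
from faster_whisper import WhisperModel  # assumed transcription backend

def to_whisper_wav(src: str, dst: str) -> None:
    # Rule 2: convert to 16kHz mono WAV before transcription.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
        check=True, capture_output=True,
    )

def transcribe(wav_path: str) -> list[dict]:
    # Rule 3: distil-large-v3 model; int8 keeps memory modest on a 16GB M3.
    model = WhisperModel("distil-large-v3", compute_type="int8")
    segments, _info = model.transcribe(wav_path)
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
```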

#### Error Handling

- **Whisper Memory Error**: Implement chunking for files >10 minutes (sketched below)
- **Audio Quality**: Warn the user if estimated accuracy <80%
- **Processing Failure**: Save partial results; allow retry from the last successful stage
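
One way to satisfy the chunking rule, assuming fixed 10-minute windows cut with FFmpeg's segment muxer (no overlap handling shown):

```python
import subprocess
from pathlib import Path

def chunk_audio(wav_path: str, out_dir: str, chunk_seconds: int = 600) -> list[str]:
    # Split files longer than 10 minutes into 10-minute WAV chunks so
    # Whisper never has to hold the whole file in memory.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pattern = str(Path(out_dir) / "chunk_%03d.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_path, "-f", "segment",
         "-segment_time", str(chunk_seconds), "-c", "copy", pattern],
        check=True, capture_output=True,
    )
    return sorted(str(p) for p in Path(out_dir).glob("chunk_*.wav"))
```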

### Feature 3: AI Enhancement (v2)

#### Purpose

Improve transcript accuracy and readability using DeepSeek.

#### User Stories

- **As a** researcher, **I want** enhanced transcripts, **so that** technical terms and punctuation are correct
- **As a** researcher, **I want** to compare original vs. enhanced output, **so that** I can verify improvements

#### Acceptance Criteria

- [ ] **Given** a v1 transcript, **When** I run enhancement, **Then** accuracy improves to ≥99%
- [ ] **Given** an enhanced transcript, **When** I compare it to the original, **Then** no content is lost
- [ ] **Given** enhancement fails, **When** I retry, **Then** the original transcript is preserved

#### Input Validation Rules

- **Transcript format**: Must be valid JSON with segments - Error: "Invalid transcript format"
- **Enhancement model**: Must be available (DeepSeek API key set) - Error: "Enhancement service unavailable"

#### Business Logic Rules

- **Rule 1**: Preserve timestamps and speaker markers during enhancement
- **Rule 2**: Use structured enhancement prompts for technical content
- **Rule 3**: Cache enhancement results for 7 days to reduce costs (see the cache sketch below)
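
A sketch of the 7-day result cache (Rule 3), assuming a local JSON file keyed by a hash of the transcript text; the cache location is hypothetical, and a database-backed cache would serve equally well.

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_PATH = Path("~/.trax/enhancement_cache.json").expanduser()  # assumed location
TTL_SECONDS = 7 * 24 * 3600  # Rule 3: keep results for 7 days

def _key(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def get_cached(text: str) -> str | None:
    # Return a cached enhancement if one exists and is younger than 7 days.
    if not CACHE_PATH.exists():
        return None
    entry = json.loads(CACHE_PATH.read_text()).get(_key(text))
    if entry and time.time() - entry["at"] < TTL_SECONDS:
        return entry["enhanced"]
    return None

def put_cached(text: str, enhanced: str) -> None:
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    cache[_key(text)] = {"enhanced": enhanced, "at": time.time()}
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    CACHE_PATH.write_text(json.dumps(cache))
```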

#### Error Handling

- **API Rate Limit**: Queue enhancement for later processing
- **Enhancement Failure**: Return the original transcript with an error flag
- **Content Loss**: Validate that enhancement preserves the original length ±5% (see the check below)
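
A minimal check for the ±5% rule, using word count as the length proxy (the PRD does not pin down which length metric applies):

```python
def preserves_content(original: str, enhanced: str, tolerance: float = 0.05) -> bool:
    # Guard against content loss: enhanced output must stay within
    # ±5% of the original length (word count used as the proxy here).
    orig_words = len(original.split())
    if orig_words == 0:
        return len(enhanced.split()) == 0
    return abs(len(enhanced.split()) - orig_words) / orig_words <= tolerance
```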

## 🖥️ User Interface Flows

### Flow 1: YouTube URL Processing

#### Screen 1: URL Input

- **Purpose**: User provides YouTube URLs for processing
- **Elements**:
  - URL input: Text input - Single URL or file path
  - Batch option: Flag - `--batch` for multiple URLs
  - Output format: Flag - `--json`, `--txt` for metadata export
- **Actions**:
  - Enter: Process URL → Metadata display
  - Ctrl+C: Cancel operation → Return to prompt
- **Validation**: URL format, accessibility, rate limits
- **Error States**: "Invalid URL", "Video not accessible", "Rate limit exceeded"

#### Screen 2: Metadata Display

- **Purpose**: Show extracted metadata and download options
- **Elements**:
  - Video info: Text display - Title, channel, duration
  - Download option: Flag - `--download` to save media
  - Queue option: Flag - `--queue` for batch processing
- **Actions**:
  - Download: Save media file → Download progress
  - Queue: Add to batch queue → Queue confirmation
  - Next: Process another URL → Return to input

### Flow 2: Batch Transcription

#### Screen 1: Batch Command Input

- **Purpose**: User initiates batch transcription (see the Click sketch below)
- **Elements**:
  - Directory path: Text input - Folder containing media files
  - Pipeline version: Flag - `--v1`, `--v2` (default: v1)
  - Parallel workers: Flag - `--workers` (default: 8 for M3 MacBook)
  - Quality threshold: Flag - `--min-accuracy` (default: 80%)
- **Actions**:
  - Enter: Start batch processing → Progress tracking
  - Preview: Show file list → File list display
- **Validation**: Directory exists and contains supported files
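
A hedged Click sketch of this command surface. Flag names mirror the PRD except that the `--v1`/`--v2` toggles are folded into a single `--pipeline` option here, and `run_batch` is a hypothetical entry point.

```python
import click

@click.command("batch")
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
@click.option("--pipeline", type=click.Choice(["v1", "v2"]), default="v1",
              show_default=True, help="Pipeline version.")
@click.option("--workers", default=8, show_default=True,
              help="Parallel workers (8 tuned for an M3 MacBook).")
@click.option("--min-accuracy", default=0.8, show_default=True,
              help="Quality threshold; files below it are flagged.")
def batch(directory: str, pipeline: str, workers: int, min_accuracy: float) -> None:
    """Transcribe every supported media file in DIRECTORY."""
    click.echo(f"Batch {pipeline} over {directory}: {workers} workers, "
               f"min accuracy {min_accuracy:.0%}")
    # run_batch(directory, pipeline, workers, min_accuracy)  # hypothetical
```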

#### Screen 2: Batch Progress

- **Purpose**: Show real-time processing status
- **Elements**:
  - Overall progress: Progress bar - Total batch completion
  - Current file: Text display - Currently processing file
  - Quality metrics: Text display - Accuracy estimates
  - Queue status: Text display - Files remaining, completed, failed
- **Actions**:
  - Continue: Automatic progression → Results summary
  - Pause: Suspend processing → Resume option
- **Validation**: Updates every 5 seconds, shows quality warnings

#### Screen 3: Results Summary

- **Purpose**: Show batch processing results
- **Elements**:
  - Success count: Text display - Files processed successfully
  - Failure count: Text display - Files that failed
  - Quality report: Text display - Average accuracy, warnings
  - Export options: Flags - JSON, TXT, SRT formats
- **Actions**:
  - Export: Save all transcripts → Export confirmation
  - Retry failed: Re-process failed files → Retry progress
  - New batch: Start over → Return to input

## 🔄 Data Flow & State Management

### Data Models (PostgreSQL Schema)

#### YouTubeVideo

```json
{
  "id": "UUID (required, primary key)",
  "youtube_id": "string (required, unique)",
  "title": "string (required)",
  "channel": "string (required)",
  "description": "text (optional)",
  "duration_seconds": "integer (required)",
  "url": "string (required)",
  "metadata_extracted_at": "timestamp (auto-generated)",
  "created_at": "timestamp (auto-generated)"
}
```

#### MediaFile

```json
{
  "id": "UUID (required, primary key)",
  "youtube_video_id": "UUID (optional, foreign key)",
  "local_path": "string (required, file location)",
  "media_type": "string (required, mp3, mp4, wav, etc.)",
  "duration_seconds": "integer (optional)",
  "file_size_bytes": "bigint (required)",
  "download_status": "enum (pending, downloading, completed, failed)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}
```

#### Transcript

```json
{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
```
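
A minimal SQLAlchemy sketch of the Transcript model (the architecture section names the registry pattern; `DeclarativeBase` wraps a registry internally). Only a representative subset of columns is shown, and the table names are assumptions.

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy import ForeignKey, String, Text
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Transcript(Base):
    __tablename__ = "transcripts"  # assumed table name

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    media_file_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("media_files.id"))
    pipeline_version: Mapped[str] = mapped_column(String(8))      # "v1" or "v2"
    raw_content: Mapped[dict] = mapped_column(JSONB)              # Whisper output
    enhanced_content: Mapped[dict | None] = mapped_column(JSONB)  # AI-enhanced
    text_content: Mapped[str] = mapped_column(Text)               # plain text for search
    accuracy_estimate: Mapped[float | None]
    created_at: Mapped[datetime] = mapped_column(
        default=lambda: datetime.now(timezone.utc))
```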

### State Transitions

#### YouTubeVideo State Machine

```
[url_provided] → [metadata_extracting] → [metadata_complete]
[url_provided] → [metadata_extracting] → [metadata_failed]
```

#### MediaFile State Machine

```
[pending] → [downloading] → [completed]
[pending] → [downloading] → [failed] → [retry] → [downloading]
```

#### Transcript State Machine

```
[processing] → [completed]
[processing] → [failed] → [retry] → [processing]
[completed] → [enhancing] → [enhanced]
[enhanced] → [final]
```
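
A small sketch of the Transcript state machine as an allowed-transitions map; the enum names mirror the diagram above, while the enforcement style is an assumption.

```python
from enum import Enum

class TranscriptState(str, Enum):
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRY = "retry"
    ENHANCING = "enhancing"
    ENHANCED = "enhanced"
    FINAL = "final"

# Allowed transitions, straight from the diagram above.
TRANSITIONS: dict[TranscriptState, set[TranscriptState]] = {
    TranscriptState.PROCESSING: {TranscriptState.COMPLETED, TranscriptState.FAILED},
    TranscriptState.FAILED: {TranscriptState.RETRY},
    TranscriptState.RETRY: {TranscriptState.PROCESSING},
    TranscriptState.COMPLETED: {TranscriptState.ENHANCING},
    TranscriptState.ENHANCING: {TranscriptState.ENHANCED},
    TranscriptState.ENHANCED: {TranscriptState.FINAL},
}

def advance(current: TranscriptState, target: TranscriptState) -> TranscriptState:
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target
```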

### Data Validation Rules

- **Rule 1**: File size must be ≤500MB for processing
- **Rule 2**: Audio duration must be >0.1 seconds (not silent)
- **Rule 3**: Transcript must contain at least one segment
- **Rule 4**: Processing time must be >0 and <3600 seconds (1 hour)
- **Rule 5**: YouTube ID must be unique in the database
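
Rules 1-3 as a single guard function, as a sketch; the exception type and the segment shape are assumptions.

```python
MAX_FILE_BYTES = 500 * 1024 * 1024  # Rule 1: ≤500MB
MIN_DURATION_S = 0.1                # Rule 2: longer than 0.1 seconds

def validate_for_processing(file_bytes: int, duration_s: float,
                            segments: list[dict] | None = None) -> None:
    if file_bytes > MAX_FILE_BYTES:
        raise ValueError("File too large, max 500MB")
    if duration_s <= MIN_DURATION_S:
        raise ValueError("File too short or silent")
    # Rule 3 applies post-transcription: at least one segment required.
    if segments is not None and not segments:
        raise ValueError("Transcript contains no segments")
```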

## 🧪 Testing Requirements

### Unit Tests

- [ ] `test_youtube_metadata_extractor`: Extract metadata using curl and regex patterns
- [ ] `test_media_downloader`: Download from various sources
- [ ] `test_audio_preprocessor`: Convert to 16kHz mono WAV
- [ ] `test_whisper_service`: Basic transcription functionality
- [ ] `test_enhancement_service`: AI enhancement with DeepSeek
- [ ] `test_batch_processor`: Parallel file processing with error tracking

### Integration Tests

- [ ] `test_pipeline_v1`: End-to-end v1 transcription
- [ ] `test_pipeline_v2`: End-to-end v2 with enhancement
- [ ] `test_batch_processing`: Process 10 files in parallel
- [ ] `test_database_operations`: PostgreSQL CRUD operations
- [ ] `test_export_formats`: JSON and TXT export functionality

### Edge Cases

- [ ] Silent audio file: Detect and report appropriately
- [ ] Corrupted media file: Handle gracefully with a clear error
- [ ] Network interruption during download: Retry automatically
- [ ] Long file (>10 minutes): Chunk automatically
- [ ] Memory pressure: Degrade gracefully within resource limits
- [ ] Poor audio quality: Warn the user about accuracy expectations

## 🚀 Implementation Phases

### Phase 1: Core Foundation (Weeks 1-2)

**Goal**: Basic transcription working with CLI

- [ ] PostgreSQL database setup with JSONB - Schema created and tested
- [ ] YouTube metadata extraction with curl - Extract title, channel, description, and duration using regex patterns
- [ ] Basic Whisper integration (v1) - 95% accuracy on test files
- [ ] Batch processing system - Handle 10+ files in parallel with error tracking
- [ ] CLI implementation with Click - All commands functional
- [ ] JSON/TXT export functionality - Both formats working

### Phase 2: Enhancement (Week 3)

**Goal**: AI enhancement working reliably

- [ ] DeepSeek integration - API calls working with retry logic
- [ ] Enhancement templates - Structured prompts for technical content
- [ ] Progress tracking - Real-time updates in the CLI
- [ ] Quality validation - Compare before/after accuracy
- [ ] Error recovery - Handle API failures gracefully

### Phase 3: Roadmap - Multi-Pass Accuracy (v3)

**Goal**: Multi-pass accuracy improvements

- [ ] Multi-pass implementation - 3 passes with different parameters
- [ ] Confidence scoring - Per-segment confidence metrics
- [ ] Segment merging - Best-segment selection algorithm
- [ ] Performance optimization - 3x speed improvement over v1
- [ ] Memory management - Handle large files efficiently

### Phase 4: Roadmap - Speaker Diarization (v4)

**Goal**: Speaker diarization and scaling

- [ ] Speaker diarization - 90% speaker identification accuracy
- [ ] Voice embedding database - Speaker profile storage
- [ ] Caching layer - 50% cost reduction through caching
- [ ] API endpoints - REST API for integration
- [ ] Production deployment - Monitoring and logging

## 🔒 Security & Constraints

### Security Requirements

- **API Key Management**: Secure storage of Whisper and DeepSeek API keys
- **File Access**: Local file system access only
- **Data Protection**: Encrypted storage for sensitive transcripts
- **Input Sanitization**: Validate all file paths and URLs

### Performance Constraints

- **Response Time**: <30 seconds for 5-minute audio (v1)
- **Throughput**: Process 100+ files per batch
- **Memory Usage**: <8GB peak (M3 MacBook, 16GB RAM)
- **Database Queries**: <1 second for transcript retrieval
- **Parallel Workers**: 8 workers for optimal M3 performance (see the sketch below)
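
One way to hold the 8-worker ceiling, sketched with an `asyncio` semaphore; `transcribe_one` is a hypothetical per-file coroutine.

```python
import asyncio

MAX_WORKERS = 8  # tuned for M3 performance per the constraint above

async def process_batch(paths: list[str], transcribe_one) -> list:
    # Bound concurrency so peak memory stays under the 8GB budget.
    sem = asyncio.Semaphore(MAX_WORKERS)

    async def guarded(path: str):
        async with sem:
            return await transcribe_one(path)

    return await asyncio.gather(*(guarded(p) for p in paths),
                                return_exceptions=True)
```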

### Technical Constraints

- **File Formats**: mp3, mp4, wav, m4a, webm only
- **File Size**: Maximum 500MB per file
- **Audio Duration**: Maximum 2 hours per file
- **Network**: Download-first; no streaming processing
- **Storage**: Local storage required; no cloud-only processing
- **YouTube**: curl-based metadata extraction only

## ✅ Definition of Done

### Feature Complete

- [ ] All acceptance criteria met with real test files
- [ ] Unit tests passing with >80% coverage
- [ ] Integration tests passing against actual services
- [ ] Code review completed
- [ ] Documentation updated in rule files
- [ ] Performance benchmarks met

### Ready for Deployment

- [ ] Performance targets achieved (speed, accuracy, memory)
- [ ] Security review completed
- [ ] Error handling tested with edge cases
- [ ] User acceptance testing with real files
- [ ] Rollback plan prepared for each version
- [ ] Monitoring and logging configured

### Trax-Specific Criteria

- [ ] Follows the protocol-based architecture
- [ ] Uses the download-first approach (no streaming)
- [ ] Implements proper error handling with actionable messages
- [ ] Maintains backward compatibility across versions
- [ ] Uses real files in tests (no mocks)
- [ ] Follows established rule files and patterns
- [ ] Handles tech podcast and academic lecture content effectively

---

*This PRD is specifically designed for a personal research tool focused on tech podcasts, academic lectures, and audiobooks, with a clear v1-v2 implementation and a v3-v4 roadmap.*