Trax v1-v2 PRD: Personal Research Transcription Tool

🎯 Product Vision

"We're building a personal transcription tool that enables researchers to batch-process tech podcasts, academic lectures, and audiobooks by downloading media locally and running high-accuracy transcription, resulting in searchable, structured text content for study and research."

🏗️ System Architecture Overview

Core Components

  • Data Layer: PostgreSQL with JSONB, SQLAlchemy registry pattern
  • Business Logic: Protocol-based services, async/await throughout
  • Interface Layer: CLI-first with Click, batch processing focus
  • Integration Layer: Download-first architecture, curl-based YouTube metadata

System Boundaries

  • What's In Scope: Local media processing, YouTube metadata extraction, batch transcription, JSON/TXT export
  • What's Out of Scope: Real-time streaming, web UI, multi-user support, cloud processing
  • Integration Points: Whisper API, DeepSeek API, FFmpeg, PostgreSQL, YouTube (curl)

👥 User Profile

Primary User: Personal Researcher

  • Role: Individual researcher processing educational content
  • Content Types: Tech podcasts, academic lectures, audiobooks
  • Workflow: Batch URL collection → Download → Transcribe → Study
  • Goals: High accuracy transcripts, fast processing, searchable content
  • Constraints: Local storage, API costs, processing time

🔧 Functional Requirements

Feature 1: YouTube URL Processing

Purpose

Extract metadata from YouTube URLs using curl to avoid API complexity

User Stories

  • As a researcher, I want to provide YouTube URLs, so that I can get video metadata and download links
  • As a researcher, I want to batch process multiple URLs, so that I can queue up content for transcription

Acceptance Criteria

  • Given a YouTube URL, When I run trax youtube <url>, Then I get title, channel, description, and duration
  • Given a list of URLs, When I run trax batch-urls <file>, Then all metadata is extracted and stored
  • Given invalid URLs, When I process them, Then clear error messages are shown

Input Validation Rules

  • URL format: Must be valid YouTube URL - Error: "Invalid YouTube URL"
  • URL accessibility: Must be publicly accessible - Error: "Video not accessible"
  • Rate limiting: Max 10 URLs per minute - Error: "Rate limit exceeded"

Business Logic Rules

  • Rule 1: Use curl with a browser user-agent to avoid blocking
  • Rule 2: Extract metadata using regex patterns targeting the ytInitialPlayerResponse and ytInitialData objects (see the sketch below)
  • Rule 3: Store metadata in PostgreSQL for future reference
  • Rule 4: Generate unique filenames based on video ID and title
  • Rule 5: Handle escaped characters in titles and descriptions using Perl-compatible (PCRE) regex patterns
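
The sketch below illustrates Rules 1-2. It assumes a plain subprocess call to curl and a deliberately simplified regex; the function names are illustrative, and a production parser would balance braces rather than stop at the first "};".

import json
import re
import subprocess

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"

def fetch_watch_page(url: str) -> str:
    """Rule 1: fetch the watch-page HTML with curl, spoofing a browser user-agent."""
    result = subprocess.run(
        ["curl", "-sL", "-A", USER_AGENT, url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def extract_metadata(html: str) -> dict:
    """Rule 2: pull title, channel, description, and duration from ytInitialPlayerResponse."""
    # Simplified: stops at the first "};"; a robust parser would balance braces instead.
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;", html, re.DOTALL)
    if not match:
        raise ValueError("ytInitialPlayerResponse not found - page layout may have changed")
    details = json.loads(match.group(1)).get("videoDetails", {})
    return {
        "youtube_id": details.get("videoId"),
        "title": details.get("title"),
        "channel": details.get("author"),
        "description": details.get("shortDescription"),
        "duration_seconds": int(details.get("lengthSeconds", 0)),
    }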

Error Handling

  • Network Error: Retry up to 3 times with exponential backoff (see the sketch below)
  • Invalid URL: Skip and continue with remaining URLs
  • Rate Limited: Wait 60 seconds before retrying
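
A minimal sketch of the retry policy, assuming plain asyncio and illustrative exception types:

import asyncio
import random

async def with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a coroutine up to max_attempts times with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except (ConnectionError, TimeoutError):  # illustrative network errors
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)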

Feature 2: Local Media Transcription (v1)

Purpose

High-accuracy transcription of downloaded media files using Whisper

User Stories

  • As a researcher, I want to transcribe downloaded media, so that I can study the content
  • As a researcher, I want batch processing, so that I can process multiple files efficiently

Acceptance Criteria

  • Given a downloaded media file, When I run trax transcribe <file>, Then I get a 95%+ accuracy transcript in <30 seconds
  • Given a folder of media files, When I run trax batch <folder>, Then all files are processed with progress tracking
  • Given poor audio quality, When I transcribe, Then I get a quality warning with an accuracy estimate

Input Validation Rules

  • File format: mp3, mp4, wav, m4a, webm - Error: "Unsupported format"
  • File size: ≤500MB - Error: "File too large, max 500MB"
  • Audio duration: >0.1 seconds - Error: "File too short or silent"

Business Logic Rules

  • Rule 1: Always download media before processing (no streaming)
  • Rule 2: Convert audio to 16kHz mono WAV for Whisper
  • Rule 3: Use the distil-large-v3 model with M3 optimizations (see the sketch below)
  • Rule 4: Store results in PostgreSQL with JSONB for transcripts
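
A minimal sketch of Rules 2-3, assuming FFmpeg on the PATH and the faster-whisper backend for distil-large-v3; the backend choice and the CPU/int8 settings are assumptions, not decisions made by this PRD.

import subprocess

from faster_whisper import WhisperModel  # assumed backend, not mandated here

def preprocess(src: str, dst: str = "audio_16k.wav") -> str:
    """Rule 2: convert any supported input to 16 kHz mono WAV via FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst],
        check=True, capture_output=True,
    )
    return dst

def transcribe(path: str) -> list[dict]:
    """Rule 3: run distil-large-v3; CPU + int8 is a plausible M3 setting, not a benchmark result."""
    model = WhisperModel("distil-large-v3", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(preprocess(path))
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]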

Error Handling

  • Whisper Memory Error: Implement chunking for files >10 minutes (see the sketch below)
  • Audio Quality: Warn user if estimated accuracy <80%
  • Processing Failure: Save partial results, allow retry from last successful stage
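
One way to implement the chunking rule, assuming FFmpeg's segment muxer; the file and directory names are illustrative.

import subprocess
from pathlib import Path

def chunk_audio(src: str, out_dir: str = "chunks", chunk_seconds: int = 600) -> list[str]:
    """Split a long WAV into ~10-minute pieces so Whisper never holds the whole file in memory."""
    Path(out_dir).mkdir(exist_ok=True)
    pattern = str(Path(out_dir) / "chunk_%03d.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-f", "segment",
         "-segment_time", str(chunk_seconds), "-c", "copy", pattern],
        check=True, capture_output=True,
    )
    return sorted(str(p) for p in Path(out_dir).glob("chunk_*.wav"))

Each chunk is transcribed independently; segment timestamps from chunk N then need an offset of N × chunk_seconds before results are merged.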

Feature 3: AI Enhancement (v2)

Purpose

Improve transcript accuracy and readability using DeepSeek

User Stories

  • As a researcher, I want enhanced transcripts, so that technical terms and punctuation are correct
  • As a researcher, I want to compare original vs enhanced, so that I can verify improvements

Acceptance Criteria

  • Given a v1 transcript, When I run enhancement, Then accuracy improves to ≥99%
  • Given an enhanced transcript, When I compare to original, Then no content is lost
  • Given a failed enhancement, When I retry, Then the original transcript is preserved

Input Validation Rules

  • Transcript format: Must be valid JSON with segments - Error: "Invalid transcript format"
  • Enhancement model: Must be available (DeepSeek API key) - Error: "Enhancement service unavailable"

Business Logic Rules

  • Rule 1: Preserve timestamps and speaker markers during enhancement
  • Rule 2: Use structured enhancement prompts for technical content (see the sketch below)
  • Rule 3: Cache enhancement results for 7 days to reduce costs
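
A sketch of Rules 1-2, assuming DeepSeek's OpenAI-compatible chat endpoint and the openai client; the prompt wording and the deepseek-chat model name are placeholders, not settled choices.

import json
import os

from openai import OpenAI  # DeepSeek exposes an OpenAI-compatible API

ENHANCE_PROMPT = (
    "You are cleaning up an automated transcript of a technical podcast or lecture. "
    "Fix punctuation, casing, and misrecognized technical terms. Do not add, remove, "
    "or reorder content, and keep every segment's timestamps exactly as given."
)

def enhance_segments(segments: list[dict]) -> list[dict]:
    """Rules 1-2: structured prompt; timestamps are passed through and expected back unchanged."""
    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": ENHANCE_PROMPT},
            {"role": "user", "content": json.dumps(segments)},
        ],
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)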

Error Handling

  • API Rate Limit: Queue enhancement for later processing
  • Enhancement Failure: Return original transcript with error flag
  • Content Loss: Validate that enhancement preserves the original length within ±5% (see the sketch below)
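
A minimal content-loss check for the ±5% rule; character length is one possible metric, word count would work equally well.

def preserves_length(original: str, enhanced: str, tolerance: float = 0.05) -> bool:
    """Flag likely content loss when enhanced text drifts more than ±5% from the original length."""
    if not original:
        return not enhanced
    drift = abs(len(enhanced) - len(original)) / len(original)
    return drift <= tolerance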

🖥️ User Interface Flows

Flow 1: YouTube URL Processing

Screen 1: URL Input

  • Purpose: User provides YouTube URLs for processing
  • Elements:
    • URL input: Text input - Single URL or file path
    • Batch option: Flag - --batch for multiple URLs
    • Output format: Flag - --json, --txt for metadata export
  • Actions:
    • Enter: Process URL → Metadata display
    • Ctrl+C: Cancel operation → Return to prompt
  • Validation: URL format, accessibility, rate limits
  • Error States: "Invalid URL", "Video not accessible", "Rate limit exceeded"

Screen 2: Metadata Display

  • Purpose: Show extracted metadata and download options
  • Elements:
    • Video info: Text display - Title, channel, duration
    • Download option: Flag - --download to save media
    • Queue option: Flag - --queue for batch processing
  • Actions:
    • Download: Save media file → Download progress
    • Queue: Add to batch queue → Queue confirmation
    • Next: Process another URL → Return to input

Flow 2: Batch Transcription

Screen 1: Batch Command Input

  • Purpose: User initiates batch transcription
  • Elements:
    • Directory path: Text input - Folder containing media files
    • Pipeline version: Flag - --v1, --v2 (default: v1)
    • Parallel workers: Flag - --workers (default: 8 for M3 MacBook)
    • Quality threshold: Flag - --min-accuracy (default: 80%)
  • Actions:
    • Enter: Start batch processing → Progress tracking
    • Preview: Show file list → File list display
  • Validation: Directory exists, contains supported files
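
A Click sketch of this screen; the --pipeline option is one way to spell the --v1/--v2 choice, the defaults mirror the flags above, and the command body is a stub.

import click

@click.group()
def cli() -> None:
    """Trax: personal research transcription tool."""

@cli.command()
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
@click.option("--pipeline", type=click.Choice(["v1", "v2"]), default="v1", show_default=True,
              help="Transcription pipeline version.")
@click.option("--workers", default=8, show_default=True,
              help="Parallel workers (8 suits a 16GB M3 MacBook).")
@click.option("--min-accuracy", default=0.8, show_default=True,
              help="Warn when estimated accuracy falls below this threshold.")
def batch(directory: str, pipeline: str, workers: int, min_accuracy: float) -> None:
    """Transcribe every supported media file in DIRECTORY."""
    click.echo(f"Processing {directory} with {workers} workers ({pipeline}, min accuracy {min_accuracy})")

if __name__ == "__main__":
    cli()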

Screen 2: Batch Progress

  • Purpose: Show real-time processing status
  • Elements:
    • Overall progress: Progress bar - Total batch completion
    • Current file: Text display - Currently processing file
    • Quality metrics: Text display - Accuracy estimates
    • Queue status: Text display - Files remaining, completed, failed
  • Actions:
    • Continue: Automatic progression → Results summary
    • Pause: Suspend processing → Resume option
  • Validation: Updates every 5 seconds, shows quality warnings

Screen 3: Results Summary

  • Purpose: Show batch processing results
  • Elements:
    • Success count: Text display - Files processed successfully
    • Failure count: Text display - Files that failed
    • Quality report: Text display - Average accuracy, warnings
    • Export options: Buttons - JSON, TXT, SRT formats
  • Actions:
    • Export: Save all transcripts → Export confirmation
    • Retry failed: Re-process failed files → Retry progress
    • New batch: Start over → Return to input
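
The SRT export option listed above could be rendered with a helper like the following sketch (standard SRT block format; names are illustrative).

def to_srt(segments: list[dict]) -> str:
    """Render transcript segments as SRT blocks: index, HH:MM:SS,mmm --> HH:MM:SS,mmm, text."""
    def stamp(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = [
        f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text'].strip()}\n"
        for i, seg in enumerate(segments, start=1)
    ]
    return "\n".join(blocks)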

🔄 Data Flow & State Management

Data Models (PostgreSQL Schema)

YouTubeVideo

{
  "id": "UUID (required, primary key)",
  "youtube_id": "string (required, unique)",
  "title": "string (required)",
  "channel": "string (required)",
  "description": "text (optional)",
  "duration_seconds": "integer (required)",
  "url": "string (required)",
  "metadata_extracted_at": "timestamp (auto-generated)",
  "created_at": "timestamp (auto-generated)"
}

MediaFile

{
  "id": "UUID (required, primary key)",
  "youtube_video_id": "UUID (optional, foreign key)",
  "local_path": "string (required, file location)",
  "media_type": "string (required, mp3, mp4, wav, etc.)",
  "duration_seconds": "integer (optional)",
  "file_size_bytes": "bigint (required)",
  "download_status": "enum (pending, downloading, completed, failed)",
  "created_at": "timestamp (auto-generated)",
  "updated_at": "timestamp (auto-updated)"
}

Transcript

{
  "id": "UUID (required, primary key)",
  "media_file_id": "UUID (required, foreign key)",
  "pipeline_version": "string (required, v1, v2)",
  "raw_content": "JSONB (required, Whisper output)",
  "enhanced_content": "JSONB (optional, AI enhanced)",
  "text_content": "text (required, plain text for search)",
  "model_used": "string (required, whisper model version)",
  "processing_time_ms": "integer (required)",
  "word_count": "integer (required)",
  "accuracy_estimate": "float (optional, 0.0-1.0)",
  "quality_warnings": "string array (optional)",
  "processing_metadata": "JSONB (optional, version-specific data)",
  "created_at": "timestamp (auto-generated)",
  "enhanced_at": "timestamp (optional)",
  "updated_at": "timestamp (auto-updated)"
}
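
The Transcript schema above might map onto the SQLAlchemy registry pattern roughly as follows; this is a sketch trimmed to representative columns, and the table and class names are assumptions.

import uuid
from datetime import datetime
from typing import Optional

from sqlalchemy import ForeignKey, String, Text
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column, registry

mapper_registry = registry()

@mapper_registry.mapped
class Transcript:
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    media_file_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("media_files.id"))
    pipeline_version: Mapped[str] = mapped_column(String(8))
    raw_content: Mapped[dict] = mapped_column(JSONB)
    enhanced_content: Mapped[Optional[dict]] = mapped_column(JSONB, nullable=True)
    text_content: Mapped[str] = mapped_column(Text)
    accuracy_estimate: Mapped[Optional[float]] = mapped_column(nullable=True)
    created_at: Mapped[datetime] = mapped_column(default=datetime.utcnow)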

State Transitions

YouTubeVideo State Machine

[url_provided] → [metadata_extracting] → [metadata_complete]
[url_provided] → [metadata_extracting] → [metadata_failed]

MediaFile State Machine

[pending] → [downloading] → [completed]
[pending] → [downloading] → [failed] → [retry] → [downloading]

Transcript State Machine

[processing] → [completed]
[processing] → [failed] → [retry] → [processing]
[completed] → [enhancing] → [enhanced]
[enhanced] → [final]
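
One way to enforce these transitions in code, collapsing the intermediate [retry] step into failed → processing; a sketch, not a settled design.

from enum import Enum

class TranscriptState(str, Enum):
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    ENHANCING = "enhancing"
    ENHANCED = "enhanced"
    FINAL = "final"

# Allowed transitions, mirroring the Transcript state machine above.
TRANSITIONS: dict[TranscriptState, set[TranscriptState]] = {
    TranscriptState.PROCESSING: {TranscriptState.COMPLETED, TranscriptState.FAILED},
    TranscriptState.FAILED: {TranscriptState.PROCESSING},  # retry re-enters processing
    TranscriptState.COMPLETED: {TranscriptState.ENHANCING},
    TranscriptState.ENHANCING: {TranscriptState.ENHANCED},
    TranscriptState.ENHANCED: {TranscriptState.FINAL},
    TranscriptState.FINAL: set(),
}

def can_transition(current: TranscriptState, new: TranscriptState) -> bool:
    """Reject any state change the diagrams above do not allow."""
    return new in TRANSITIONS[current]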

Data Validation Rules

  • Rule 1: File size must be ≤500MB for processing
  • Rule 2: Audio duration must be >0.1 seconds (not silent)
  • Rule 3: Transcript must contain at least one segment
  • Rule 4: Processing time must be >0 and <3600 seconds (1 hour)
  • Rule 5: YouTube ID must be unique in database

🧪 Testing Requirements

Unit Tests

  • test_youtube_metadata_extractor: Extract metadata using curl and regex patterns
  • test_media_downloader: Download from various sources
  • test_audio_preprocessor: Convert to 16kHz mono WAV
  • test_whisper_service: Basic transcription functionality
  • test_enhancement_service: AI enhancement with DeepSeek
  • test_batch_processor: Parallel file processing with error tracking

Integration Tests

  • test_pipeline_v1: End-to-end v1 transcription
  • test_pipeline_v2: End-to-end v2 with enhancement
  • test_batch_processing: Process 10 files in parallel
  • test_database_operations: PostgreSQL CRUD operations
  • test_export_formats: JSON and TXT export functionality
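
An integration test for test_pipeline_v1 might look like the sketch below; the pytest-asyncio plugin, the trax.pipeline module path, the transcribe_file coroutine, and the fixture path are all assumptions. Per the Trax-specific criteria, it runs against a real media file rather than mocks.

import pytest

SAMPLE = "tests/fixtures/podcast_5min.mp3"  # hypothetical real-file fixture

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_pipeline_v1(tmp_path):
    from trax.pipeline import transcribe_file  # hypothetical module path

    transcript = await transcribe_file(SAMPLE, output_dir=tmp_path)

    assert transcript.word_count > 0
    assert transcript.pipeline_version == "v1"
    assert (tmp_path / "podcast_5min.json").exists()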

Edge Cases

  • Silent audio file: Should detect and report appropriately
  • Corrupted media file: Should handle gracefully with clear error
  • Network interruption during download: Should retry automatically
  • Large file (>10 minutes): Should chunk automatically
  • Memory pressure: Should handle gracefully with resource limits
  • Poor audio quality: Should warn user about accuracy expectations

🚀 Implementation Phases

Phase 1: Core Foundation (Weeks 1-2)

Goal: Basic transcription working with CLI

  • PostgreSQL database setup with JSONB - Schema created and tested
  • YouTube metadata extraction with curl - Extract title, channel, description, duration using regex patterns
  • Basic Whisper integration (v1) - 95% accuracy on test files
  • Batch processing system - Handle 10+ files in parallel with error tracking
  • CLI implementation with Click - All commands functional
  • JSON/TXT export functionality - Both formats working

Phase 2: Enhancement (Week 3)

Goal: AI enhancement working reliably

  • DeepSeek integration - API calls working with retry logic
  • Enhancement templates - Structured prompts for technical content
  • Progress tracking - Real-time updates in CLI
  • Quality validation - Compare before/after accuracy
  • Error recovery - Handle API failures gracefully

Phase 3: Roadmap - Multi-Pass Accuracy (v3)

Goal: Multi-pass accuracy improvements

  • Multi-pass implementation - 3 passes with different parameters
  • Confidence scoring - Per-segment confidence metrics
  • Segment merging - Best segment selection algorithm
  • Performance optimization - 3x speed improvement over v1
  • Memory management - Handle large files efficiently

Phase 4: Roadmap - Speaker Diarization (v4)

Goal: Speaker diarization and scaling

  • Speaker diarization - 90% speaker identification accuracy
  • Voice embedding database - Speaker profile storage
  • Caching layer - 50% cost reduction through caching
  • API endpoints - REST API for integration
  • Production deployment - Monitoring and logging

🔒 Security & Constraints

Security Requirements

  • API Key Management: Secure storage of Whisper and DeepSeek API keys
  • File Access: Local file system access only
  • Data Protection: Encrypted storage for sensitive transcripts
  • Input Sanitization: Validate all file paths and URLs

Performance Constraints

  • Response Time: <30 seconds for 5-minute audio (v1)
  • Throughput: Process 100+ files in batch
  • Memory Usage: <8GB peak memory usage (M3 MacBook 16GB)
  • Database Queries: <1 second for transcript retrieval
  • Parallel Workers: 8 workers for optimal M3 performance
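
The worker and memory constraints could be enforced with an asyncio semaphore, as in the sketch below; transcribe_one is a placeholder for the real per-file pipeline (download → preprocess → Whisper).

import asyncio
from pathlib import Path

async def transcribe_one(path: Path) -> dict:
    """Placeholder for the real per-file pipeline."""
    await asyncio.sleep(0)  # stand-in for actual work
    return {"file": str(path), "status": "completed"}

async def process_batch(files: list[Path], workers: int = 8) -> list:
    """Bound concurrency with a semaphore so eight parallel jobs stay inside the 8GB ceiling."""
    semaphore = asyncio.Semaphore(workers)

    async def run_one(path: Path) -> dict:
        async with semaphore:
            return await transcribe_one(path)

    # return_exceptions=True keeps one failed file from aborting the whole batch
    return await asyncio.gather(*(run_one(f) for f in files), return_exceptions=True)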

Technical Constraints

  • File Formats: mp3, mp4, wav, m4a, webm only
  • File Size: Maximum 500MB per file
  • Audio Duration: Maximum 2 hours per file
  • Network: Download-first, no streaming processing
  • Storage: Local storage required, no cloud-only processing
  • YouTube: Curl-based metadata extraction only

Definition of Done

Feature Complete

  • All acceptance criteria met with real test files
  • Unit tests passing with >80% coverage
  • Integration tests passing with actual services
  • Code review completed
  • Documentation updated in rule files
  • Performance benchmarks met

Ready for Deployment

  • Performance targets achieved (speed, accuracy, memory)
  • Security review completed
  • Error handling tested with edge cases
  • User acceptance testing with real files
  • Rollback plan prepared for each version
  • Monitoring and logging configured

Trax-Specific Criteria

  • Follows protocol-based architecture
  • Uses download-first approach (no streaming)
  • Implements proper error handling with actionable messages
  • Maintains backward compatibility across versions
  • Uses real files in tests (no mocks)
  • Follows established rule files and patterns
  • Handles tech podcast and academic lecture content effectively

This PRD is specifically designed for a personal research tool focused on tech podcasts, academic lectures, and audiobooks, with clear v1-v2 implementation and v3-v4 roadmap.