# Story 4.1: Dual Transcript Options (YouTube + Whisper)

## Story Overview

**As a** user processing YouTube videos
**I want** to choose between YouTube captions and AI Whisper transcription
**So that** I can control the trade-off between speed/cost and transcription accuracy

**Story ID**: 4.1
**Epic**: Epic 4 - Advanced Intelligence & Developer Platform
**Priority**: High 🔥
**Effort Estimate**: 22 hours
**Sprint**: Epic 4, Sprint 1

## Business Value

### Problem Statement

The current YouTube Summarizer relies solely on YouTube's captions, which often have quality issues:

- Poor punctuation and capitalization
- Incorrect technical terms and proper nouns
- Missing speaker identification
- Timing inaccuracies with rapid speech
- Complete unavailability for some videos

Users need the option to choose higher-accuracy AI transcription when quality matters more than speed.

### Solution Value

- **User Control**: Choose the optimal transcript source for each use case
- **Quality Flexibility**: Fast YouTube captions vs. accurate AI transcription
- **Fallback Reliability**: Automatic Whisper backup when YouTube captions are unavailable
- **Transparency**: Clear cost/time implications for informed decisions
- **Leveraged Assets**: Reuse the proven TranscriptionService from archived projects

### Success Metrics

- [ ] 80%+ of users understand the transcript option differences
- [ ] 30%+ of users try the Whisper transcription option
- [ ] 25%+ improvement in transcript accuracy when using Whisper
- [ ] <5% user complaints about transcript quality
- [ ] Zero failed transcriptions due to unavailable YouTube captions

## Acceptance Criteria

### AC 1: Transcript Source Selection UI

**Given** I am processing a YouTube video
**When** I access the transcript options
**Then** I see three clear choices:

- 📺 **YouTube Captions** (Fast, Free, 2-5 seconds)
- 🎯 **AI Whisper** (Slower, Higher Quality, 30-120 seconds)
- 🔄 **Compare Both** (Best Quality, Shows differences)

**And** each option shows its estimated processing time and quality level

### AC 2: YouTube Transcript Processing (Default)

**Given** I select the "YouTube Captions" option
**When** I submit a video URL
**Then** the system extracts the transcript using the existing YouTube API methods
**And** processing completes in under 5 seconds
**And** the transcript source is marked as "youtube" in the database
**And** a quality score is calculated based on caption availability

### AC 3: Whisper Transcript Processing

**Given** I select the "AI Whisper" option
**When** I submit a video URL
**Then** the system downloads the video audio using the existing VideoDownloadService
**And** processes the audio through the integrated TranscriptionService
**And** returns a high-quality transcript with timestamps
**And** the transcript source is marked as "whisper" in the database
**And** the processing time is clearly communicated to the user

### AC 4: Dual Transcript Comparison

**Given** I select the "Compare Both" option
**When** processing completes
**Then** I see a side-by-side transcript comparison
**And** differences are highlighted (word accuracy, punctuation, technical terms)
**And** quality metrics are shown for each transcript
**And** I can switch between transcripts for summary generation

### AC 5: Automatic Fallback

**Given** YouTube captions are unavailable for a video
**When** I select the "YouTube Captions" option
**Then** the system automatically falls back to Whisper transcription
**And** the user is notified of the fallback with a processing time estimate
**And** the final result shows "whisper" as the source method
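The AC 5 fallback behavior is simple enough to pin down in code. A minimal sketch, assuming the two extraction callables and a notifier are injected; every name here is illustrative, not the final service API:

```python
# Sketch of the AC 5 automatic fallback; all names are illustrative only.
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class TranscriptResult:
    text: str
    source: str  # 'youtube' or 'whisper'


class CaptionsUnavailableError(Exception):
    """Raised when a video has no YouTube captions."""


async def extract_with_fallback(
    video_id: str,
    extract_youtube: Callable[[str], Awaitable[TranscriptResult]],
    extract_whisper: Callable[[str], Awaitable[TranscriptResult]],
    notify: Callable[[str], Awaitable[None]],
) -> TranscriptResult:
    """Try YouTube captions first; fall back to Whisper when they are missing."""
    try:
        return await extract_youtube(video_id)  # fast path: 2-5 seconds
    except CaptionsUnavailableError:
        # AC 5: tell the user about the fallback and the longer estimate.
        await notify("YouTube captions unavailable - falling back to AI Whisper (est. 30-120s).")
        result = await extract_whisper(video_id)
        result.source = "whisper"  # final result reports the actual method used
        return result
```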
### AC 6: Quality and Cost Transparency

**Given** I'm choosing transcript options
**When** I view the selection interface
**Then** I see clear indicators:

- Processing time estimates (YouTube: 2-5s, Whisper: 30-120s)
- Quality indicators (YouTube: "Standard", Whisper: "High Accuracy")
- Availability status (YouTube: "May not be available", Whisper: "Always available")
- Cost implications (YouTube: "Free", Whisper: "Uses compute resources")

## Technical Implementation

### Architecture Components

#### 1. Real Whisper Service Integration

```python
# File: backend/services/whisper_transcript_service.py
from pathlib import Path
from typing import Any, Dict

# TranscriptionService is copied from the archived project (see Task 1.1).


class WhisperTranscriptService:
    """Real Whisper service replacing MockWhisperService."""

    def __init__(self, model_size: str = "small"):
        # Copy proven TranscriptionService from archived project
        self.transcription_service = TranscriptionService(
            repository=None,  # YouTube context doesn't need episode storage
            model_size=model_size,
            device="auto",
        )

    async def transcribe_audio(self, audio_path: Path) -> Dict[str, Any]:
        """Use proven Whisper implementation from archived project."""
        segments = await self.transcription_service._transcribe_audio_file(
            str(audio_path)
        )
        return self._convert_to_api_format(segments)
```

#### 2. Enhanced Transcript Service

```python
# File: backend/services/enhanced_transcript_service.py
import asyncio
from typing import Dict


class DualTranscriptService(EnhancedTranscriptService):
    """Enhanced service with real Whisper integration."""

    async def extract_dual_transcripts(
        self, video_id: str
    ) -> Dict[str, TranscriptResult]:
        """Extract both YouTube and Whisper transcripts."""
        # YouTube transcript (parallel processing)
        youtube_task = asyncio.create_task(
            self._extract_youtube_transcript(video_id)
        )
        # Whisper transcript (parallel processing)
        whisper_task = asyncio.create_task(
            self._extract_whisper_transcript(video_id)
        )

        youtube_result, whisper_result = await asyncio.gather(
            youtube_task, whisper_task, return_exceptions=True
        )

        return {
            'youtube': youtube_result
            if not isinstance(youtube_result, Exception) else None,
            'whisper': whisper_result
            if not isinstance(whisper_result, Exception) else None,
        }
```
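The API section later in this story passes the dual results through an `analyze_transcript_differences` helper that is never shown. A minimal sketch of the word-level comparison it implies, simplified here to two plain strings and built only on the standard library's `difflib`; the metric names are assumptions, not a fixed response schema:

```python
# Hypothetical comparison helper; metric names and shapes are illustrative.
import difflib
from typing import Any, Dict


def analyze_transcript_differences(youtube_text: str, whisper_text: str) -> Dict[str, Any]:
    """Summarize how two transcripts of the same video differ, word by word."""
    yt_words = youtube_text.lower().split()
    wh_words = whisper_text.lower().split()

    matcher = difflib.SequenceMatcher(a=yt_words, b=wh_words)
    diffs = [
        {"op": op, "youtube": yt_words[i1:i2], "whisper": wh_words[j1:j2]}
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"  # keep only replaced/inserted/deleted word runs
    ]
    return {
        "word_similarity": round(matcher.ratio(), 3),  # 0.0-1.0 agreement score
        "diff_count": len(diffs),
        "diffs": diffs,  # drives the side-by-side highlighting in AC 4
    }
```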
#### 3. Frontend Transcript Selector

```tsx
// File: frontend/src/components/TranscriptSelector.tsx
type TranscriptSource = 'youtube' | 'whisper' | 'both'; // assumed union type

interface TranscriptSelectorProps {
  onSourceChange: (source: TranscriptSource) => void;
  estimatedDuration?: number;
}

const TranscriptSelector = ({ onSourceChange, estimatedDuration }: TranscriptSelectorProps) => {
  return (
    <fieldset>
      <legend>Transcript Source</legend>
      {/* One option per AC 1; the production version uses a Radix UI RadioGroup (Task 4.1) */}
      <label>
        <input type="radio" name="source" onChange={() => onSourceChange('youtube')} />
        📺 YouTube Captions (Fast, Free, 2-5 seconds)
      </label>
      <label>
        <input type="radio" name="source" onChange={() => onSourceChange('whisper')} />
        🎯 AI Whisper (Slower, Higher Quality, 30-120 seconds)
      </label>
      <label>
        <input type="radio" name="source" onChange={() => onSourceChange('both')} />
        🔄 Compare Both (Best Quality, Shows differences)
      </label>
    </fieldset>
  );
};
```

#### 4. Transcript Comparison UI

```tsx
// File: frontend/src/components/TranscriptComparison.tsx
const TranscriptComparison = ({ youtubeTranscript, whisperTranscript }) => {
  return (
    <section>
      <h3>Transcript Quality Comparison</h3>
      {/* Side-by-side panels; word-level differences are highlighted (Task 4.3) */}
      <div className="comparison-grid">
        <article>
          <h4>📺 YouTube Captions</h4>
          <p>{youtubeTranscript}</p>
        </article>
        <article>
          <h4>🎯 AI Whisper</h4>
          <p>{whisperTranscript}</p>
        </article>
      </div>
    </section>
  );
};
```

### Database Schema Changes

```sql
-- Extend summaries table for transcript metadata
ALTER TABLE summaries
ADD COLUMN transcript_source VARCHAR(20),  -- 'youtube', 'whisper', 'both'
ADD COLUMN transcript_quality_score FLOAT,
ADD COLUMN youtube_transcript TEXT,
ADD COLUMN whisper_transcript TEXT,
ADD COLUMN whisper_processing_time FLOAT,
ADD COLUMN transcript_comparison_data JSON;

-- Create index for transcript source queries
CREATE INDEX idx_summaries_transcript_source ON summaries(transcript_source);
```
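Task 3.1 (below) updates the Summary model to match this schema. A sketch in SQLAlchemy terms; the declarative base and the `id` column stand in for the project's existing model definition:

```python
# Hypothetical SQLAlchemy mapping of the new columns; Base and id stand in
# for the project's existing declarative setup.
from sqlalchemy import JSON, Column, Float, Index, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()  # stands in for the project's existing Base


class Summary(Base):
    """Existing model; only the key and the new transcript columns are shown."""
    __tablename__ = "summaries"

    id = Column(Integer, primary_key=True)  # existing key, shown for completeness

    transcript_source = Column(String(20))      # 'youtube', 'whisper', 'both'
    transcript_quality_score = Column(Float)
    youtube_transcript = Column(Text)
    whisper_transcript = Column(Text)
    whisper_processing_time = Column(Float)     # seconds spent in Whisper
    transcript_comparison_data = Column(JSON)   # diff payload for AC 4


# Mirrors idx_summaries_transcript_source from the SQL above.
Index("idx_summaries_transcript_source", Summary.transcript_source)
```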
### API Endpoint Changes

```python
# File: backend/api/transcript.py
@router.post("/api/transcripts/dual/{video_id}")
async def get_dual_transcripts(
    video_id: str,
    options: TranscriptOptionsRequest,
    background_tasks: BackgroundTasks,
    current_user: User = Depends(get_current_user)
):
    """Get transcripts with the specified source option."""
    if options.source == "both":
        results = await dual_transcript_service.extract_dual_transcripts(video_id)
        return TranscriptComparisonResponse(
            video_id=video_id,
            transcripts=results,
            quality_comparison=analyze_transcript_differences(results)
        )
    elif options.source == "whisper":
        transcript = await dual_transcript_service.extract_transcript(
            video_id, force_whisper=True
        )
        return TranscriptResponse(video_id=video_id, transcript=transcript)
    else:  # youtube (default)
        transcript = await dual_transcript_service.extract_transcript(
            video_id, force_whisper=False
        )
        return TranscriptResponse(video_id=video_id, transcript=transcript)
```

## Development Tasks

### Phase 1: Backend Foundation (8 hours)

#### Task 1.1: Copy and Adapt TranscriptionService (3 hours)

- [x] Copy TranscriptionService from `archived_projects/personal-ai-assistant-v1.1.0/src/services/transcription_service.py`
- [ ] Remove podcast-specific database dependencies (PodcastEpisode, PodcastTranscript, Repository)
- [ ] Adapt for the YouTube video context (transcribe_episode → transcribe_audio_file)
- [ ] Create WhisperTranscriptService wrapper class
- [ ] Add async compatibility with asyncio.run_in_executor() (sketched below)
- [ ] Test basic transcription functionality with sample audio
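Whisper inference is blocking, so the run_in_executor item above typically means pushing both model loading and transcription onto a thread pool. A minimal sketch, assuming the `openai-whisper` package (whose documented entry points are `whisper.load_model()` and `model.transcribe()`); the wrapper class itself is illustrative:

```python
# Sketch of wrapping blocking Whisper inference for an async service.
import asyncio
from pathlib import Path

import whisper  # openai-whisper package


class AsyncWhisperWrapper:
    def __init__(self, model_size: str = "small"):
        # Model loading is also blocking and slow; do it once and keep it.
        self._model = whisper.load_model(model_size)

    async def transcribe(self, audio_path: Path) -> dict:
        """Run blocking Whisper inference without stalling the event loop."""
        loop = asyncio.get_running_loop()
        # None -> default ThreadPoolExecutor; workable here because PyTorch
        # releases the GIL inside its heavy ops.
        return await loop.run_in_executor(
            None, lambda: self._model.transcribe(str(audio_path))
        )
```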
#### Task 1.2: Update Dependencies and Environment (2 hours)

- [ ] Add to backend/requirements.txt: openai-whisper, torch, librosa, pydub, soundfile
- [ ] Update Dockerfile with the ffmpeg system dependency
- [ ] Test Whisper model download (small model, 244M parameters)
- [ ] Configure environment variables (WHISPER_MODEL_SIZE, WHISPER_DEVICE)
- [ ] Verify CUDA detection works (if available)

#### Task 1.3: Replace MockWhisperService (3 hours)

- [ ] Update EnhancedTranscriptService to use the real WhisperTranscriptService
- [ ] Remove MockWhisperService references throughout the codebase
- [ ] Update dependency injection in main.py and service factories
- [ ] Test integration with existing VideoDownloadService audio extraction
- [ ] Verify async compatibility with the existing pipeline architecture

### Phase 2: API Enhancement (4 hours)

#### Task 2.1: Create DualTranscriptService (2 hours)

- [ ] Create DualTranscriptService class extending EnhancedTranscriptService
- [ ] Implement extract_dual_transcripts() with parallel YouTube/Whisper processing
- [ ] Add quality comparison algorithm (word-level diff, confidence scoring)
- [ ] Implement a separate caching strategy for YouTube vs. Whisper results (see the sketch after this phase)
- [ ] Test parallel processing performance and error handling

#### Task 2.2: Add New API Endpoints (2 hours)

- [ ] Create TranscriptOptionsRequest model (source, whisper_model, language, timestamps)
- [ ] Add /api/transcripts/dual/{video_id} endpoint for dual transcript processing
- [ ] Update the existing pipeline to accept a transcript source preference
- [ ] Add TranscriptComparisonResponse model for side-by-side results
- [ ] Test all endpoint variations (youtube, whisper, both)
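Because Whisper output is expensive to recompute while YouTube captions are cheap, Task 2.1's caching item implies different lifetimes per source (the Optimization Strategies section below also recommends a longer TTL for Whisper). A minimal sketch with an in-memory dict; the TTL values and key format are assumptions, and a real deployment would likely use Redis or similar:

```python
# Hypothetical per-source TTL cache; swap the dict for Redis in practice.
import time
from typing import Any, Dict, Optional, Tuple

# Whisper output is costly to regenerate, so it gets a much longer TTL.
TTL_SECONDS = {"youtube": 60 * 60, "whisper": 7 * 24 * 60 * 60}  # assumed values

_cache: Dict[str, Tuple[float, Any]] = {}


def cache_transcript(video_id: str, source: str, transcript: Any) -> None:
    _cache[f"{source}:{video_id}"] = (time.monotonic(), transcript)


def get_cached_transcript(video_id: str, source: str) -> Optional[Any]:
    entry = _cache.get(f"{source}:{video_id}")
    if entry is None:
        return None
    stored_at, transcript = entry
    if time.monotonic() - stored_at > TTL_SECONDS[source]:
        del _cache[f"{source}:{video_id}"]  # expired; force re-extraction
        return None
    return transcript
```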
### Phase 3: Database Schema (2 hours)

#### Task 3.1: Extend Summary Model (1 hour)

- [ ] Create Alembic migration for the new transcript fields
- [ ] Add columns: transcript_source, transcript_quality_score, youtube_transcript, whisper_transcript, whisper_processing_time, transcript_comparison_data
- [ ] Update the Summary model class with the new fields
- [ ] Update repository methods for transcript metadata storage

#### Task 3.2: Add Performance Indexes (1 hour)

- [ ] Create indexes on transcript_source, quality_score, processing_time
- [ ] Test query performance with sample data
- [ ] Verify the migration runs successfully on the development database

### Phase 4: Frontend Implementation (6 hours)

#### Task 4.1: Create TranscriptSelector Component (2 hours)

- [ ] Create TranscriptSelector.tsx with a Radix UI RadioGroup
- [ ] Add processing time estimation based on video duration
- [ ] Implement visual indicators (icons, quality badges, time estimates)
- [ ] Add proper TypeScript interfaces and accessibility labels
- [ ] Test responsive design and keyboard navigation

#### Task 4.2: Integrate with SummarizeForm (1 hour)

- [ ] Add transcript source state to the SummarizeForm component
- [ ] Update form submission to include transcript options
- [ ] Add form validation for transcript source selection
- [ ] Test integration with the existing admin page workflow

#### Task 4.3: Create TranscriptComparison Component (2 hours)

- [ ] Create TranscriptComparison.tsx for side-by-side display
- [ ] Implement a word-level difference highlighting algorithm
- [ ] Add quality metric badges and comparison controls
- [ ] Create transcript selection buttons (Use YouTube/Use Whisper)
- [ ] Test with real YouTube vs. Whisper transcript differences

#### Task 4.4: Update Processing UI (1 hour)

- [ ] Update ProgressTracker to show transcript source and method
- [ ] Add different progress messages for Whisper vs. YouTube processing
- [ ] Show estimated time remaining for Whisper transcription
- [ ] Add error handling for Whisper processing failures with fallback notifications

### Phase 5: Testing and Integration (2 hours)

#### Task 5.1: Unit Tests (1 hour)

- [ ] Backend: test_whisper_transcription_accuracy, test_dual_transcript_comparison, test_automatic_fallback
- [ ] Frontend: TranscriptSelector component tests, form integration tests
- [ ] API: dual transcript endpoint tests, transcript option validation tests
- [ ] Verify >80% test coverage for new code

#### Task 5.2: Integration Testing (1 hour)

- [ ] End-to-end test: YouTube vs. Whisper quality comparison with a real video
- [ ] Admin page testing: verify the transcript selector works without authentication
- [ ] Error scenarios: unavailable YouTube captions (fallback), Whisper processing failure
- [ ] Performance testing: benchmark Whisper processing times, verify cache effectiveness

## Testing Strategy

### Unit Tests

```python
# Test file: backend/tests/unit/test_whisper_transcript_service.py
def test_whisper_transcription_accuracy():
    """Test Whisper transcription produces expected quality"""

def test_dual_transcript_comparison():
    """Test quality comparison algorithm"""

def test_automatic_fallback():
    """Test fallback when YouTube captions unavailable"""
```

### Integration Tests

```python
# Test file: backend/tests/integration/test_dual_transcript_api.py
async def test_dual_transcript_endpoint():
    """Test dual transcript API with real video"""

async def test_transcript_source_selection():
    """Test different source options return expected results"""
```

### User Acceptance Testing

- [ ] Test transcript quality comparison with 5 different video types
- [ ] Verify processing time estimates are accurate within 20%
- [ ] Confirm automatic fallback works for videos without captions
- [ ] Validate that quality metrics correlate with perceived accuracy

## Dependencies and Prerequisites

### External Dependencies

- `openai-whisper`: Core Whisper transcription capability
- `torch`: PyTorch for Whisper model execution
- `librosa`: Audio processing and analysis
- `pydub`: Audio format conversion and manipulation
- `ffmpeg`: Audio/video processing (system dependency)

### Internal Dependencies

- Epic 3 complete: user authentication and session management
- VideoDownloadService: required for audio extraction
- EnhancedTranscriptService: base service for integration
- WebSocket infrastructure: for real-time progress updates

### Hardware Requirements

- **CPU**: Multi-core processor (Whisper is CPU-intensive)
- **Memory**: 4GB+ RAM for the "small" model, 8GB+ for "medium"/"large"
- **Storage**: Additional space for downloaded audio files
- **GPU**: Optional but recommended (CUDA support for faster processing)

## Performance Considerations

### Whisper Model Selection

- **tiny**: Fastest, lowest accuracy (39M parameters)
- **base**: Good balance (74M parameters)
- **small**: Recommended default (244M parameters) ⭐
- **medium**: Higher accuracy (769M parameters)
- **large**: Best accuracy (1550M parameters)

### Optimization Strategies

- Use the "small" model by default for a balance of speed and accuracy
- Implement model caching to avoid reloading
- Add GPU detection and automatic CUDA usage
- Chunk long videos to prevent memory issues (see the sketch below)
- Cache Whisper results aggressively (longer TTL than YouTube captions)
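Chunking keeps peak memory bounded on long videos: split the audio into fixed-length segments, transcribe them one at a time, and concatenate the text. A minimal sketch using `pydub` (already a dependency above) and the `openai-whisper` package; the 10-minute chunk size is an assumption to tune:

```python
# Hypothetical chunked transcription to bound memory on long videos.
from pathlib import Path

import whisper
from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks; tune for available RAM


def transcribe_in_chunks(audio_path: Path, model_size: str = "small") -> str:
    model = whisper.load_model(model_size)
    audio = AudioSegment.from_file(audio_path)  # pydub works in milliseconds

    texts = []
    for start in range(0, len(audio), CHUNK_MS):
        chunk = audio[start:start + CHUNK_MS]
        chunk_path = audio_path.with_suffix(f".chunk{start // CHUNK_MS}.wav")
        chunk.export(chunk_path, format="wav")  # Whisper reads from a file
        texts.append(model.transcribe(str(chunk_path))["text"])
        chunk_path.unlink()  # clean up the temporary chunk

    return " ".join(texts)
```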
## Risk Mitigation

### High Risk Items

1. **Processing Time**: Whisper can be slow for long videos
2. **Resource Usage**: High CPU/memory consumption
3. **Model Downloads**: Large model files on first run
4. **Audio Quality**: Poor audio affects Whisper accuracy

### Mitigation Strategies

1. **Time Estimates**: Clear user expectations and progress indicators
2. **Resource Monitoring**: Implement processing limits and queue management
3. **Model Management**: Pre-download models in the Docker image
4. **Quality Checks**: Audio preprocessing and noise reduction

## Success Criteria

### Definition of Done

- [ ] Users can select between YouTube/Whisper/Both transcript options
- [ ] Real Whisper transcription integrated from the archived codebase
- [ ] Processing time estimates accurate within 20%
- [ ] Quality comparison shows meaningful differences
- [ ] Automatic fallback works when YouTube captions are unavailable
- [ ] All tests pass with >80% code coverage
- [ ] Performance acceptable (Whisper <2 minutes for a 10-minute video)
- [ ] UI provides clear feedback during processing
- [ ] Database properly stores transcript metadata and quality scores

### Acceptance Testing Scenarios

1. **Standard Use Case**: Select Whisper for a technical video, confirm accuracy improvement
2. **Comparison Mode**: Use the "Compare Both" option, review side-by-side differences
3. **Fallback Scenario**: Process a video without YouTube captions, verify Whisper fallback
4. **Long Video**: Process a 30+ minute video, confirm chunking works properly
5. **Error Handling**: Test with corrupted audio, verify graceful error handling

## Post-Implementation Considerations

### Monitoring and Analytics

- Track transcript source usage patterns
- Monitor Whisper processing times and success rates
- Collect user feedback on transcript quality satisfaction
- Analyze the cost implications of increased Whisper usage

### Future Enhancements

- Custom Whisper model fine-tuning for specific domains
- Speaker identification integration
- Real-time transcription for live streams
- Multi-language Whisper support beyond English

---

**Story Owner**: Development Team
**Reviewer**: Technical Lead
**Epic Reference**: Epic 4 - Advanced Intelligence & Developer Platform
**Story Status**: Ready for Implementation
**Last Updated**: 2025-08-27