Story 4.1: Dual Transcript Options (YouTube + Whisper)
Story Overview
As a user processing YouTube videos
I want to choose between YouTube captions and AI Whisper transcription
So that I can control the trade-off between speed/cost and transcription accuracy
Story ID: 4.1
Epic: Epic 4 - Advanced Intelligence & Developer Platform
Priority: High 🔥
Effort Estimate: 22 hours
Sprint: Epic 4, Sprint 1
Business Value
Problem Statement
The current YouTube Summarizer relies solely on YouTube's captions, which often have quality issues:
- Poor punctuation and capitalization
- Incorrect technical terms and proper nouns
- Missing speaker identification
- Timing inaccuracies with rapid speech
- Complete unavailability for some videos
Users need the option to choose higher-accuracy AI transcription when quality matters more than speed.
Solution Value
- User Control: Choose optimal transcript source for each use case
- Quality Flexibility: Fast YouTube captions vs accurate AI transcription
- Fallback Reliability: Automatic Whisper backup when YouTube captions unavailable
- Transparency: Clear cost/time implications for informed decisions
- Leveraged Assets: Reuse proven TranscriptionService from archived projects
Success Metrics
- 80%+ of users understand transcript option differences
- 30%+ of users try Whisper transcription option
- 25%+ improvement in transcript accuracy using Whisper
- <5% user complaints about transcript quality
- Zero failed transcriptions due to unavailable YouTube captions
Acceptance Criteria
AC 1: Transcript Source Selection UI
Given I am processing a YouTube video
When I access the transcript options
Then I see three clear choices:
- 📺 YouTube Captions (Fast, Free, 2-5 seconds)
- 🎯 AI Whisper (Slower, Higher Quality, 30-120 seconds)
- 🔄 Compare Both (Best Quality, Shows differences)
And each option shows estimated processing time and quality level
AC 2: YouTube Transcript Processing (Default)
Given I select "YouTube Captions" option
When I submit a video URL
Then the system extracts the transcript using existing YouTube API methods
And processing completes in under 5 seconds
And the transcript source is marked as "youtube" in the database
And a quality score is calculated based on caption availability
AC 3: Whisper Transcript Processing
Given I select "AI Whisper" option
When I submit a video URL
Then the system downloads the video audio using the existing VideoDownloadService
And processes the audio through the integrated TranscriptionService
And returns a high-quality transcript with timestamps
And the transcript source is marked as "whisper" in the database
And the processing time is clearly communicated to the user
AC 4: Dual Transcript Comparison
Given I select "Compare Both" option
When processing completes
Then I see side-by-side transcript comparison
And differences are highlighted (word accuracy, punctuation, technical terms)
And quality metrics are shown for each transcript
And I can switch between transcripts for summary generation
AC 5: Automatic Fallback
Given YouTube captions are unavailable for a video
When I select "YouTube Captions" option
Then the system automatically falls back to Whisper transcription
And the user is notified of the fallback with a processing time estimate
And final result shows "whisper" as the source method
AC 6: Quality and Cost Transparency
Given I'm choosing transcript options
When I view the selection interface
Then I see clear indicators:
- Processing time estimates (YouTube: 2-5s, Whisper: 30-120s)
- Quality indicators (YouTube: "Standard", Whisper: "High Accuracy")
- Availability status (YouTube: "May not be available", Whisper: "Always available")
- Cost implications (YouTube: "Free", Whisper: "Uses compute resources")
Technical Implementation
Architecture Components
1. Real Whisper Service Integration
```python
# File: backend/services/whisper_transcript_service.py
from pathlib import Path
from typing import Any, Dict

# TranscriptionService is copied from the archived project (see Task 1.1);
# the import path depends on where the copied module lands in this backend.
from backend.services.transcription_service import TranscriptionService


class WhisperTranscriptService:
    """Real Whisper service replacing MockWhisperService."""

    def __init__(self, model_size: str = "small"):
        # Reuse the proven TranscriptionService from the archived project.
        self.transcription_service = TranscriptionService(
            repository=None,  # YouTube context doesn't need episode storage
            model_size=model_size,
            device="auto",
        )

    async def transcribe_audio(self, audio_path: Path) -> Dict[str, Any]:
        """Use the proven Whisper implementation from the archived project."""
        segments = await self.transcription_service._transcribe_audio_file(
            str(audio_path)
        )
        return self._convert_to_api_format(segments)
```
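A minimal usage sketch for the wrapper above, assuming the class is importable and an audio file has already been extracted by VideoDownloadService; the `"text"` key on the returned dict is an assumption about what `_convert_to_api_format` produces:

```python
import asyncio
from pathlib import Path

async def main() -> None:
    # Illustrative only: the audio path and the "text" key are assumptions.
    service = WhisperTranscriptService(model_size="small")
    result = await service.transcribe_audio(Path("/tmp/audio/sample.wav"))
    print(result.get("text", "")[:200])

asyncio.run(main())
```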
2. Enhanced Transcript Service
```python
# File: backend/services/enhanced_transcript_service.py
import asyncio
from typing import Dict, Optional


class DualTranscriptService(EnhancedTranscriptService):
    """Enhanced service with real Whisper integration."""

    async def extract_dual_transcripts(
        self, video_id: str
    ) -> Dict[str, Optional[TranscriptResult]]:
        """Extract both YouTube and Whisper transcripts in parallel."""
        # Run both extractions concurrently.
        youtube_task = asyncio.create_task(
            self._extract_youtube_transcript(video_id)
        )
        whisper_task = asyncio.create_task(
            self._extract_whisper_transcript(video_id)
        )
        youtube_result, whisper_result = await asyncio.gather(
            youtube_task, whisper_task, return_exceptions=True
        )
        # return_exceptions=True surfaces failures as exception objects;
        # map them to None so one failed source doesn't sink the other.
        return {
            'youtube': youtube_result if not isinstance(youtube_result, Exception) else None,
            'whisper': whisper_result if not isinstance(whisper_result, Exception) else None,
        }
```
3. Frontend Transcript Selector
```tsx
// File: frontend/src/components/TranscriptSelector.tsx
interface TranscriptSelectorProps {
  onSourceChange: (source: TranscriptSource) => void;
  estimatedDuration?: number;
}

const TranscriptSelector = ({ onSourceChange, estimatedDuration }: TranscriptSelectorProps) => {
  return (
    <div className="transcript-selector">
      <RadioGroup onValueChange={onSourceChange}>
        <div className="transcript-option">
          <RadioGroupItem value="youtube" />
          <Label>
            📺 YouTube Captions
            <span className="option-details">Fast, Free • 2-5 seconds</span>
            <span className="quality-indicator">Standard Quality</span>
          </Label>
        </div>
        <div className="transcript-option">
          <RadioGroupItem value="whisper" />
          <Label>
            🎯 AI Whisper Transcription
            <span className="option-details">
              Higher Accuracy • {estimateWhisperTime(estimatedDuration)}
            </span>
            <span className="quality-indicator">High Accuracy</span>
          </Label>
        </div>
        <div className="transcript-option">
          <RadioGroupItem value="both" />
          <Label>
            🔄 Compare Both
            <span className="option-details">Best Quality • See differences</span>
            <span className="quality-indicator">Quality Comparison</span>
          </Label>
        </div>
      </RadioGroup>
    </div>
  );
};
```
4. Transcript Comparison UI
```tsx
// File: frontend/src/components/TranscriptComparison.tsx
const TranscriptComparison = ({ youtubeTranscript, whisperTranscript }) => {
  return (
    <div className="transcript-comparison">
      <div className="comparison-header">
        <h3>Transcript Quality Comparison</h3>
        <div className="quality-metrics">
          <QualityBadge source="youtube" score={youtubeTranscript.qualityScore} />
          <QualityBadge source="whisper" score={whisperTranscript.qualityScore} />
        </div>
      </div>
      <div className="side-by-side-comparison">
        <TranscriptPanel
          title="📺 YouTube Captions"
          transcript={youtubeTranscript}
          highlightDifferences={true}
        />
        <TranscriptPanel
          title="🎯 AI Whisper"
          transcript={whisperTranscript}
          highlightDifferences={true}
        />
      </div>
      <div className="comparison-controls">
        <Button onClick={() => selectTranscript('youtube')}>
          Use YouTube Transcript
        </Button>
        <Button onClick={() => selectTranscript('whisper')}>
          Use Whisper Transcript
        </Button>
      </div>
    </div>
  );
};
```
Database Schema Changes
```sql
-- Extend summaries table for transcript metadata
ALTER TABLE summaries
  ADD COLUMN transcript_source VARCHAR(20),        -- 'youtube', 'whisper', 'both'
  ADD COLUMN transcript_quality_score FLOAT,
  ADD COLUMN youtube_transcript TEXT,
  ADD COLUMN whisper_transcript TEXT,
  ADD COLUMN whisper_processing_time FLOAT,
  ADD COLUMN transcript_comparison_data JSON;

-- Create index for transcript source queries
CREATE INDEX idx_summaries_transcript_source ON summaries(transcript_source);
```
API Endpoint Changes
```python
# File: backend/api/transcript.py
@router.post("/api/transcripts/dual/{video_id}")
async def get_dual_transcripts(
    video_id: str,
    options: TranscriptOptionsRequest,
    background_tasks: BackgroundTasks,
    current_user: User = Depends(get_current_user),
):
    """Get transcripts with the specified source option."""
    if options.source == "both":
        results = await dual_transcript_service.extract_dual_transcripts(video_id)
        return TranscriptComparisonResponse(
            video_id=video_id,
            transcripts=results,
            quality_comparison=analyze_transcript_differences(results),
        )
    elif options.source == "whisper":
        transcript = await dual_transcript_service.extract_transcript(
            video_id,
            force_whisper=True,
        )
        return TranscriptResponse(video_id=video_id, transcript=transcript)
    else:  # youtube (default)
        transcript = await dual_transcript_service.extract_transcript(
            video_id,
            force_whisper=False,
        )
        return TranscriptResponse(video_id=video_id, transcript=transcript)
```
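AC 5's automatic fallback would sit inside `extract_transcript`; a minimal sketch, assuming the private extraction helpers shown in `DualTranscriptService` above and a hypothetical `TranscriptUnavailableError` raised when captions are missing:

```python
async def extract_transcript(
    self, video_id: str, force_whisper: bool = False
) -> TranscriptResult:
    """Sketch of the AC 5 fallback: try YouTube captions first, then Whisper."""
    if not force_whisper:
        try:
            return await self._extract_youtube_transcript(video_id)
        except TranscriptUnavailableError:
            # Captions unavailable: fall through to Whisper and let the caller
            # notify the user of the fallback and the longer processing time.
            pass
    return await self._extract_whisper_transcript(video_id)
```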
Development Tasks
Phase 1: Backend Foundation (8 hours)
Task 1.1: Copy and Adapt TranscriptionService (3 hours)
- Copy TranscriptionService from archived_projects/personal-ai-assistant-v1.1.0/src/services/transcription_service.py
- Remove podcast-specific database dependencies (PodcastEpisode, PodcastTranscript, Repository)
- Adapt for YouTube video context (transcribe_episode → transcribe_audio_file)
- Create WhisperTranscriptService wrapper class
- Add async compatibility with asyncio.run_in_executor() (see the adapter sketch after this list)
- Test basic transcription functionality with sample audio
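A minimal sketch of the run_in_executor adapter referenced above, assuming the copied TranscriptionService exposes a blocking `_transcribe_audio_file(path)` method (the actual method name may differ after adaptation):

```python
import asyncio
from pathlib import Path
from typing import Any, List

async def transcribe_in_thread(service, audio_path: Path) -> List[Any]:
    """Run blocking Whisper inference in the default thread pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, service._transcribe_audio_file, str(audio_path)
    )
```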
Task 1.2: Update Dependencies and Environment (2 hours)
- Add to backend/requirements.txt: openai-whisper, torch, librosa, pydub, soundfile
- Update Dockerfile with ffmpeg system dependency
- Test Whisper model download (small model ~244MB)
- Configure environment variables (WHISPER_MODEL_SIZE, WHISPER_DEVICE); see the configuration sketch after this list
- Verify CUDA detection works (if available)
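A minimal configuration sketch for the environment variables above; the default values and the device-resolution helper are assumptions:

```python
import os

import torch  # torch.cuda.is_available() backs the "auto" device choice

WHISPER_MODEL_SIZE = os.getenv("WHISPER_MODEL_SIZE", "small")
WHISPER_DEVICE = os.getenv("WHISPER_DEVICE", "auto")

def resolve_device(device: str = WHISPER_DEVICE) -> str:
    """Map "auto" to cuda when a GPU is visible, otherwise cpu."""
    if device == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return device
```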
Task 1.3: Replace MockWhisperService (3 hours)
- Update EnhancedTranscriptService to use real WhisperTranscriptService
- Remove MockWhisperService references throughout codebase
- Update dependency injection in main.py and service factories
- Test integration with existing VideoDownloadService audio extraction
- Verify async compatibility with existing pipeline architecture
Phase 2: API Enhancement (4 hours)
Task 2.1: Create DualTranscriptService (2 hours)
- Create DualTranscriptService class extending EnhancedTranscriptService
- Implement extract_dual_transcripts() with parallel YouTube/Whisper processing
- Add quality comparison algorithm (word-level diff, confidence scoring); see the diff sketch after this list
- Implement separate caching strategy for YouTube vs Whisper results
- Test parallel processing performance and error handling
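A minimal sketch of the word-level diff mentioned above, using Python's difflib; the similarity ratio and the span structure are assumptions, not the final scoring algorithm:

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List

def word_level_similarity(youtube_text: str, whisper_text: str) -> float:
    """Return a 0-1 ratio of matching words between the two transcripts."""
    return SequenceMatcher(
        None, youtube_text.lower().split(), whisper_text.lower().split()
    ).ratio()

def differing_spans(youtube_text: str, whisper_text: str) -> List[Dict[str, Any]]:
    """Word ranges that differ, usable for highlighting in the comparison UI."""
    yt_words, wh_words = youtube_text.split(), whisper_text.split()
    matcher = SequenceMatcher(None, yt_words, wh_words)
    return [
        {"op": op, "youtube": yt_words[i1:i2], "whisper": wh_words[j1:j2]}
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```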
Task 2.2: Add New API Endpoints (2 hours)
- Create TranscriptOptionsRequest model (source, whisper_model, language, timestamps); see the model sketch after this list
- Add /api/transcripts/dual/{video_id} endpoint for dual transcript processing
- Update existing pipeline to accept transcript source preference
- Add TranscriptComparisonResponse model for side-by-side results
- Test all endpoint variations (youtube, whisper, both)
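A minimal Pydantic sketch of the TranscriptOptionsRequest model named above; field defaults are assumptions based on the option descriptions earlier in this story:

```python
from typing import Literal, Optional

from pydantic import BaseModel

class TranscriptOptionsRequest(BaseModel):
    source: Literal["youtube", "whisper", "both"] = "youtube"
    whisper_model: str = "small"        # tiny | base | small | medium | large
    language: Optional[str] = None      # None lets Whisper auto-detect
    timestamps: bool = True
```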
Phase 3: Database Schema (2 hours)
Task 3.1: Extend Summary Model (1 hour)
- Create Alembic migration for new transcript fields (see the migration sketch after this list)
- Add columns: transcript_source, transcript_quality_score, youtube_transcript, whisper_transcript, whisper_processing_time, transcript_comparison_data
- Update Summary model class with new fields
- Update repository methods for transcript metadata storage
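A minimal Alembic sketch of the migration above, mirroring the SQL schema changes in the Technical Implementation section; revision identifiers are placeholders:

```python
from alembic import op
import sqlalchemy as sa

revision = "add_transcript_fields"   # placeholder
down_revision = None                 # set to the current head in the real migration

def upgrade() -> None:
    op.add_column("summaries", sa.Column("transcript_source", sa.String(20)))
    op.add_column("summaries", sa.Column("transcript_quality_score", sa.Float()))
    op.add_column("summaries", sa.Column("youtube_transcript", sa.Text()))
    op.add_column("summaries", sa.Column("whisper_transcript", sa.Text()))
    op.add_column("summaries", sa.Column("whisper_processing_time", sa.Float()))
    op.add_column("summaries", sa.Column("transcript_comparison_data", sa.JSON()))
    op.create_index(
        "idx_summaries_transcript_source", "summaries", ["transcript_source"]
    )

def downgrade() -> None:
    op.drop_index("idx_summaries_transcript_source", table_name="summaries")
    for column in (
        "transcript_comparison_data",
        "whisper_processing_time",
        "whisper_transcript",
        "youtube_transcript",
        "transcript_quality_score",
        "transcript_source",
    ):
        op.drop_column("summaries", column)
```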
Task 3.2: Add Performance Indexes (1 hour)
- Create indexes on transcript_source, quality_score, processing_time
- Test query performance with sample data
- Verify migration runs successfully on development database
Phase 4: Frontend Implementation (6 hours)
Task 4.1: Create TranscriptSelector Component (2 hours)
- Create TranscriptSelector.tsx with Radix UI RadioGroup
- Add processing time estimation based on video duration
- Implement visual indicators (icons, quality badges, time estimates)
- Add proper TypeScript interfaces and accessibility labels
- Test responsive design and keyboard navigation
Task 4.2: Integrate with SummarizeForm (1 hour)
- Add transcript source state to SummarizeForm component
- Update form submission to include transcript options
- Add form validation for transcript source selection
- Test integration with existing admin page workflow
Task 4.3: Create TranscriptComparison Component (2 hours)
- Create TranscriptComparison.tsx for side-by-side display
- Implement word-level difference highlighting algorithm
- Add quality metric badges and comparison controls
- Create transcript selection buttons (Use YouTube/Use Whisper)
- Test with real YouTube vs Whisper transcript differences
Task 4.4: Update Processing UI (1 hour)
- Update ProgressTracker to show transcript source and method
- Add different progress messages for Whisper vs YouTube processing
- Show estimated time remaining for Whisper transcription
- Add error handling for Whisper processing failures with fallback notifications
Phase 5: Testing and Integration (2 hours)
Task 5.1: Unit Tests (1 hour)
- Backend: test_whisper_transcription_accuracy, test_dual_transcript_comparison, test_automatic_fallback
- Frontend: TranscriptSelector component tests, form integration tests
- API: dual transcript endpoint tests, transcript option validation tests
- Verify >80% test coverage for new code
Task 5.2: Integration Testing (1 hour)
- End-to-end test: YouTube vs Whisper quality comparison with real video
- Admin page testing: verify transcript selector works without authentication
- Error scenarios: unavailable YouTube captions (fallback), Whisper processing failure
- Performance testing: benchmark Whisper processing times, verify cache effectiveness
Testing Strategy
Unit Tests
```python
# Test file: backend/tests/unit/test_whisper_transcript_service.py
def test_whisper_transcription_accuracy():
    """Test Whisper transcription produces expected quality"""

def test_dual_transcript_comparison():
    """Test quality comparison algorithm"""

def test_automatic_fallback():
    """Test fallback when YouTube captions unavailable"""
```
Integration Tests
```python
# Test file: backend/tests/integration/test_dual_transcript_api.py
async def test_dual_transcript_endpoint():
    """Test dual transcript API with real video"""

async def test_transcript_source_selection():
    """Test different source options return expected results"""
```
User Acceptance Testing
- Test transcript quality comparison with 5 different video types
- Verify processing time estimates are accurate within 20%
- Confirm automatic fallback works for videos without captions
- Validate quality metrics correlate with perceived accuracy
Dependencies and Prerequisites
External Dependencies
- openai-whisper: Core Whisper transcription capability
- torch: PyTorch for Whisper model execution
- librosa: Audio processing and analysis
- pydub: Audio format conversion and manipulation
- ffmpeg: Audio/video processing (system dependency)
Internal Dependencies
- Epic 3 Complete: User authentication and session management
- VideoDownloadService: Required for audio extraction
- EnhancedTranscriptService: Base service for integration
- WebSocket infrastructure: For real-time progress updates
Hardware Requirements
- CPU: Multi-core processor (Whisper is CPU-intensive)
- Memory: 4GB+ RAM for "small" model, 8GB+ for "medium/large"
- Storage: Additional space for downloaded audio files
- GPU: Optional but recommended (CUDA support for faster processing)
Performance Considerations
Whisper Model Selection
- tiny: Fastest, lowest accuracy (39 MB)
- base: Good balance (74 MB)
- small: Recommended default (244 MB) ⭐
- medium: Higher accuracy (769 MB)
- large: Best accuracy (1550 MB)
Optimization Strategies
- Use "small" model by default for balance of speed/accuracy
- Implement model caching to avoid reloading (see the caching sketch after this list)
- Add GPU detection and automatic CUDA usage
- Chunk long videos to prevent memory issues
- Cache Whisper results aggressively (longer TTL than YouTube captions)
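A minimal sketch of the model-caching strategy above: load each model size once per process and reuse it across requests (thread-safety and cache eviction are out of scope for this sketch):

```python
import whisper  # openai-whisper

_MODEL_CACHE: dict = {}

def get_whisper_model(model_size: str = "small", device: str = "cpu"):
    """Return a cached Whisper model, loading it on first use."""
    key = f"{model_size}:{device}"
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = whisper.load_model(model_size, device=device)
    return _MODEL_CACHE[key]
```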
Risk Mitigation
High Risk Items
- Processing Time: Whisper can be slow for long videos
- Resource Usage: High CPU/memory consumption
- Model Downloads: Large model files on first run
- Audio Quality: Poor audio affects Whisper accuracy
Mitigation Strategies
- Time Estimates: Clear user expectations and progress indicators
- Resource Monitoring: Implement processing limits and queue management
- Model Management: Pre-download models in Docker image
- Quality Checks: Audio preprocessing and noise reduction
Success Criteria
Definition of Done
- Users can select between YouTube/Whisper/Both transcript options
- Real Whisper transcription integrated from archived codebase
- Processing time estimates accurate within 20%
- Quality comparison shows meaningful differences
- Automatic fallback works when YouTube captions unavailable
- All tests pass with >80% code coverage
- Performance acceptable (Whisper <2 minutes for 10-minute video)
- UI provides clear feedback during processing
- Database properly stores transcript metadata and quality scores
Acceptance Testing Scenarios
- Standard Use Case: Select Whisper for technical video, confirm accuracy improvement
- Comparison Mode: Use "Compare Both" option, review side-by-side differences
- Fallback Scenario: Process video without YouTube captions, verify Whisper fallback
- Long Video: Process 30+ minute video, confirm chunking works properly
- Error Handling: Test with corrupted audio, verify graceful error handling
Post-Implementation Considerations
Monitoring and Analytics
- Track transcript source usage patterns
- Monitor Whisper processing times and success rates
- Collect user feedback on transcript quality satisfaction
- Analyze cost implications of increased Whisper usage
Future Enhancements
- Custom Whisper model fine-tuning for specific domains
- Speaker identification integration
- Real-time transcription for live streams
- Multi-language Whisper support beyond English
Story Owner: Development Team
Reviewer: Technical Lead
Epic Reference: Epic 4 - Advanced Intelligence & Developer Platform
Story Status: Ready for Implementation
Last Updated: 2025-08-27