Story 4.1: Dual Transcript Options (YouTube + Whisper)
Story Overview
As a user processing YouTube videos
I want to choose between YouTube captions and AI Whisper transcription
So that I can control the trade-off between speed/cost and transcription accuracy
Story ID: 4.1
Epic: Epic 4 - Advanced Intelligence & Developer Platform
Priority: High 🔥
Effort Estimate: 22 hours
Sprint: Epic 4, Sprint 1
Business Value
Problem Statement
The current YouTube Summarizer relies solely on YouTube's captions, which often have quality issues:
- Poor punctuation and capitalization
- Incorrect technical terms and proper nouns
- Missing speaker identification
- Timing inaccuracies with rapid speech
- Complete unavailability for some videos
Users need the option to choose higher-accuracy AI transcription when quality matters more than speed.
Solution Value
- User Control: Choose optimal transcript source for each use case
- Quality Flexibility: Fast YouTube captions vs accurate AI transcription
- Fallback Reliability: Automatic Whisper backup when YouTube captions unavailable
- Transparency: Clear cost/time implications for informed decisions
- Leveraged Assets: Reuse proven TranscriptionService from archived projects
Success Metrics
- 80%+ of users understand transcript option differences
- 30%+ of users try Whisper transcription option
- 25%+ improvement in transcript accuracy using Whisper
- <5% user complaints about transcript quality
- Zero failed transcriptions due to unavailable YouTube captions
Acceptance Criteria
AC 1: Transcript Source Selection UI
Given I am processing a YouTube video
When I access the transcript options
Then I see three clear choices:
- 📺 YouTube Captions (Fast, Free, 2-5 seconds)
- 🎯 AI Whisper (Slower, Higher Quality, 30-120 seconds)
- 🔄 Compare Both (Best Quality, Shows differences)
And each option shows estimated processing time and quality level
AC 2: YouTube Transcript Processing (Default)
Given I select "YouTube Captions" option
When I submit a video URL
Then the system extracts the transcript using existing YouTube API methods
And processing completes in under 5 seconds
And the transcript source is marked as "youtube" in the database
And a quality score is calculated based on caption availability
AC 3: Whisper Transcript Processing
Given I select "AI Whisper" option
When I submit a video URL
Then the system downloads the video audio using the existing VideoDownloadService
And processes the audio through the integrated TranscriptionService
And returns a high-quality transcript with timestamps
And the transcript source is marked as "whisper" in the database
And the processing time is clearly communicated to the user
AC 4: Dual Transcript Comparison
Given I select "Compare Both" option
When processing completes
Then I see side-by-side transcript comparison
And differences are highlighted (word accuracy, punctuation, technical terms)
And quality metrics are shown for each transcript
And I can switch between transcripts for summary generation
AC 5: Automatic Fallback
Given YouTube captions are unavailable for a video
When I select "YouTube Captions" option
Then the system automatically falls back to Whisper transcription
And the user is notified of the fallback with a processing time estimate
And final result shows "whisper" as the source method
AC 6: Quality and Cost Transparency
Given I'm choosing transcript options
When I view the selection interface
Then I see clear indicators:
- Processing time estimates (YouTube: 2-5s, Whisper: 30-120s)
- Quality indicators (YouTube: "Standard", Whisper: "High Accuracy")
- Availability status (YouTube: "May not be available", Whisper: "Always available")
- Cost implications (YouTube: "Free", Whisper: "Uses compute resources")
Technical Implementation
Architecture Components
1. Real Whisper Service Integration
```python
# File: backend/services/whisper_transcript_service.py
from pathlib import Path
from typing import Any, Dict

# TranscriptionService is copied from the archived project (see Task 1.1);
# the import path depends on where the copied module lands in this backend.
from backend.services.transcription_service import TranscriptionService


class WhisperTranscriptService:
    """Real Whisper service replacing MockWhisperService."""

    def __init__(self, model_size: str = "small"):
        # Reuse the proven TranscriptionService from the archived project.
        self.transcription_service = TranscriptionService(
            repository=None,  # YouTube context doesn't need episode storage
            model_size=model_size,
            device="auto",
        )

    async def transcribe_audio(self, audio_path: Path) -> Dict[str, Any]:
        """Use the proven Whisper implementation from the archived project."""
        segments = await self.transcription_service._transcribe_audio_file(
            str(audio_path)
        )
        return self._convert_to_api_format(segments)
```
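A minimal usage sketch for the wrapper above, assuming the class is importable and an audio file has already been extracted by VideoDownloadService; the `"text"` key on the returned dict is an assumption about what `_convert_to_api_format` produces:

```python
import asyncio
from pathlib import Path

async def main() -> None:
    # Illustrative only: the audio path and the "text" key are assumptions.
    service = WhisperTranscriptService(model_size="small")
    result = await service.transcribe_audio(Path("/tmp/audio/sample.wav"))
    print(result.get("text", "")[:200])

asyncio.run(main())
```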
2. Enhanced Transcript Service
```python
# File: backend/services/enhanced_transcript_service.py
import asyncio
from typing import Dict, Optional


class DualTranscriptService(EnhancedTranscriptService):
    """Enhanced service with real Whisper integration."""

    async def extract_dual_transcripts(
        self, video_id: str
    ) -> Dict[str, Optional[TranscriptResult]]:
        """Extract both YouTube and Whisper transcripts in parallel."""
        # Run both extractions concurrently.
        youtube_task = asyncio.create_task(
            self._extract_youtube_transcript(video_id)
        )
        whisper_task = asyncio.create_task(
            self._extract_whisper_transcript(video_id)
        )
        youtube_result, whisper_result = await asyncio.gather(
            youtube_task, whisper_task, return_exceptions=True
        )
        # return_exceptions=True surfaces failures as exception objects;
        # map them to None so one failed source doesn't sink the other.
        return {
            'youtube': youtube_result if not isinstance(youtube_result, Exception) else None,
            'whisper': whisper_result if not isinstance(whisper_result, Exception) else None,
        }
```
3. Frontend Transcript Selector
```tsx
// File: frontend/src/components/TranscriptSelector.tsx
interface TranscriptSelectorProps {
  onSourceChange: (source: TranscriptSource) => void;
  estimatedDuration?: number;
}

const TranscriptSelector = ({ onSourceChange, estimatedDuration }: TranscriptSelectorProps) => {
  return (
    <div className="transcript-selector">
      <RadioGroup onValueChange={onSourceChange}>
        <div className="transcript-option">
          <RadioGroupItem value="youtube" />
          <Label>
            📺 YouTube Captions
            <span className="option-details">Fast, Free • 2-5 seconds</span>
            <span className="quality-indicator">Standard Quality</span>
          </Label>
        </div>
        <div className="transcript-option">
          <RadioGroupItem value="whisper" />
          <Label>
            🎯 AI Whisper Transcription
            <span className="option-details">
              Higher Accuracy • {estimateWhisperTime(estimatedDuration)}
            </span>
            <span className="quality-indicator">High Accuracy</span>
          </Label>
        </div>
        <div className="transcript-option">
          <RadioGroupItem value="both" />
          <Label>
            🔄 Compare Both
            <span className="option-details">Best Quality • See differences</span>
            <span className="quality-indicator">Quality Comparison</span>
          </Label>
        </div>
      </RadioGroup>
    </div>
  );
};
```
4. Transcript Comparison UI
```tsx
// File: frontend/src/components/TranscriptComparison.tsx
const TranscriptComparison = ({ youtubeTranscript, whisperTranscript }) => {
  return (
    <div className="transcript-comparison">
      <div className="comparison-header">
        <h3>Transcript Quality Comparison</h3>
        <div className="quality-metrics">
          <QualityBadge source="youtube" score={youtubeTranscript.qualityScore} />
          <QualityBadge source="whisper" score={whisperTranscript.qualityScore} />
        </div>
      </div>
      <div className="side-by-side-comparison">
        <TranscriptPanel
          title="📺 YouTube Captions"
          transcript={youtubeTranscript}
          highlightDifferences={true}
        />
        <TranscriptPanel
          title="🎯 AI Whisper"
          transcript={whisperTranscript}
          highlightDifferences={true}
        />
      </div>
      <div className="comparison-controls">
        <Button onClick={() => selectTranscript('youtube')}>
          Use YouTube Transcript
        </Button>
        <Button onClick={() => selectTranscript('whisper')}>
          Use Whisper Transcript
        </Button>
      </div>
    </div>
  );
};
```
Database Schema Changes
```sql
-- Extend summaries table for transcript metadata
ALTER TABLE summaries
  ADD COLUMN transcript_source VARCHAR(20),        -- 'youtube', 'whisper', 'both'
  ADD COLUMN transcript_quality_score FLOAT,
  ADD COLUMN youtube_transcript TEXT,
  ADD COLUMN whisper_transcript TEXT,
  ADD COLUMN whisper_processing_time FLOAT,
  ADD COLUMN transcript_comparison_data JSON;

-- Create index for transcript source queries
CREATE INDEX idx_summaries_transcript_source ON summaries(transcript_source);
```
API Endpoint Changes
```python
# File: backend/api/transcript.py
@router.post("/api/transcripts/dual/{video_id}")
async def get_dual_transcripts(
    video_id: str,
    options: TranscriptOptionsRequest,
    background_tasks: BackgroundTasks,
    current_user: User = Depends(get_current_user),
):
    """Get transcripts with the specified source option."""
    if options.source == "both":
        results = await dual_transcript_service.extract_dual_transcripts(video_id)
        return TranscriptComparisonResponse(
            video_id=video_id,
            transcripts=results,
            quality_comparison=analyze_transcript_differences(results),
        )
    elif options.source == "whisper":
        transcript = await dual_transcript_service.extract_transcript(
            video_id,
            force_whisper=True,
        )
        return TranscriptResponse(video_id=video_id, transcript=transcript)
    else:  # youtube (default)
        transcript = await dual_transcript_service.extract_transcript(
            video_id,
            force_whisper=False,
        )
        return TranscriptResponse(video_id=video_id, transcript=transcript)
```
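AC 5's automatic fallback would sit inside `extract_transcript`; a minimal sketch, assuming the private extraction helpers shown in `DualTranscriptService` above and a hypothetical `TranscriptUnavailableError` raised when captions are missing:

```python
async def extract_transcript(
    self, video_id: str, force_whisper: bool = False
) -> TranscriptResult:
    """Sketch of the AC 5 fallback: try YouTube captions first, then Whisper."""
    if not force_whisper:
        try:
            return await self._extract_youtube_transcript(video_id)
        except TranscriptUnavailableError:
            # Captions unavailable: fall through to Whisper and let the caller
            # notify the user of the fallback and the longer processing time.
            pass
    return await self._extract_whisper_transcript(video_id)
```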
Development Tasks
Phase 1: Backend Foundation (8 hours)
Task 1.1: Copy and Adapt TranscriptionService (3 hours)
- Copy TranscriptionService from archived_projects/personal-ai-assistant-v1.1.0/src/services/transcription_service.py
- Remove podcast-specific database dependencies (PodcastEpisode, PodcastTranscript, Repository)
- Adapt for YouTube video context (transcribe_episode → transcribe_audio_file)
- Create WhisperTranscriptService wrapper class
- Add async compatibility with asyncio.run_in_executor() (see the adapter sketch after this list)
- Test basic transcription functionality with sample audio
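A minimal sketch of the run_in_executor adapter referenced above, assuming the copied TranscriptionService exposes a blocking `_transcribe_audio_file(path)` method (the actual method name may differ after adaptation):

```python
import asyncio
from pathlib import Path
from typing import Any, List

async def transcribe_in_thread(service, audio_path: Path) -> List[Any]:
    """Run blocking Whisper inference in the default thread pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, service._transcribe_audio_file, str(audio_path)
    )
```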
Task 1.2: Update Dependencies and Environment (2 hours)
- Add to backend/requirements.txt: openai-whisper, torch, librosa, pydub, soundfile
- Update Dockerfile with ffmpeg system dependency
- Test Whisper model download (small model ~244MB)
- Configure environment variables (WHISPER_MODEL_SIZE, WHISPER_DEVICE); see the configuration sketch after this list
- Verify CUDA detection works (if available)
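A minimal configuration sketch for the environment variables above; the default values and the device-resolution helper are assumptions:

```python
import os

import torch  # torch.cuda.is_available() backs the "auto" device choice

WHISPER_MODEL_SIZE = os.getenv("WHISPER_MODEL_SIZE", "small")
WHISPER_DEVICE = os.getenv("WHISPER_DEVICE", "auto")

def resolve_device(device: str = WHISPER_DEVICE) -> str:
    """Map "auto" to cuda when a GPU is visible, otherwise cpu."""
    if device == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return device
```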
Task 1.3: Replace MockWhisperService (3 hours)
- Update EnhancedTranscriptService to use real WhisperTranscriptService
- Remove MockWhisperService references throughout codebase
- Update dependency injection in main.py and service factories
- Test integration with existing VideoDownloadService audio extraction
- Verify async compatibility with existing pipeline architecture
Phase 2: API Enhancement (4 hours)
Task 2.1: Create DualTranscriptService (2 hours)
- Create DualTranscriptService class extending EnhancedTranscriptService
- Implement extract_dual_transcripts() with parallel YouTube/Whisper processing
- Add quality comparison algorithm (word-level diff, confidence scoring); see the diff sketch after this list
- Implement separate caching strategy for YouTube vs Whisper results
- Test parallel processing performance and error handling
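A minimal sketch of the word-level diff mentioned above, using Python's difflib; the similarity ratio and the span structure are assumptions, not the final scoring algorithm:

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List

def word_level_similarity(youtube_text: str, whisper_text: str) -> float:
    """Return a 0-1 ratio of matching words between the two transcripts."""
    return SequenceMatcher(
        None, youtube_text.lower().split(), whisper_text.lower().split()
    ).ratio()

def differing_spans(youtube_text: str, whisper_text: str) -> List[Dict[str, Any]]:
    """Word ranges that differ, usable for highlighting in the comparison UI."""
    yt_words, wh_words = youtube_text.split(), whisper_text.split()
    matcher = SequenceMatcher(None, yt_words, wh_words)
    return [
        {"op": op, "youtube": yt_words[i1:i2], "whisper": wh_words[j1:j2]}
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```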
Task 2.2: Add New API Endpoints (2 hours)
- Create TranscriptOptionsRequest model (source, whisper_model, language, timestamps); see the model sketch after this list
- Add /api/transcripts/dual/{video_id} endpoint for dual transcript processing
- Update existing pipeline to accept transcript source preference
- Add TranscriptComparisonResponse model for side-by-side results
- Test all endpoint variations (youtube, whisper, both)
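A minimal Pydantic sketch of the TranscriptOptionsRequest model named above; field defaults are assumptions based on the option descriptions earlier in this story:

```python
from typing import Literal, Optional

from pydantic import BaseModel

class TranscriptOptionsRequest(BaseModel):
    source: Literal["youtube", "whisper", "both"] = "youtube"
    whisper_model: str = "small"        # tiny | base | small | medium | large
    language: Optional[str] = None      # None lets Whisper auto-detect
    timestamps: bool = True
```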
Phase 3: Database Schema (2 hours)
Task 3.1: Extend Summary Model (1 hour)
- Create Alembic migration for new transcript fields (see the migration sketch after this list)
- Add columns: transcript_source, transcript_quality_score, youtube_transcript, whisper_transcript, whisper_processing_time, transcript_comparison_data
- Update Summary model class with new fields
- Update repository methods for transcript metadata storage
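A minimal Alembic sketch of the migration above, mirroring the SQL schema changes in the Technical Implementation section; revision identifiers are placeholders:

```python
from alembic import op
import sqlalchemy as sa

revision = "add_transcript_fields"   # placeholder
down_revision = None                 # set to the current head in the real migration

def upgrade() -> None:
    op.add_column("summaries", sa.Column("transcript_source", sa.String(20)))
    op.add_column("summaries", sa.Column("transcript_quality_score", sa.Float()))
    op.add_column("summaries", sa.Column("youtube_transcript", sa.Text()))
    op.add_column("summaries", sa.Column("whisper_transcript", sa.Text()))
    op.add_column("summaries", sa.Column("whisper_processing_time", sa.Float()))
    op.add_column("summaries", sa.Column("transcript_comparison_data", sa.JSON()))
    op.create_index(
        "idx_summaries_transcript_source", "summaries", ["transcript_source"]
    )

def downgrade() -> None:
    op.drop_index("idx_summaries_transcript_source", table_name="summaries")
    for column in (
        "transcript_comparison_data",
        "whisper_processing_time",
        "whisper_transcript",
        "youtube_transcript",
        "transcript_quality_score",
        "transcript_source",
    ):
        op.drop_column("summaries", column)
```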
Task 3.2: Add Performance Indexes (1 hour)
- Create indexes on transcript_source, quality_score, processing_time
- Test query performance with sample data
- Verify migration runs successfully on development database
Phase 4: Frontend Implementation (6 hours)
Task 4.1: Create TranscriptSelector Component (2 hours)
- Create TranscriptSelector.tsx with Radix UI RadioGroup
- Add processing time estimation based on video duration
- Implement visual indicators (icons, quality badges, time estimates)
- Add proper TypeScript interfaces and accessibility labels
- Test responsive design and keyboard navigation
Task 4.2: Integrate with SummarizeForm (1 hour)
- Add transcript source state to SummarizeForm component
- Update form submission to include transcript options
- Add form validation for transcript source selection
- Test integration with existing admin page workflow
Task 4.3: Create TranscriptComparison Component (2 hours)
- Create TranscriptComparison.tsx for side-by-side display
- Implement word-level difference highlighting algorithm
- Add quality metric badges and comparison controls
- Create transcript selection buttons (Use YouTube/Use Whisper)
- Test with real YouTube vs Whisper transcript differences
Task 4.4: Update Processing UI (1 hour)
- Update ProgressTracker to show transcript source and method
- Add different progress messages for Whisper vs YouTube processing
- Show estimated time remaining for Whisper transcription
- Add error handling for Whisper processing failures with fallback notifications
Phase 5: Testing and Integration (2 hours)
Task 5.1: Unit Tests (1 hour)
- Backend: test_whisper_transcription_accuracy, test_dual_transcript_comparison, test_automatic_fallback
- Frontend: TranscriptSelector component tests, form integration tests
- API: dual transcript endpoint tests, transcript option validation tests
- Verify >80% test coverage for new code
Task 5.2: Integration Testing (1 hour)
- End-to-end test: YouTube vs Whisper quality comparison with real video
- Admin page testing: verify transcript selector works without authentication
- Error scenarios: unavailable YouTube captions (fallback), Whisper processing failure
- Performance testing: benchmark Whisper processing times, verify cache effectiveness
Testing Strategy
Unit Tests
```python
# Test file: backend/tests/unit/test_whisper_transcript_service.py
def test_whisper_transcription_accuracy():
    """Test Whisper transcription produces expected quality"""

def test_dual_transcript_comparison():
    """Test quality comparison algorithm"""

def test_automatic_fallback():
    """Test fallback when YouTube captions unavailable"""
```
Integration Tests
```python
# Test file: backend/tests/integration/test_dual_transcript_api.py
async def test_dual_transcript_endpoint():
    """Test dual transcript API with real video"""

async def test_transcript_source_selection():
    """Test different source options return expected results"""
```
User Acceptance Testing
- Test transcript quality comparison with 5 different video types
- Verify processing time estimates are accurate within 20%
- Confirm automatic fallback works for videos without captions
- Validate quality metrics correlate with perceived accuracy
Dependencies and Prerequisites
External Dependencies
- openai-whisper: Core Whisper transcription capability
- torch: PyTorch for Whisper model execution
- librosa: Audio processing and analysis
- pydub: Audio format conversion and manipulation
- ffmpeg: Audio/video processing (system dependency)
Internal Dependencies
- Epic 3 Complete: User authentication and session management
- VideoDownloadService: Required for audio extraction
- EnhancedTranscriptService: Base service for integration
- WebSocket infrastructure: For real-time progress updates
Hardware Requirements
- CPU: Multi-core processor (Whisper is CPU-intensive)
- Memory: 4GB+ RAM for "small" model, 8GB+ for "medium/large"
- Storage: Additional space for downloaded audio files
- GPU: Optional but recommended (CUDA support for faster processing)
Performance Considerations
Whisper Model Selection
- tiny: Fastest, lowest accuracy (39 MB)
- base: Good balance (74 MB)
- small: Recommended default (244 MB) ⭐
- medium: Higher accuracy (769 MB)
- large: Best accuracy (1550 MB)
Optimization Strategies
- Use "small" model by default for balance of speed/accuracy
- Implement model caching to avoid reloading (see the caching sketch after this list)
- Add GPU detection and automatic CUDA usage
- Chunk long videos to prevent memory issues
- Cache Whisper results aggressively (longer TTL than YouTube captions)
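A minimal sketch of the model-caching strategy above: load each model size once per process and reuse it across requests (thread-safety and cache eviction are out of scope for this sketch):

```python
import whisper  # openai-whisper

_MODEL_CACHE: dict = {}

def get_whisper_model(model_size: str = "small", device: str = "cpu"):
    """Return a cached Whisper model, loading it on first use."""
    key = f"{model_size}:{device}"
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = whisper.load_model(model_size, device=device)
    return _MODEL_CACHE[key]
```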
Risk Mitigation
High Risk Items
- Processing Time: Whisper can be slow for long videos
- Resource Usage: High CPU/memory consumption
- Model Downloads: Large model files on first run
- Audio Quality: Poor audio affects Whisper accuracy
Mitigation Strategies
- Time Estimates: Clear user expectations and progress indicators
- Resource Monitoring: Implement processing limits and queue management
- Model Management: Pre-download models in Docker image
- Quality Checks: Audio preprocessing and noise reduction
Success Criteria
Definition of Done
- Users can select between YouTube/Whisper/Both transcript options
- Real Whisper transcription integrated from archived codebase
- Processing time estimates accurate within 20%
- Quality comparison shows meaningful differences
- Automatic fallback works when YouTube captions unavailable
- All tests pass with >80% code coverage
- Performance acceptable (Whisper <2 minutes for 10-minute video)
- UI provides clear feedback during processing
- Database properly stores transcript metadata and quality scores
Acceptance Testing Scenarios
- Standard Use Case: Select Whisper for technical video, confirm accuracy improvement
- Comparison Mode: Use "Compare Both" option, review side-by-side differences
- Fallback Scenario: Process video without YouTube captions, verify Whisper fallback
- Long Video: Process 30+ minute video, confirm chunking works properly
- Error Handling: Test with corrupted audio, verify graceful error handling
Post-Implementation Considerations
Monitoring and Analytics
- Track transcript source usage patterns
- Monitor Whisper processing times and success rates
- Collect user feedback on transcript quality satisfaction
- Analyze cost implications of increased Whisper usage
Future Enhancements
- Custom Whisper model fine-tuning for specific domains
- Speaker identification integration
- Real-time transcription for live streams
- Multi-language Whisper support beyond English
Story Owner: Development Team
Reviewer: Technical Lead
Epic Reference: Epic 4 - Advanced Intelligence & Developer Platform
Story Status: Ready for Implementation
Last Updated: 2025-08-27