# Story 4.1: Dual Transcript Options (YouTube + Whisper)
## Story Overview
**As a** user processing YouTube videos
**I want** to choose between YouTube captions and AI Whisper transcription
**So that** I can control the trade-off between speed/cost and transcription accuracy
**Story ID**: 4.1
**Epic**: Epic 4 - Advanced Intelligence & Developer Platform
**Priority**: High 🔥
**Effort Estimate**: 22 hours
**Sprint**: Epic 4, Sprint 1
## Business Value
### Problem Statement
The current YouTube Summarizer relies solely on YouTube's captions, which often have quality issues:
- Poor punctuation and capitalization
- Incorrect technical terms and proper nouns
- Missing speaker identification
- Timing inaccuracies with rapid speech
- Complete unavailability for some videos
Users need the option to choose higher-accuracy AI transcription when quality matters more than speed.
### Solution Value
- **User Control**: Choose optimal transcript source for each use case
- **Quality Flexibility**: Fast YouTube captions vs accurate AI transcription
- **Fallback Reliability**: Automatic Whisper backup when YouTube captions unavailable
- **Transparency**: Clear cost/time implications for informed decisions
- **Leveraged Assets**: Reuse proven TranscriptionService from archived projects
### Success Metrics
- [ ] 80%+ of users understand transcript option differences
- [ ] 30%+ of users try Whisper transcription option
- [ ] 25%+ improvement in transcript accuracy using Whisper
- [ ] <5% user complaints about transcript quality
- [ ] Zero failed transcriptions due to unavailable YouTube captions
## Acceptance Criteria
### AC 1: Transcript Source Selection UI
**Given** I am processing a YouTube video
**When** I access the transcript options
**Then** I see three clear choices:
- 📺 **YouTube Captions** (Fast, Free, 2-5 seconds)
- 🎯 **AI Whisper** (Slower, Higher Quality, 30-120 seconds)
- 🔄 **Compare Both** (Best Quality, Shows differences)
**And** each option shows estimated processing time and quality level
### AC 2: YouTube Transcript Processing (Default)
**Given** I select "YouTube Captions" option
**When** I submit a video URL
**Then** the system extracts transcript using existing YouTube API methods
**And** processing completes in under 5 seconds
**And** the transcript source is marked as "youtube" in the database
**And** a quality score is calculated based on caption availability
### AC 3: Whisper Transcript Processing
**Given** I select "AI Whisper" option
**When** I submit a video URL
**Then** the system downloads video audio using existing VideoDownloadService
**And** processes audio through integrated TranscriptionService
**And** returns high-quality transcript with timestamps
**And** the transcript source is marked as "whisper" in the database
**And** processing time is clearly communicated to user
### AC 4: Dual Transcript Comparison
**Given** I select "Compare Both" option
**When** processing completes
**Then** I see side-by-side transcript comparison
**And** differences are highlighted (word accuracy, punctuation, technical terms)
**And** quality metrics are shown for each transcript
**And** I can switch between transcripts for summary generation
### AC 5: Automatic Fallback
**Given** YouTube captions are unavailable for a video
**When** I select "YouTube Captions" option
**Then** system automatically falls back to Whisper transcription
**And** user is notified of the fallback with processing time estimate
**And** final result shows "whisper" as the source method
### AC 6: Quality and Cost Transparency
**Given** I'm choosing transcript options
**When** I view the selection interface
**Then** I see clear indicators:
- Processing time estimates (YouTube: 2-5s, Whisper: 30-120s)
- Quality indicators (YouTube: "Standard", Whisper: "High Accuracy")
- Availability status (YouTube: "May not be available", Whisper: "Always available")
- Cost implications (YouTube: "Free", Whisper: "Uses compute resources")
## Technical Implementation
### Architecture Components
#### 1. Real Whisper Service Integration
```python
# File: backend/services/whisper_transcript_service.py
from pathlib import Path
from typing import Any, Dict

class WhisperTranscriptService:
    """Real Whisper service replacing MockWhisperService"""

    def __init__(self, model_size: str = "small"):
        # Reuse the proven TranscriptionService copied from the archived project
        self.transcription_service = TranscriptionService(
            repository=None,  # YouTube context doesn't need episode storage
            model_size=model_size,
            device="auto",
        )

    async def transcribe_audio(self, audio_path: Path) -> Dict[str, Any]:
        """Use the proven Whisper implementation from the archived project"""
        segments = await self.transcription_service._transcribe_audio_file(
            str(audio_path)
        )
        return self._convert_to_api_format(segments)
```
#### 2. Enhanced Transcript Service
```python
# File: backend/services/enhanced_transcript_service.py
import asyncio
from typing import Dict, Optional

class DualTranscriptService(EnhancedTranscriptService):
    """Enhanced service with real Whisper integration"""

    async def extract_dual_transcripts(
        self, video_id: str
    ) -> Dict[str, Optional[TranscriptResult]]:
        """Extract both YouTube and Whisper transcripts in parallel"""
        youtube_task = asyncio.create_task(
            self._extract_youtube_transcript(video_id)
        )
        whisper_task = asyncio.create_task(
            self._extract_whisper_transcript(video_id)
        )
        youtube_result, whisper_result = await asyncio.gather(
            youtube_task, whisper_task, return_exceptions=True
        )
        # A failed extraction surfaces as None rather than raising
        return {
            'youtube': youtube_result if not isinstance(youtube_result, Exception) else None,
            'whisper': whisper_result if not isinstance(whisper_result, Exception) else None,
        }
```
#### 3. Frontend Transcript Selector
```tsx
// File: frontend/src/components/TranscriptSelector.tsx
type TranscriptSource = 'youtube' | 'whisper' | 'both';

interface TranscriptSelectorProps {
  onSourceChange: (source: TranscriptSource) => void;
  estimatedDuration?: number; // video length in seconds
}

// Rough estimate only; the ~4x-real-time CPU rate is an assumption to verify
const estimateWhisperTime = (duration?: number): string =>
  duration ? `~${Math.ceil(duration / 4)} seconds` : '30-120 seconds';

const TranscriptSelector = ({ onSourceChange, estimatedDuration }: TranscriptSelectorProps) => {
  return (
    <div className="transcript-selector">
      <RadioGroup onValueChange={(value) => onSourceChange(value as TranscriptSource)}>
        <div className="transcript-option">
          <RadioGroupItem value="youtube" />
          <Label>
            📺 YouTube Captions
            <span className="option-details">Fast, Free, 2-5 seconds</span>
            <span className="quality-indicator">Standard Quality</span>
          </Label>
        </div>
        <div className="transcript-option">
          <RadioGroupItem value="whisper" />
          <Label>
            🎯 AI Whisper Transcription
            <span className="option-details">
              Higher Accuracy, {estimateWhisperTime(estimatedDuration)}
            </span>
            <span className="quality-indicator">High Accuracy</span>
          </Label>
        </div>
        <div className="transcript-option">
          <RadioGroupItem value="both" />
          <Label>
            🔄 Compare Both
            <span className="option-details">Best Quality, Shows differences</span>
            <span className="quality-indicator">Quality Comparison</span>
          </Label>
        </div>
      </RadioGroup>
    </div>
  );
};
```
#### 4. Transcript Comparison UI
```tsx
// File: frontend/src/components/TranscriptComparison.tsx
const TranscriptComparison = ({ youtubeTranscript, whisperTranscript }) => {
  return (
    <div className="transcript-comparison">
      <div className="comparison-header">
        <h3>Transcript Quality Comparison</h3>
        <div className="quality-metrics">
          <QualityBadge source="youtube" score={youtubeTranscript.qualityScore} />
          <QualityBadge source="whisper" score={whisperTranscript.qualityScore} />
        </div>
      </div>
      <div className="side-by-side-comparison">
        <TranscriptPanel
          title="📺 YouTube Captions"
          transcript={youtubeTranscript}
          highlightDifferences={true}
        />
        <TranscriptPanel
          title="🎯 AI Whisper"
          transcript={whisperTranscript}
          highlightDifferences={true}
        />
      </div>
      <div className="comparison-controls">
        <Button onClick={() => selectTranscript('youtube')}>
          Use YouTube Transcript
        </Button>
        <Button onClick={() => selectTranscript('whisper')}>
          Use Whisper Transcript
        </Button>
      </div>
    </div>
  );
};
```
### Database Schema Changes
```sql
-- Extend summaries table for transcript metadata
ALTER TABLE summaries
ADD COLUMN transcript_source VARCHAR(20), -- 'youtube', 'whisper', 'both'
ADD COLUMN transcript_quality_score FLOAT,
ADD COLUMN youtube_transcript TEXT,
ADD COLUMN whisper_transcript TEXT,
ADD COLUMN whisper_processing_time FLOAT,
ADD COLUMN transcript_comparison_data JSON;
-- Create index for transcript source queries
CREATE INDEX idx_summaries_transcript_source ON summaries(transcript_source);
```
### API Endpoint Changes
```python
# File: backend/api/transcript.py
@router.post("/api/transcripts/dual/{video_id}")
async def get_dual_transcripts(
    video_id: str,
    options: TranscriptOptionsRequest,
    background_tasks: BackgroundTasks,
    current_user: User = Depends(get_current_user),
):
    """Get transcripts with the specified source option"""
    if options.source == "both":
        results = await dual_transcript_service.extract_dual_transcripts(video_id)
        return TranscriptComparisonResponse(
            video_id=video_id,
            transcripts=results,
            quality_comparison=analyze_transcript_differences(results),
        )
    elif options.source == "whisper":
        transcript = await dual_transcript_service.extract_transcript(
            video_id, force_whisper=True
        )
        return TranscriptResponse(video_id=video_id, transcript=transcript)
    else:  # youtube (default; falls back to Whisper per AC 5)
        transcript = await dual_transcript_service.extract_transcript(
            video_id, force_whisper=False
        )
        return TranscriptResponse(video_id=video_id, transcript=transcript)
```
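The default branch above relies on the automatic fallback required by AC 5. A minimal sketch of how `extract_transcript()` might implement it; the exception type and notification helper are assumptions, not existing code:
```python
# Hypothetical fallback path inside DualTranscriptService
async def extract_transcript(
    self, video_id: str, force_whisper: bool = False
) -> TranscriptResult:
    if not force_whisper:
        try:
            return await self._extract_youtube_transcript(video_id)
        except TranscriptUnavailableError:
            # AC 5: captions missing, fall back to Whisper and notify the user
            await self._notify_fallback(video_id)
    result = await self._extract_whisper_transcript(video_id)
    result.source = "whisper"  # final result reports the method actually used
    return result
```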
## Development Tasks
### Phase 1: Backend Foundation (8 hours)
#### Task 1.1: Copy and Adapt TranscriptionService (3 hours)
- [x] Copy TranscriptionService from `archived_projects/personal-ai-assistant-v1.1.0/src/services/transcription_service.py`
- [ ] Remove podcast-specific database dependencies (PodcastEpisode, PodcastTranscript, Repository)
- [ ] Adapt for YouTube video context (transcribe_episode → transcribe_audio_file)
- [ ] Create WhisperTranscriptService wrapper class
- [ ] Add async compatibility with asyncio.run_in_executor() (see the sketch after this list)
- [ ] Test basic transcription functionality with sample audio
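Whisper itself is synchronous, so the async surface needs an executor hop. A minimal sketch of the `run_in_executor()` adaptation, assuming the adapted service exposes a blocking `transcribe_audio_file()` method:
```python
import asyncio
from pathlib import Path
from typing import Any, Dict

async def transcribe_async(service: Any, audio_path: Path) -> Dict[str, Any]:
    """Run blocking Whisper transcription without stalling the event loop."""
    loop = asyncio.get_running_loop()
    # None selects the default ThreadPoolExecutor; transcription is CPU-bound,
    # so a dedicated executor may be worth it if requests can queue up.
    return await loop.run_in_executor(
        None, lambda: service.transcribe_audio_file(str(audio_path))
    )
```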
#### Task 1.2: Update Dependencies and Environment (2 hours)
- [ ] Add to backend/requirements.txt: openai-whisper, torch, librosa, pydub, soundfile
- [ ] Update Dockerfile with ffmpeg system dependency
- [ ] Test Whisper model download (small model, 244 M parameters)
- [ ] Configure environment variables (WHISPER_MODEL_SIZE, WHISPER_DEVICE); see the sketch after this list
- [ ] Verify CUDA detection works (if available)
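One possible wiring for those variables; `torch` is already a planned dependency, and the `resolve_device()` helper is illustrative:
```python
import os
import torch

WHISPER_MODEL_SIZE = os.environ.get("WHISPER_MODEL_SIZE", "small")
WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "auto")

def resolve_device(setting: str = WHISPER_DEVICE) -> str:
    """Map 'auto' to CUDA when a GPU is present, otherwise CPU."""
    if setting == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return setting
```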
#### Task 1.3: Replace MockWhisperService (3 hours)
- [ ] Update EnhancedTranscriptService to use real WhisperTranscriptService
- [ ] Remove MockWhisperService references throughout codebase
- [ ] Update dependency injection in main.py and service factories
- [ ] Test integration with existing VideoDownloadService audio extraction
- [ ] Verify async compatibility with existing pipeline architecture
### Phase 2: API Enhancement (4 hours)
#### Task 2.1: Create DualTranscriptService (2 hours)
- [ ] Create DualTranscriptService class extending EnhancedTranscriptService
- [ ] Implement extract_dual_transcripts() with parallel YouTube/Whisper processing
- [ ] Add quality comparison algorithm (word-level diff, confidence scoring); a sketch follows this list
- [ ] Implement separate caching strategy for YouTube vs Whisper results
- [ ] Test parallel processing performance and error handling
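A first pass at the comparison algorithm could lean on the standard library; this sketch scores word-level agreement only, and the metric names are assumptions:
```python
from difflib import SequenceMatcher
from typing import Dict

def compare_transcripts(youtube_text: str, whisper_text: str) -> Dict[str, float]:
    """Word-level similarity between the two transcripts (0.0-1.0)."""
    youtube_words = youtube_text.lower().split()
    whisper_words = whisper_text.lower().split()
    matcher = SequenceMatcher(None, youtube_words, whisper_words)
    return {
        "word_similarity": matcher.ratio(),
        "youtube_word_count": float(len(youtube_words)),
        "whisper_word_count": float(len(whisper_words)),
    }
```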
#### Task 2.2: Add New API Endpoints (2 hours)
- [ ] Create TranscriptOptionsRequest model (source, whisper_model, language, timestamps); a possible shape is sketched after this list
- [ ] Add /api/transcripts/dual/{video_id} endpoint for dual transcript processing
- [ ] Update existing pipeline to accept transcript source preference
- [ ] Add TranscriptComparisonResponse model for side-by-side results
- [ ] Test all endpoint variations (youtube, whisper, both)
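A plausible Pydantic shape for the request model; the defaults are assumptions to confirm during implementation:
```python
from typing import Literal, Optional
from pydantic import BaseModel

class TranscriptOptionsRequest(BaseModel):
    source: Literal["youtube", "whisper", "both"] = "youtube"
    whisper_model: Literal["tiny", "base", "small", "medium", "large"] = "small"
    language: Optional[str] = None  # None lets Whisper auto-detect
    timestamps: bool = True
```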
### Phase 3: Database Schema (2 hours)
#### Task 3.1: Extend Summary Model (1 hour)
- [ ] Create Alembic migration for new transcript fields (sketched after this task)
- [ ] Add columns: transcript_source, transcript_quality_score, youtube_transcript, whisper_transcript, whisper_processing_time, transcript_comparison_data
- [ ] Update Summary model class with new fields
- [ ] Update repository methods for transcript metadata storage
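The migration's upgrade step might look like the following, assuming the table is named `summaries` as in the SQL above (downgrade step omitted):
```python
import sqlalchemy as sa
from alembic import op

def upgrade() -> None:
    op.add_column("summaries", sa.Column("transcript_source", sa.String(20)))
    op.add_column("summaries", sa.Column("transcript_quality_score", sa.Float()))
    op.add_column("summaries", sa.Column("youtube_transcript", sa.Text()))
    op.add_column("summaries", sa.Column("whisper_transcript", sa.Text()))
    op.add_column("summaries", sa.Column("whisper_processing_time", sa.Float()))
    op.add_column("summaries", sa.Column("transcript_comparison_data", sa.JSON()))
    op.create_index(
        "idx_summaries_transcript_source", "summaries", ["transcript_source"]
    )
```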
#### Task 3.2: Add Performance Indexes (1 hour)
- [ ] Create indexes on transcript_source, quality_score, processing_time
- [ ] Test query performance with sample data
- [ ] Verify migration runs successfully on development database
### Phase 4: Frontend Implementation (6 hours)
#### Task 4.1: Create TranscriptSelector Component (2 hours)
- [ ] Create TranscriptSelector.tsx with Radix UI RadioGroup
- [ ] Add processing time estimation based on video duration
- [ ] Implement visual indicators (icons, quality badges, time estimates)
- [ ] Add proper TypeScript interfaces and accessibility labels
- [ ] Test responsive design and keyboard navigation
#### Task 4.2: Integrate with SummarizeForm (1 hour)
- [ ] Add transcript source state to SummarizeForm component
- [ ] Update form submission to include transcript options
- [ ] Add form validation for transcript source selection
- [ ] Test integration with existing admin page workflow
#### Task 4.3: Create TranscriptComparison Component (2 hours)
- [ ] Create TranscriptComparison.tsx for side-by-side display
- [ ] Implement word-level difference highlighting algorithm (see the sketch after this task)
- [ ] Add quality metric badges and comparison controls
- [ ] Create transcript selection buttons (Use YouTube/Use Whisper)
- [ ] Test with real YouTube vs Whisper transcript differences
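The highlighting data could come from the backend comparison rather than being recomputed in the browser; a sketch using `difflib` opcodes, where the segment shape is an assumption:
```python
from difflib import SequenceMatcher
from typing import List, Tuple

def diff_segments(youtube_text: str, whisper_text: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs tagged 'equal', 'replace', 'insert', or 'delete'."""
    a, b = youtube_text.split(), whisper_text.split()
    segments: List[Tuple[str, str]] = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        # For deletions the affected words exist only on the YouTube side
        text = " ".join(a[i1:i2] if tag == "delete" else b[j1:j2])
        segments.append((tag, text))
    return segments
```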
#### Task 4.4: Update Processing UI (1 hour)
- [ ] Update ProgressTracker to show transcript source and method
- [ ] Add different progress messages for Whisper vs YouTube processing
- [ ] Show estimated time remaining for Whisper transcription
- [ ] Add error handling for Whisper processing failures with fallback notifications
### Phase 5: Testing and Integration (2 hours)
#### Task 5.1: Unit Tests (1 hour)
- [ ] Backend: test_whisper_transcription_accuracy, test_dual_transcript_comparison, test_automatic_fallback
- [ ] Frontend: TranscriptSelector component tests, form integration tests
- [ ] API: dual transcript endpoint tests, transcript option validation tests
- [ ] Verify >80% test coverage for new code
#### Task 5.2: Integration Testing (1 hour)
- [ ] End-to-end test: YouTube vs Whisper quality comparison with real video
- [ ] Admin page testing: verify transcript selector works without authentication
- [ ] Error scenarios: unavailable YouTube captions (fallback), Whisper processing failure
- [ ] Performance testing: benchmark Whisper processing times, verify cache effectiveness
## Testing Strategy
### Unit Tests
```python
# Test file: backend/tests/unit/test_whisper_transcript_service.py

def test_whisper_transcription_accuracy():
    """Test Whisper transcription produces expected quality"""
    ...

def test_dual_transcript_comparison():
    """Test quality comparison algorithm"""
    ...

def test_automatic_fallback():
    """Test fallback when YouTube captions unavailable"""
    ...
```
### Integration Tests
```python
# Test file: backend/tests/integration/test_dual_transcript_api.py

async def test_dual_transcript_endpoint():
    """Test dual transcript API with a real video"""
    ...

async def test_transcript_source_selection():
    """Test different source options return expected results"""
    ...
```
### User Acceptance Testing
- [ ] Test transcript quality comparison with 5 different video types
- [ ] Verify processing time estimates are accurate within 20%
- [ ] Confirm automatic fallback works for videos without captions
- [ ] Validate quality metrics correlate with perceived accuracy
## Dependencies and Prerequisites
### External Dependencies
- `openai-whisper`: Core Whisper transcription capability
- `torch`: PyTorch for Whisper model execution
- `librosa`: Audio processing and analysis
- `pydub`: Audio format conversion and manipulation
- `ffmpeg`: Audio/video processing (system dependency)
### Internal Dependencies
- Epic 3 Complete: User authentication and session management
- VideoDownloadService: Required for audio extraction
- EnhancedTranscriptService: Base service for integration
- WebSocket infrastructure: For real-time progress updates
### Hardware Requirements
- **CPU**: Multi-core processor (Whisper is CPU-intensive)
- **Memory**: 4GB+ RAM for "small" model, 8GB+ for "medium/large"
- **Storage**: Additional space for downloaded audio files
- **GPU**: Optional but recommended (CUDA support for faster processing)
## Performance Considerations
### Whisper Model Selection
- **tiny**: Fastest, lowest accuracy (39 M parameters)
- **base**: Good balance (74 M parameters)
- **small**: Recommended default (244 M parameters) ⭐
- **medium**: Higher accuracy (769 M parameters)
- **large**: Best accuracy (1550 M parameters)
### Optimization Strategies
- Use "small" model by default for balance of speed/accuracy
- Implement model caching to avoid reloading (see the sketch after this list)
- Add GPU detection and automatic CUDA usage
- Chunk long videos to prevent memory issues
- Cache Whisper results aggressively (longer TTL than YouTube captions)
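For the model-caching item above, a simple memoized loader is usually enough; `whisper.load_model()` is the real openai-whisper API, while the caching wrapper itself is a sketch:
```python
from functools import lru_cache

import whisper

@lru_cache(maxsize=2)
def get_whisper_model(model_size: str = "small", device: str = "cpu"):
    """Load each Whisper model at most once per (size, device) pair."""
    return whisper.load_model(model_size, device=device)
```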
## Risk Mitigation
### High Risk Items
1. **Processing Time**: Whisper can be slow for long videos
2. **Resource Usage**: High CPU/memory consumption
3. **Model Downloads**: Large model files on first run
4. **Audio Quality**: Poor audio affects Whisper accuracy
### Mitigation Strategies
1. **Time Estimates**: Clear user expectations and progress indicators
2. **Resource Monitoring**: Implement processing limits and queue management
3. **Model Management**: Pre-download models in Docker image
4. **Quality Checks**: Audio preprocessing and noise reduction
## Success Criteria
### Definition of Done
- [ ] Users can select between YouTube/Whisper/Both transcript options
- [ ] Real Whisper transcription integrated from archived codebase
- [ ] Processing time estimates accurate within 20%
- [ ] Quality comparison shows meaningful differences
- [ ] Automatic fallback works when YouTube captions unavailable
- [ ] All tests pass with >80% code coverage
- [ ] Performance acceptable (Whisper <2 minutes for 10-minute video)
- [ ] UI provides clear feedback during processing
- [ ] Database properly stores transcript metadata and quality scores
### Acceptance Testing Scenarios
1. **Standard Use Case**: Select Whisper for technical video, confirm accuracy improvement
2. **Comparison Mode**: Use "Compare Both" option, review side-by-side differences
3. **Fallback Scenario**: Process video without YouTube captions, verify Whisper fallback
4. **Long Video**: Process 30+ minute video, confirm chunking works properly
5. **Error Handling**: Test with corrupted audio, verify graceful error handling
## Post-Implementation Considerations
### Monitoring and Analytics
- Track transcript source usage patterns
- Monitor Whisper processing times and success rates
- Collect user feedback on transcript quality satisfaction
- Analyze cost implications of increased Whisper usage
### Future Enhancements
- Custom Whisper model fine-tuning for specific domains
- Speaker identification integration
- Real-time transcription for live streams
- Multi-language Whisper support beyond English
---
**Story Owner**: Development Team
**Reviewer**: Technical Lead
**Epic Reference**: Epic 4 - Advanced Intelligence & Developer Platform
**Story Status**: Ready for Implementation
**Last Updated**: 2025-08-27