# Story 4.1: Dual Transcript Options (YouTube + Whisper)
## Story Overview

**As a** user processing YouTube videos

**I want** to choose between YouTube captions and AI Whisper transcription

**So that** I can control the trade-off between speed/cost and transcription accuracy

**Story ID**: 4.1

**Epic**: Epic 4 - Advanced Intelligence & Developer Platform

**Priority**: High 🔥

**Effort Estimate**: 22 hours

**Sprint**: Epic 4, Sprint 1
## Business Value
### Problem Statement

The current YouTube Summarizer relies solely on YouTube's captions, which often have quality issues:

- Poor punctuation and capitalization
- Incorrect technical terms and proper nouns
- Missing speaker identification
- Timing inaccuracies with rapid speech
- Complete unavailability for some videos

Users need the option to choose higher-accuracy AI transcription when quality matters more than speed.
### Solution Value

- **User Control**: Choose the optimal transcript source for each use case
- **Quality Flexibility**: Fast YouTube captions vs. accurate AI transcription
- **Fallback Reliability**: Automatic Whisper backup when YouTube captions are unavailable
- **Transparency**: Clear cost/time implications for informed decisions
- **Leveraged Assets**: Reuse the proven TranscriptionService from archived projects

### Success Metrics

- [ ] 80%+ of users understand the transcript option differences
- [ ] 30%+ of users try the Whisper transcription option
- [ ] 25%+ improvement in transcript accuracy using Whisper
- [ ] <5% user complaints about transcript quality
- [ ] Zero failed transcriptions due to unavailable YouTube captions
## Acceptance Criteria

### AC 1: Transcript Source Selection UI

**Given** I am processing a YouTube video

**When** I access the transcript options

**Then** I see three clear choices:

- 📺 **YouTube Captions** (Fast, Free, 2-5 seconds)
- 🎯 **AI Whisper** (Slower, Higher Quality, 30-120 seconds)
- 🔄 **Compare Both** (Best Quality, Shows differences)

**And** each option shows its estimated processing time and quality level

### AC 2: YouTube Transcript Processing (Default)

**Given** I select the "YouTube Captions" option

**When** I submit a video URL

**Then** the system extracts the transcript using existing YouTube API methods

**And** processing completes in under 5 seconds

**And** the transcript source is marked as "youtube" in the database

**And** a quality score is calculated based on caption availability

### AC 3: Whisper Transcript Processing

**Given** I select the "AI Whisper" option

**When** I submit a video URL

**Then** the system downloads the video audio using the existing VideoDownloadService

**And** processes the audio through the integrated TranscriptionService

**And** returns a high-quality transcript with timestamps

**And** the transcript source is marked as "whisper" in the database

**And** the processing time is clearly communicated to the user

### AC 4: Dual Transcript Comparison

**Given** I select the "Compare Both" option

**When** processing completes

**Then** I see a side-by-side transcript comparison

**And** differences are highlighted (word accuracy, punctuation, technical terms)

**And** quality metrics are shown for each transcript

**And** I can switch between transcripts for summary generation

### AC 5: Automatic Fallback

**Given** YouTube captions are unavailable for a video

**When** I select the "YouTube Captions" option

**Then** the system automatically falls back to Whisper transcription

**And** the user is notified of the fallback with a processing time estimate

**And** the final result shows "whisper" as the source method

### AC 6: Quality and Cost Transparency

**Given** I'm choosing transcript options

**When** I view the selection interface

**Then** I see clear indicators:

- Processing time estimates (YouTube: 2-5s, Whisper: 30-120s)
- Quality indicators (YouTube: "Standard", Whisper: "High Accuracy")
- Availability status (YouTube: "May not be available", Whisper: "Always available")
- Cost implications (YouTube: "Free", Whisper: "Uses compute resources")
## Technical Implementation
### Architecture Components

#### 1. Real Whisper Service Integration

```python
# File: backend/services/whisper_transcript_service.py
from pathlib import Path
from typing import Any, Dict

# Import path assumed; TranscriptionService is copied in from the archived project
from backend.services.transcription_service import TranscriptionService


class WhisperTranscriptService:
    """Real Whisper service replacing MockWhisperService."""

    def __init__(self, model_size: str = "small"):
        # Reuse the proven TranscriptionService copied from the archived project
        self.transcription_service = TranscriptionService(
            repository=None,  # YouTube context doesn't need episode storage
            model_size=model_size,
            device="auto",
        )

    async def transcribe_audio(self, audio_path: Path) -> Dict[str, Any]:
        """Use the proven Whisper implementation from the archived project."""
        segments = await self.transcription_service._transcribe_audio_file(
            str(audio_path)
        )
        return self._convert_to_api_format(segments)
```
#### 2. Enhanced Transcript Service

```python
# File: backend/services/enhanced_transcript_service.py
import asyncio
from typing import Dict, Optional


class DualTranscriptService(EnhancedTranscriptService):
    """Enhanced service with real Whisper integration."""

    async def extract_dual_transcripts(
        self, video_id: str
    ) -> Dict[str, Optional[TranscriptResult]]:
        """Extract both YouTube and Whisper transcripts in parallel."""

        # Launch both extractions concurrently
        youtube_task = asyncio.create_task(
            self._extract_youtube_transcript(video_id)
        )
        whisper_task = asyncio.create_task(
            self._extract_whisper_transcript(video_id)
        )

        # return_exceptions=True lets one source fail without losing the other
        youtube_result, whisper_result = await asyncio.gather(
            youtube_task, whisper_task, return_exceptions=True
        )

        return {
            'youtube': youtube_result if not isinstance(youtube_result, Exception) else None,
            'whisper': whisper_result if not isinstance(whisper_result, Exception) else None,
        }
```
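
The single-source path used by the API endpoint below, including the automatic fallback required by AC 5, is not spelled out in this story. A minimal sketch, assuming the YouTube extractor raises a `TranscriptUnavailableError` and a `_notify_fallback` progress hook exists (both hypothetical names), could look like:

```python
# Sketch only: AC 5 fallback; TranscriptUnavailableError and _notify_fallback are assumed names
async def extract_transcript(
    self, video_id: str, force_whisper: bool = False
) -> TranscriptResult:
    """Return a single transcript, falling back to Whisper when captions are missing."""
    if not force_whisper:
        try:
            result = await self._extract_youtube_transcript(video_id)
            result.source = "youtube"
            return result
        except TranscriptUnavailableError:
            # AC 5: notify the user of the fallback with a time estimate
            await self._notify_fallback(video_id)

    result = await self._extract_whisper_transcript(video_id)
    result.source = "whisper"
    return result
```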
#### 3. Frontend Transcript Selector

```tsx
// File: frontend/src/components/TranscriptSelector.tsx
// Import paths assume the project's Radix-based UI kit (e.g. shadcn/ui)
import { RadioGroup, RadioGroupItem } from '@/components/ui/radio-group';
import { Label } from '@/components/ui/label';

type TranscriptSource = 'youtube' | 'whisper' | 'both';

interface TranscriptSelectorProps {
  onSourceChange: (source: TranscriptSource) => void;
  estimatedDuration?: number; // video length in seconds
}

// Rough estimate only: assumes Whisper runs at ~25% of video runtime, with a 30s floor
const estimateWhisperTime = (durationSeconds?: number): string =>
  durationSeconds
    ? `~${Math.max(30, Math.round(durationSeconds * 0.25))} seconds`
    : '30-120 seconds';

const TranscriptSelector = ({ onSourceChange, estimatedDuration }: TranscriptSelectorProps) => {
  return (
    <div className="transcript-selector">
      <RadioGroup onValueChange={(value) => onSourceChange(value as TranscriptSource)}>
        <div className="transcript-option">
          <RadioGroupItem value="youtube" />
          <Label>
            📺 YouTube Captions
            <span className="option-details">Fast, Free • 2-5 seconds</span>
            <span className="quality-indicator">Standard Quality</span>
          </Label>
        </div>

        <div className="transcript-option">
          <RadioGroupItem value="whisper" />
          <Label>
            🎯 AI Whisper Transcription
            <span className="option-details">
              Higher Accuracy • {estimateWhisperTime(estimatedDuration)}
            </span>
            <span className="quality-indicator">High Accuracy</span>
          </Label>
        </div>

        <div className="transcript-option">
          <RadioGroupItem value="both" />
          <Label>
            🔄 Compare Both
            <span className="option-details">Best Quality • See differences</span>
            <span className="quality-indicator">Quality Comparison</span>
          </Label>
        </div>
      </RadioGroup>
    </div>
  );
};
```
#### 4. Transcript Comparison UI

```tsx
// File: frontend/src/components/TranscriptComparison.tsx
// Prop types added for clarity; TranscriptResult and the child components
// (QualityBadge, TranscriptPanel, Button) are defined elsewhere in the frontend.
interface TranscriptComparisonProps {
  youtubeTranscript: TranscriptResult;
  whisperTranscript: TranscriptResult;
  onSelect: (source: 'youtube' | 'whisper') => void; // parent decides which transcript feeds the summary
}

const TranscriptComparison = ({
  youtubeTranscript,
  whisperTranscript,
  onSelect,
}: TranscriptComparisonProps) => {
  return (
    <div className="transcript-comparison">
      <div className="comparison-header">
        <h3>Transcript Quality Comparison</h3>
        <div className="quality-metrics">
          <QualityBadge source="youtube" score={youtubeTranscript.qualityScore} />
          <QualityBadge source="whisper" score={whisperTranscript.qualityScore} />
        </div>
      </div>

      <div className="side-by-side-comparison">
        <TranscriptPanel
          title="📺 YouTube Captions"
          transcript={youtubeTranscript}
          highlightDifferences={true}
        />
        <TranscriptPanel
          title="🎯 AI Whisper"
          transcript={whisperTranscript}
          highlightDifferences={true}
        />
      </div>

      <div className="comparison-controls">
        <Button onClick={() => onSelect('youtube')}>
          Use YouTube Transcript
        </Button>
        <Button onClick={() => onSelect('whisper')}>
          Use Whisper Transcript
        </Button>
      </div>
    </div>
  );
};
```
### Database Schema Changes

```sql
-- Extend summaries table for transcript metadata
ALTER TABLE summaries
ADD COLUMN transcript_source VARCHAR(20), -- 'youtube', 'whisper', 'both'
ADD COLUMN transcript_quality_score FLOAT,
ADD COLUMN youtube_transcript TEXT,
ADD COLUMN whisper_transcript TEXT,
ADD COLUMN whisper_processing_time FLOAT,
ADD COLUMN transcript_comparison_data JSON;

-- Create index for transcript source queries
CREATE INDEX idx_summaries_transcript_source ON summaries(transcript_source);
```
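
On the ORM side, the matching `Summary` fields might look like the following sketch (SQLAlchemy is assumed from the Alembic task in Phase 3; only the new columns are shown):

```python
# Sketch: new Summary columns mirroring the migration above (SQLAlchemy assumed)
from sqlalchemy import JSON, Column, Float, String, Text


class Summary(Base):  # existing model; unchanged fields omitted
    __tablename__ = "summaries"

    transcript_source = Column(String(20), index=True)  # 'youtube', 'whisper', 'both'
    transcript_quality_score = Column(Float)
    youtube_transcript = Column(Text)
    whisper_transcript = Column(Text)
    whisper_processing_time = Column(Float)  # seconds
    transcript_comparison_data = Column(JSON)
```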
### API Endpoint Changes

```python
# File: backend/api/transcript.py
from fastapi import APIRouter, BackgroundTasks, Depends

router = APIRouter()


@router.post("/api/transcripts/dual/{video_id}")
async def get_dual_transcripts(
    video_id: str,
    options: TranscriptOptionsRequest,
    background_tasks: BackgroundTasks,
    current_user: User = Depends(get_current_user)
):
    """Get transcripts with the specified source option."""

    if options.source == "both":
        results = await dual_transcript_service.extract_dual_transcripts(video_id)
        return TranscriptComparisonResponse(
            video_id=video_id,
            transcripts=results,
            quality_comparison=analyze_transcript_differences(results)
        )

    # "whisper" forces AI transcription; "youtube" (default) tries captions first
    transcript = await dual_transcript_service.extract_transcript(
        video_id,
        force_whisper=(options.source == "whisper")
    )
    return TranscriptResponse(video_id=video_id, transcript=transcript)
```
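
`analyze_transcript_differences` is not defined in this story; a minimal word-level sketch, assuming Python's standard `difflib` and transcript objects exposing a `.text` attribute, could be:

```python
# Sketch: word-level comparison for quality_comparison (difflib-based assumption)
from difflib import SequenceMatcher
from typing import Any, Dict


def analyze_transcript_differences(results: Dict[str, Any]) -> Dict[str, Any]:
    """Compare the YouTube and Whisper transcripts at the word level."""
    youtube = results.get("youtube")
    whisper = results.get("whisper")
    if youtube is None or whisper is None:
        return {"comparable": False}

    yt_words = youtube.text.split()
    wh_words = whisper.text.split()
    matcher = SequenceMatcher(a=yt_words, b=wh_words)

    return {
        "comparable": True,
        "word_similarity": round(matcher.ratio(), 3),  # 0.0-1.0 overlap
        "diff_opcodes": matcher.get_opcodes(),         # spans the UI can highlight
    }
```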
## Development Tasks

### Phase 1: Backend Foundation (8 hours)

#### Task 1.1: Copy and Adapt TranscriptionService (3 hours)

- [x] Copy TranscriptionService from `archived_projects/personal-ai-assistant-v1.1.0/src/services/transcription_service.py`
- [ ] Remove podcast-specific database dependencies (PodcastEpisode, PodcastTranscript, Repository)
- [ ] Adapt for YouTube video context (transcribe_episode → transcribe_audio_file)
- [ ] Create WhisperTranscriptService wrapper class
- [ ] Add async compatibility with asyncio.run_in_executor() (see the sketch after this list)
- [ ] Test basic transcription functionality with sample audio
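
A minimal sketch of that executor wrapper, assuming the adapted service exposes a synchronous `transcribe_audio_file` method (per the rename in this list):

```python
# Sketch: running the blocking Whisper call off the event loop
import asyncio
from functools import partial
from pathlib import Path


async def transcribe_in_executor(service, audio_path: Path):
    """Run the synchronous Whisper transcription in a worker thread."""
    loop = asyncio.get_running_loop()
    # None selects the default ThreadPoolExecutor; most Whisper time is spent
    # in native torch code, so a thread is sufficient here
    return await loop.run_in_executor(
        None, partial(service.transcribe_audio_file, str(audio_path))
    )
```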
#### Task 1.2: Update Dependencies and Environment (2 hours)

- [ ] Add to backend/requirements.txt: openai-whisper, torch, librosa, pydub, soundfile
- [ ] Update Dockerfile with ffmpeg system dependency
- [ ] Test Whisper model download (small model, ~244 MB)
- [ ] Configure environment variables (WHISPER_MODEL_SIZE, WHISPER_DEVICE)
- [ ] Verify CUDA detection works if a GPU is available (see the sketch after this list)
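
Device resolution can stay simple; a sketch using the WHISPER_DEVICE variable from the task above (the "auto" default falling back to CPU is an assumption):

```python
# Sketch: resolve WHISPER_DEVICE, auto-detecting CUDA when unset
import os

import torch


def resolve_whisper_device() -> str:
    """Honor WHISPER_DEVICE if set; otherwise pick CUDA when available."""
    device = os.getenv("WHISPER_DEVICE", "auto")
    if device == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return device
```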
#### Task 1.3: Replace MockWhisperService (3 hours)

- [ ] Update EnhancedTranscriptService to use the real WhisperTranscriptService
- [ ] Remove MockWhisperService references throughout the codebase
- [ ] Update dependency injection in main.py and service factories
- [ ] Test integration with existing VideoDownloadService audio extraction
- [ ] Verify async compatibility with the existing pipeline architecture

### Phase 2: API Enhancement (4 hours)

#### Task 2.1: Create DualTranscriptService (2 hours)

- [ ] Create DualTranscriptService class extending EnhancedTranscriptService
- [ ] Implement extract_dual_transcripts() with parallel YouTube/Whisper processing
- [ ] Add quality comparison algorithm (word-level diff, confidence scoring)
- [ ] Implement separate caching strategy for YouTube vs Whisper results
- [ ] Test parallel processing performance and error handling

#### Task 2.2: Add New API Endpoints (2 hours)

- [ ] Create TranscriptOptionsRequest model (source, whisper_model, language, timestamps) (see the sketch after this list)
- [ ] Add /api/transcripts/dual/{video_id} endpoint for dual transcript processing
- [ ] Update existing pipeline to accept transcript source preference
- [ ] Add TranscriptComparisonResponse model for side-by-side results
- [ ] Test all endpoint variations (youtube, whisper, both)
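
A possible shape for the request model, sketched with Pydantic (implied by FastAPI); the field defaults are assumptions:

```python
# Sketch: request model fields from Task 2.2 (defaults are assumptions)
from typing import Literal, Optional

from pydantic import BaseModel


class TranscriptOptionsRequest(BaseModel):
    source: Literal["youtube", "whisper", "both"] = "youtube"
    whisper_model: str = "small"    # tiny/base/small/medium/large
    language: Optional[str] = None  # None lets Whisper auto-detect
    timestamps: bool = True         # include segment timestamps
```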
### Phase 3: Database Schema (2 hours)

#### Task 3.1: Extend Summary Model (1 hour)

- [ ] Create Alembic migration for the new transcript fields
- [ ] Add columns: transcript_source, transcript_quality_score, youtube_transcript, whisper_transcript, whisper_processing_time, transcript_comparison_data
- [ ] Update Summary model class with the new fields
- [ ] Update repository methods for transcript metadata storage

#### Task 3.2: Add Performance Indexes (1 hour)

- [ ] Create indexes on transcript_source, quality_score, processing_time
- [ ] Test query performance with sample data
- [ ] Verify the migration runs successfully on the development database

### Phase 4: Frontend Implementation (6 hours)

#### Task 4.1: Create TranscriptSelector Component (2 hours)

- [ ] Create TranscriptSelector.tsx with Radix UI RadioGroup
- [ ] Add processing time estimation based on video duration
- [ ] Implement visual indicators (icons, quality badges, time estimates)
- [ ] Add proper TypeScript interfaces and accessibility labels
- [ ] Test responsive design and keyboard navigation

#### Task 4.2: Integrate with SummarizeForm (1 hour)

- [ ] Add transcript source state to the SummarizeForm component
- [ ] Update form submission to include transcript options
- [ ] Add form validation for transcript source selection
- [ ] Test integration with the existing admin page workflow

#### Task 4.3: Create TranscriptComparison Component (2 hours)

- [ ] Create TranscriptComparison.tsx for side-by-side display
- [ ] Implement word-level difference highlighting algorithm
- [ ] Add quality metric badges and comparison controls
- [ ] Create transcript selection buttons (Use YouTube/Use Whisper)
- [ ] Test with real YouTube vs Whisper transcript differences

#### Task 4.4: Update Processing UI (1 hour)

- [ ] Update ProgressTracker to show transcript source and method
- [ ] Add distinct progress messages for Whisper vs YouTube processing
- [ ] Show estimated time remaining for Whisper transcription
- [ ] Add error handling for Whisper processing failures, with fallback notifications

### Phase 5: Testing and Integration (2 hours)

#### Task 5.1: Unit Tests (1 hour)

- [ ] Backend: test_whisper_transcription_accuracy, test_dual_transcript_comparison, test_automatic_fallback
- [ ] Frontend: TranscriptSelector component tests, form integration tests
- [ ] API: dual transcript endpoint tests, transcript option validation tests
- [ ] Verify >80% test coverage for new code

#### Task 5.2: Integration Testing (1 hour)

- [ ] End-to-end test: YouTube vs Whisper quality comparison with a real video
- [ ] Admin page testing: verify the transcript selector works without authentication
- [ ] Error scenarios: unavailable YouTube captions (fallback), Whisper processing failure
- [ ] Performance testing: benchmark Whisper processing times, verify cache effectiveness
## Testing Strategy

### Unit Tests

```python
# Test file: backend/tests/unit/test_whisper_transcript_service.py
def test_whisper_transcription_accuracy():
    """Test Whisper transcription produces expected quality"""

def test_dual_transcript_comparison():
    """Test quality comparison algorithm"""

def test_automatic_fallback():
    """Test fallback when YouTube captions unavailable"""
```
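
As one illustration, the fallback test could be fleshed out like this sketch (pytest with pytest-asyncio assumed; the `dual_transcript_service` fixture and `TranscriptUnavailableError` name match the hypothetical fallback sketch earlier in this story):

```python
# Sketch: fallback unit test (fixture and error names are assumptions)
import pytest


@pytest.mark.asyncio
async def test_automatic_fallback(monkeypatch, dual_transcript_service):
    """Captions unavailable -> result must come from Whisper."""
    async def captions_unavailable(video_id):
        raise TranscriptUnavailableError(video_id)

    monkeypatch.setattr(
        dual_transcript_service, "_extract_youtube_transcript", captions_unavailable
    )

    result = await dual_transcript_service.extract_transcript("abc123")
    assert result.source == "whisper"
```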
### Integration Tests

```python
# Test file: backend/tests/integration/test_dual_transcript_api.py
async def test_dual_transcript_endpoint():
    """Test dual transcript API with real video"""

async def test_transcript_source_selection():
    """Test different source options return expected results"""
```

### User Acceptance Testing

- [ ] Test transcript quality comparison with 5 different video types
- [ ] Verify processing time estimates are accurate within 20%
- [ ] Confirm automatic fallback works for videos without captions
- [ ] Validate quality metrics correlate with perceived accuracy
## Dependencies and Prerequisites
### External Dependencies

- `openai-whisper`: Core Whisper transcription capability
- `torch`: PyTorch for Whisper model execution
- `librosa`: Audio processing and analysis
- `pydub`: Audio format conversion and manipulation
- `ffmpeg`: Audio/video processing (system dependency)

### Internal Dependencies

- Epic 3 complete: User authentication and session management
- VideoDownloadService: Required for audio extraction
- EnhancedTranscriptService: Base service for integration
- WebSocket infrastructure: For real-time progress updates

### Hardware Requirements

- **CPU**: Multi-core processor (Whisper is CPU-intensive)
- **Memory**: 4GB+ RAM for the "small" model, 8GB+ for "medium"/"large"
- **Storage**: Additional space for downloaded audio files
- **GPU**: Optional but recommended (CUDA support for faster processing)
## Performance Considerations

### Whisper Model Selection

- **tiny**: Fastest, lowest accuracy (39 MB)
- **base**: Good balance (74 MB)
- **small**: Recommended default (244 MB) ⭐
- **medium**: Higher accuracy (769 MB)
- **large**: Best accuracy (1550 MB)

### Optimization Strategies

- Use the "small" model by default to balance speed and accuracy
- Implement model caching to avoid reloading (see the sketch after this list)
- Add GPU detection and automatic CUDA usage
- Chunk long videos to prevent memory issues
- Cache Whisper results aggressively (longer TTL than YouTube captions)
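
Model caching can be as simple as memoizing the loader. A sketch using the real `whisper.load_model` API from openai-whisper (the cache size of 1 assumes a single default model per deployment):

```python
# Sketch: keep the loaded Whisper model in memory instead of reloading per request
from functools import lru_cache

import whisper  # openai-whisper


@lru_cache(maxsize=1)  # one cached model is enough for a single default size
def get_whisper_model(model_size: str = "small", device: str = "cpu"):
    """Load (and cache) the Whisper model; the first call downloads weights if needed."""
    return whisper.load_model(model_size, device=device)
```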
## Risk Mitigation

### High Risk Items

1. **Processing Time**: Whisper can be slow for long videos
2. **Resource Usage**: High CPU/memory consumption
3. **Model Downloads**: Large model files on first run
4. **Audio Quality**: Poor audio affects Whisper accuracy

### Mitigation Strategies

1. **Time Estimates**: Set clear user expectations with progress indicators
2. **Resource Monitoring**: Implement processing limits and queue management
3. **Model Management**: Pre-download models in the Docker image
4. **Quality Checks**: Audio preprocessing and noise reduction

## Success Criteria

### Definition of Done

- [ ] Users can select between YouTube/Whisper/Both transcript options
- [ ] Real Whisper transcription integrated from the archived codebase
- [ ] Processing time estimates accurate within 20%
- [ ] Quality comparison shows meaningful differences
- [ ] Automatic fallback works when YouTube captions are unavailable
- [ ] All tests pass with >80% code coverage
- [ ] Performance acceptable (Whisper under 2 minutes for a 10-minute video)
- [ ] UI provides clear feedback during processing
- [ ] Database properly stores transcript metadata and quality scores

### Acceptance Testing Scenarios

1. **Standard Use Case**: Select Whisper for a technical video, confirm accuracy improvement
2. **Comparison Mode**: Use the "Compare Both" option, review side-by-side differences
3. **Fallback Scenario**: Process a video without YouTube captions, verify Whisper fallback
4. **Long Video**: Process a 30+ minute video, confirm chunking works properly
5. **Error Handling**: Test with corrupted audio, verify graceful error handling

## Post-Implementation Considerations

### Monitoring and Analytics

- Track transcript source usage patterns
- Monitor Whisper processing times and success rates
- Collect user feedback on transcript quality satisfaction
- Analyze the cost implications of increased Whisper usage

### Future Enhancements

- Custom Whisper model fine-tuning for specific domains
- Speaker identification integration
- Real-time transcription for live streams
- Multi-language Whisper support beyond English

---

**Story Owner**: Development Team

**Reviewer**: Technical Lead

**Epic Reference**: Epic 4 - Advanced Intelligence & Developer Platform

**Story Status**: Ready for Implementation

**Last Updated**: 2025-08-27