# Story 4.1: Dual Transcript Options - Implementation Plan
## Overview
This implementation plan details the step-by-step approach to implement dual transcript options (YouTube + Whisper) in the YouTube Summarizer project, leveraging existing proven Whisper integration from archived projects.
**Story**: 4.1 - Dual Transcript Options (YouTube + Whisper)
**Estimated Effort**: 22 hours
**Priority**: High 🔥
**Target Completion**: 5 development days (see Timeline and Milestones)
## Prerequisites Analysis
### Current Codebase Status ✅
**Backend Dependencies Available**:
- FastAPI, Pydantic, SQLAlchemy - Core framework ready
- Anthropic API integration - AI summarization ready
- Enhanced transcript service - Architecture foundation exists
- Video download service - Audio extraction capability ready
- WebSocket infrastructure - Real-time updates ready
**Frontend Dependencies Available**:
- React 18, TypeScript, Vite - Modern frontend ready
- shadcn/ui, Radix UI components - UI components available
- TanStack Query - State management ready
- Admin page infrastructure - Testing environment ready
**Missing Dependencies to Add**:
```bash
# Backend requirements additions
openai-whisper==20231117
torch>=2.0.0
librosa==0.10.1
pydub==0.25.1
soundfile==0.12.1
# System dependency (Docker/local)
ffmpeg
```
### Architecture Integration Points
**Existing Services to Modify**:
1. `backend/services/enhanced_transcript_service.py` - Replace MockWhisperService
2. `backend/api/transcript.py` - Add dual transcript endpoints
3. `frontend/src/components/forms/SummarizeForm.tsx` - Add transcript selector
4. Database models - Add transcript metadata fields
**New Components to Create**:
1. `backend/services/whisper_transcript_service.py` - Real Whisper integration
2. `frontend/src/components/TranscriptSelector.tsx` - Source selection UI
3. `frontend/src/components/TranscriptComparison.tsx` - Quality comparison UI
## Task Breakdown with Time Estimates
### Phase 1: Backend Foundation (8 hours)
#### Task 1.1: Copy and Adapt TranscriptionService (3 hours)
**Objective**: Integrate proven Whisper service from archived project
**Subtasks**:
- [x] **Copy source code** (30 min)
```bash
cp archived_projects/personal-ai-assistant-v1.1.0/src/services/transcription_service.py \
apps/youtube-summarizer/backend/services/whisper_transcript_service.py
```
- [ ] **Remove podcast dependencies** (45 min)
- Remove `PodcastEpisode`, `PodcastTranscript` imports
- Remove repository dependency injection
- Simplify constructor to not require database repository
- [ ] **Adapt for YouTube context** (60 min)
- Rename `transcribe_episode()` to `transcribe_audio_file()`
- Modify segment storage to return data instead of database writes
- Update error handling for YouTube-specific scenarios
- [ ] **Add async compatibility** (45 min); a sketch follows this list
- Wrap synchronous Whisper calls with `loop.run_in_executor()` (or `asyncio.to_thread()` on Python 3.9+)
- Update method signatures to async/await pattern
- Test async integration with existing services
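A minimal sketch of that wrapper, assuming the service holds a loaded Whisper model (class and method names follow this plan; everything else is illustrative):
```python
import asyncio
from functools import partial

import whisper


class WhisperTranscriptService:
    def __init__(self, model_size: str = 'small'):
        # Model loading is slow; do it once at service construction
        self.model = whisper.load_model(model_size)

    async def transcribe_audio_file(self, audio_path: str, language: str = 'en') -> dict:
        """Run the blocking Whisper call in a worker thread so the event loop stays responsive."""
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None,  # default ThreadPoolExecutor
            partial(self.model.transcribe, audio_path, language=language),
        )
```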
**Deliverable**: Working `WhisperTranscriptService` class
#### Task 1.2: Update Dependencies and Environment (2 hours)
**Objective**: Add Whisper dependencies and test environment setup
**Subtasks**:
- [ ] **Update requirements.txt** (15 min)
```bash
# Add to backend/requirements.txt
openai-whisper==20231117
torch>=2.0.0
librosa==0.10.1
pydub==0.25.1
soundfile==0.12.1
```
- [ ] **Update Docker configuration** (45 min)
```dockerfile
# Add to backend/Dockerfile
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*
# Whisper and its pinned dependencies come in via the updated requirements.txt
RUN pip install -r requirements.txt
```
- [ ] **Test Whisper model download** (30 min)
- Test "small" model download (~244MB)
- Verify CUDA detection works (if available)
- Add model caching directory configuration
- [ ] **Environment configuration** (30 min)
```bash
# Add to .env
WHISPER_MODEL_SIZE=small
WHISPER_DEVICE=auto
WHISPER_MODEL_CACHE=/tmp/whisper_models
```
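A sketch of how the service might consume these settings; note that `auto` is this project's own convention, resolved through `torch`, not a Whisper flag:
```python
import os

import torch
import whisper


def load_whisper_model():
    model_size = os.getenv('WHISPER_MODEL_SIZE', 'small')
    device = os.getenv('WHISPER_DEVICE', 'auto')
    if device == 'auto':
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    cache_dir = os.getenv('WHISPER_MODEL_CACHE', '/tmp/whisper_models')
    # download_root keeps checkpoints on disk between container restarts
    return whisper.load_model(model_size, device=device, download_root=cache_dir)
```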
**Deliverable**: Environment ready for Whisper integration
#### Task 1.3: Replace MockWhisperService (3 hours)
**Objective**: Integrate real Whisper service into existing enhanced transcript service
**Subtasks**:
- [ ] **Update EnhancedTranscriptService** (90 min)
```python
# In enhanced_transcript_service.py
import os

from .whisper_transcript_service import WhisperTranscriptService

# Replace the MockWhisperService instantiation
self.whisper_service = WhisperTranscriptService(
    model_size=os.getenv('WHISPER_MODEL_SIZE', 'small')
)
```
- [ ] **Update dependency injection** (30 min)
- Modify `main.py` service initialization
- Update FastAPI dependency functions
- Ensure proper service lifecycle management
- [ ] **Test integration** (60 min)
- Unit test with sample audio file
- Integration test with video download service
- Verify transcript quality and timing
**Deliverable**: Working Whisper integration in existing service
### Phase 2: API Enhancement (4 hours)
#### Task 2.1: Create Dual Transcript Service (2 hours)
**Objective**: Implement service for handling dual transcript extraction
**Subtasks**:
- [ ] **Create DualTranscriptService class** (60 min)
```python
class DualTranscriptService(EnhancedTranscriptService):
    async def extract_dual_transcripts(self, video_id: str) -> Dict[str, TranscriptResult]:
        # Process YouTube and Whisper extraction in parallel
        youtube_task = self._extract_youtube_transcript(video_id)
        whisper_task = self._extract_whisper_transcript(video_id)
        results = await asyncio.gather(
            youtube_task, whisper_task, return_exceptions=True
        )
        return {'youtube': results[0], 'whisper': results[1]}
```
- [ ] **Implement quality comparison** (45 min); a `difflib` sketch follows this list
- Word-by-word accuracy comparison algorithm
- Confidence score calculation
- Timing precision analysis
- [ ] **Add caching for dual results** (15 min)
- Cache YouTube and Whisper results separately
- Extended TTL for Whisper (more expensive to regenerate)
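For the comparison, a first pass can use the standard library rather than a bespoke algorithm; a `difflib` sketch (the returned field names are assumptions):
```python
from difflib import SequenceMatcher


def compare_transcripts(youtube_text: str, whisper_text: str) -> dict:
    """Word-level similarity between the two transcripts (0.0 to 1.0)."""
    yt_words = youtube_text.lower().split()
    wh_words = whisper_text.lower().split()
    matcher = SequenceMatcher(None, yt_words, wh_words)
    return {
        'similarity': matcher.ratio(),
        # 'replace'/'delete'/'insert' spans; these can drive the
        # difference highlighting in the Task 4.3 comparison UI
        'diff_opcodes': matcher.get_opcodes(),
        'youtube_word_count': len(yt_words),
        'whisper_word_count': len(wh_words),
    }
```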
**Deliverable**: DualTranscriptService with parallel processing
#### Task 2.2: Add New API Endpoints (2 hours)
**Objective**: Create API endpoints for transcript source selection
**Subtasks**:
- [ ] **Create transcript selection models** (30 min)
```python
from typing import Literal

from pydantic import BaseModel


class TranscriptOptionsRequest(BaseModel):
    source: Literal['youtube', 'whisper', 'both'] = 'youtube'
    whisper_model: Literal['tiny', 'base', 'small', 'medium'] = 'small'
    language: str = 'en'
    include_timestamps: bool = True
```
- [ ] **Add dual transcript endpoint** (60 min); the response model is sketched after this list
```python
@router.post("/api/transcripts/dual/{video_id}")
async def get_dual_transcripts(
    video_id: str,
    options: TranscriptOptionsRequest,
    current_user: User = Depends(get_current_user),
) -> TranscriptComparisonResponse:
    ...  # delegate to DualTranscriptService.extract_dual_transcripts
```
- [ ] **Update existing pipeline to use transcript options** (30 min)
- Modify `SummaryPipeline` to accept transcript source preference
- Update processing status to show transcript method
- Add transcript quality metrics to summary result
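The endpoint above references a `TranscriptComparisonResponse` that does not exist yet; a minimal Pydantic sketch, with field names as assumptions to reconcile with the frontend types:
```python
from typing import Optional

from pydantic import BaseModel


class TranscriptData(BaseModel):
    source: str                      # 'youtube' or 'whisper'
    text: str
    processing_time_seconds: float
    quality_score: Optional[float] = None


class TranscriptComparisonResponse(BaseModel):
    video_id: str
    youtube: Optional[TranscriptData] = None   # None when captions are unavailable
    whisper: Optional[TranscriptData] = None   # None when not requested or failed
    similarity: Optional[float] = None         # from the comparison step in Task 2.1
    recommended_source: Optional[str] = None
```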
**Deliverable**: New API endpoints for transcript selection
### Phase 3: Database Schema Updates (2 hours)
#### Task 3.1: Extend Summary Model (1 hour)
**Objective**: Add fields for transcript metadata and quality tracking
**Subtasks**:
- [ ] **Create database migration** (30 min); an Alembic sketch follows this list
```sql
ALTER TABLE summaries
ADD COLUMN transcript_source VARCHAR(20),
ADD COLUMN transcript_quality_score FLOAT,
ADD COLUMN youtube_transcript TEXT,
ADD COLUMN whisper_transcript TEXT,
ADD COLUMN whisper_processing_time FLOAT,
ADD COLUMN transcript_comparison_data JSON;
```
- [ ] **Update Summary model** (20 min)
```python
# Add to backend/models/summary.py
transcript_source = Column(String(20)) # 'youtube', 'whisper', 'both'
transcript_quality_score = Column(Float)
youtube_transcript = Column(Text)
whisper_transcript = Column(Text)
whisper_processing_time = Column(Float)
transcript_comparison_data = Column(JSON)
```
- [ ] **Update repository methods** (10 min)
- Add methods for storing dual transcript data
- Add queries for transcript source filtering
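Since local setup runs `alembic upgrade head`, the SQL above would normally live in an Alembic revision; a sketch (revision identifiers are placeholders):
```python
"""Add dual transcript fields to summaries."""
import sqlalchemy as sa
from alembic import op

revision = 'add_dual_transcripts'  # placeholder; normally generated by `alembic revision`
down_revision = None               # set to the current head revision


def upgrade() -> None:
    op.add_column('summaries', sa.Column('transcript_source', sa.String(20)))
    op.add_column('summaries', sa.Column('transcript_quality_score', sa.Float()))
    op.add_column('summaries', sa.Column('youtube_transcript', sa.Text()))
    op.add_column('summaries', sa.Column('whisper_transcript', sa.Text()))
    op.add_column('summaries', sa.Column('whisper_processing_time', sa.Float()))
    op.add_column('summaries', sa.Column('transcript_comparison_data', sa.JSON()))


def downgrade() -> None:
    for col in ('transcript_comparison_data', 'whisper_processing_time',
                'whisper_transcript', 'youtube_transcript',
                'transcript_quality_score', 'transcript_source'):
        op.drop_column('summaries', col)
```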
**Deliverable**: Database schema ready for dual transcripts
#### Task 3.2: Add Performance Indexes (1 hour)
**Objective**: Optimize database queries for transcript operations
**Subtasks**:
- [ ] **Create performance indexes** (30 min)
```sql
CREATE INDEX idx_summaries_transcript_source ON summaries(transcript_source);
CREATE INDEX idx_summaries_quality_score ON summaries(transcript_quality_score);
CREATE INDEX idx_summaries_processing_time ON summaries(whisper_processing_time);
```
- [ ] **Test query performance** (20 min); see the `EXPLAIN` sketch after this list
- Verify index usage with EXPLAIN queries
- Test filtering by transcript source
- Benchmark query times with sample data
- [ ] **Run migration and test** (10 min)
- Apply migration to development database
- Verify all fields accessible
- Test with sample data insertion
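Index verification can go through the existing SQLAlchemy session; a sketch (exact `EXPLAIN` syntax and plan output vary by database engine):
```python
from sqlalchemy import text


def check_index_usage(session) -> None:
    # Prefix the real query with EXPLAIN and look for
    # idx_summaries_transcript_source in the resulting plan
    plan = session.execute(
        text("EXPLAIN SELECT id FROM summaries WHERE transcript_source = :src"),
        {'src': 'whisper'},
    )
    for row in plan:
        print(row)
```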
**Deliverable**: Optimized database schema
### Phase 4: Frontend Implementation (6 hours)
#### Task 4.1: Create TranscriptSelector Component (2 hours)
**Objective**: UI component for transcript source selection
**Subtasks**:
- [ ] **Create base component** (45 min)
```tsx
type TranscriptSource = 'youtube' | 'whisper' | 'both'

interface TranscriptSelectorProps {
  value: TranscriptSource
  onChange: (source: TranscriptSource) => void
  estimatedDuration?: number
  disabled?: boolean
}

export function TranscriptSelector(props: TranscriptSelectorProps) {
  // Radio group implementation with visual indicators
}
```
- [ ] **Add processing time estimation** (30 min); a server-side heuristic is sketched after this list
- Calculate Whisper processing time based on video duration
- Show cost/time comparison for each option
- Display clear indicators (Fast/Free vs Accurate/Slower)
- [ ] **Style and accessibility** (45 min)
- Implement with Radix UI RadioGroup
- Add proper ARIA labels and descriptions
- Visual icons and quality indicators
- Responsive design for mobile/desktop
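The estimate itself is easiest to compute server-side and hand to the component; a sketch with assumed real-time factors, to be calibrated against the Task 5.2 benchmarks:
```python
# Seconds of processing per second of audio on CPU; assumptions, not measurements.
REALTIME_FACTORS = {'tiny': 0.05, 'base': 0.10, 'small': 0.15, 'medium': 0.50}


def estimate_whisper_seconds(video_duration_seconds: float, model_size: str = 'small') -> float:
    factor = REALTIME_FACTORS.get(model_size, 0.15)
    overhead = 20.0  # model load plus audio extraction; also an assumption
    return video_duration_seconds * factor + overhead
```
With these assumed factors, a 10-minute video on the "small" model estimates to roughly 110 seconds, in line with the <2 minute benchmark below.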
**Deliverable**: TranscriptSelector component ready for integration
#### Task 4.2: Add to SummarizeForm (1 hour)
**Objective**: Integrate transcript selection into existing form
**Subtasks**:
- [ ] **Update SummarizeForm component** (30 min)
```tsx
// Add to existing form state
const [transcriptSource, setTranscriptSource] = useState<TranscriptSource>('youtube')

// Add to form submission
const handleSubmit = async (data) => {
  await processVideo({
    ...data,
    transcript_options: {
      source: transcriptSource,
      // other options
    },
  })
}
```
- [ ] **Update form validation** (15 min)
- Add transcript options to form schema
- Validate transcript source selection
- Handle form submission with new fields
- [ ] **Test integration** (15 min)
- Verify form works with new component
- Test all transcript source options
- Ensure admin page compatibility
**Deliverable**: Updated form with transcript selection
#### Task 4.3: Create TranscriptComparison Component (2 hours)
**Objective**: Side-by-side transcript quality comparison
**Subtasks**:
- [ ] **Create comparison UI** (75 min)
```tsx
interface TranscriptComparisonProps {
  youtubeTranscript: TranscriptResult
  whisperTranscript: TranscriptResult
  onSelectTranscript: (source: TranscriptSource) => void
}

export function TranscriptComparison(props: TranscriptComparisonProps) {
  // Side-by-side comparison with difference highlighting
}
```
- [ ] **Implement difference highlighting** (30 min)
- Word-level diff algorithm
- Visual indicators for additions/changes
- Quality metric displays
- [ ] **Add selection controls** (15 min)
- Buttons to choose which transcript to use for summary
- Quality score badges
- Processing time comparison
**Deliverable**: TranscriptComparison component
#### Task 4.4: Update Processing UI (1 hour)
**Objective**: Show transcript processing status and method
**Subtasks**:
- [ ] **Update ProgressTracker** (30 min)
- Add transcript source indicator
- Show different messages for Whisper vs YouTube processing
- Add estimated time remaining for Whisper
- [ ] **Update result display** (20 min)
- Show which transcript source was used
- Display quality metrics
- Add transcript comparison link if both available
- [ ] **Error handling** (10 min)
- Handle Whisper processing failures
- Show fallback notifications
- Provide retry options
**Deliverable**: Updated processing UI
### Phase 5: Testing and Integration (2 hours)
#### Task 5.1: Unit Tests (1 hour)
**Objective**: Comprehensive test coverage for new components
**Subtasks**:
- [ ] **Backend unit tests** (30 min)
```python
# backend/tests/unit/test_whisper_transcript_service.py
def test_whisper_transcription_accuracy(): ...
def test_dual_transcript_comparison(): ...
def test_automatic_fallback(): ...
```
- [ ] **Frontend unit tests** (20 min)
```tsx
// frontend/src/components/__tests__/TranscriptSelector.test.tsx
describe('TranscriptSelector', () => {
  test.todo('shows processing time estimates')
  test.todo('handles source selection')
  test.todo('displays quality indicators')
})
```
- [ ] **API endpoint tests** (10 min)
- Test dual transcript endpoint
- Test transcript option validation
- Test error handling scenarios
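A sketch of the endpoint test, assuming the FastAPI app is importable from `main` and that `get_current_user` can be bypassed with `dependency_overrides` (import paths are assumptions):
```python
from fastapi.testclient import TestClient

from main import app
from api.transcript import get_current_user  # adjust to the real module path

app.dependency_overrides[get_current_user] = lambda: {'id': 'test-user'}
client = TestClient(app)


def test_dual_endpoint_rejects_invalid_source():
    resp = client.post(
        '/api/transcripts/dual/dQw4w9WgXcQ',
        json={'source': 'invalid-source'},
    )
    assert resp.status_code == 422  # Pydantic Literal validation rejects it
```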
**Deliverable**: >80% test coverage for new code
#### Task 5.2: Integration Testing (1 hour)
**Objective**: End-to-end workflow validation
**Subtasks**:
- [ ] **YouTube vs Whisper comparison test** (20 min)
- Process same video with both methods
- Verify quality differences
- Confirm timing accuracy
- [ ] **Admin page testing** (15 min)
- Test transcript selector in admin interface
- Verify no authentication required
- Test all transcript source options
- [ ] **Error scenario testing** (15 min)
- Test unavailable YouTube captions (fallback to Whisper)
- Test Whisper processing failure
- Test long video processing (chunking)
- [ ] **Performance testing** (10 min)
- Benchmark Whisper processing times
- Test parallel processing performance
- Verify cache effectiveness
**Deliverable**: All integration scenarios passing
## Risk Mitigation Strategies
### High Risk Items and Solutions
#### 1. Whisper Processing Time (HIGH)
**Risk**: Users abandon due to slow Whisper processing
**Mitigation**:
- Clear time estimates before processing starts
- Real-time progress updates during Whisper transcription
- Option to cancel long-running operations
- Default to "small" model for speed/accuracy balance
#### 2. Resource Consumption (MEDIUM)
**Risk**: High CPU/memory usage affects system performance
**Mitigation**:
- Implement a processing queue to limit concurrent Whisper jobs (semaphore sketch after this list)
- Add resource monitoring and automatic throttling
- Use model caching to avoid reloading
- Provide CPU/GPU auto-detection
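The simplest in-process queue is an `asyncio.Semaphore` gating the Whisper path; a sketch (the limit of 2 concurrent jobs is an assumption to tune):
```python
import asyncio

MAX_CONCURRENT_WHISPER_JOBS = 2  # tune to available CPU/GPU
_whisper_slots = asyncio.Semaphore(MAX_CONCURRENT_WHISPER_JOBS)


async def run_whisper_job(transcription_coro):
    """Allow at most N Whisper transcriptions at once; extra requests wait here."""
    async with _whisper_slots:
        return await transcription_coro
```
Callers wrap the service call, e.g. `await run_whisper_job(service.transcribe_audio_file(path))`.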
#### 3. Model Download Size (MEDIUM)
**Risk**: First-time model download delays (roughly 460 MB for "small")
**Mitigation**:
- Pre-download model in Docker image
- Show download progress to user
- Graceful handling of network issues during download
- Fallback to smaller model if download fails
#### 4. Audio Quality Issues (LOW)
**Risk**: Poor audio quality reduces Whisper accuracy
**Mitigation**:
- Audio preprocessing (noise reduction, normalization); a `pydub` sketch follows this list
- Quality assessment before transcription
- Clear messaging about audio quality limitations
- Fallback to YouTube captions for poor audio
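For the preprocessing step, loudness normalization plus downmixing covers most cases; a `pydub` sketch (pydub shells out to the FFmpeg dependency already listed):
```python
from pydub import AudioSegment
from pydub.effects import normalize


def preprocess_audio(in_path: str, out_path: str) -> str:
    """Normalize loudness and convert to mono 16 kHz, the rate Whisper resamples to anyway."""
    audio = AudioSegment.from_file(in_path)
    audio = normalize(audio)  # peak-normalize loudness
    audio = audio.set_channels(1).set_frame_rate(16000)
    audio.export(out_path, format='wav')
    return out_path
```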
### Technical Debt Management
#### Dependency Management
- Pin specific Whisper version for reproducibility
- Test compatibility with torch versions
- Document system requirements (FFmpeg)
- Provide clear installation instructions
#### Code Quality
- Maintain consistent async/await patterns
- Add comprehensive logging for debugging
- Document Whisper-specific configuration
- Follow existing project patterns and conventions
## Success Criteria and Validation
### Definition of Done Checklist
- [ ] Users can select between YouTube/Whisper/Both transcript options
- [ ] Real Whisper transcription integrated from archived codebase
- [ ] Processing time estimates accurate within 20%
- [ ] Quality comparison shows meaningful differences
- [ ] Automatic fallback works when YouTube captions unavailable
- [ ] All tests pass with >80% code coverage
- [ ] Performance acceptable (<2 minutes for 10-minute video with "small" model)
- [ ] UI provides clear feedback during processing
- [ ] Database properly stores transcript metadata and quality scores
- [ ] Admin page supports new transcript options without authentication
### Acceptance Testing Scenarios
1. **Standard Use Case**: Select Whisper for technical video, confirm accuracy improvement
2. **Comparison Mode**: Use "Compare Both" option, review side-by-side differences
3. **Fallback Scenario**: Process video without YouTube captions, verify Whisper fallback
4. **Long Video**: Process 30+ minute video, confirm chunking works properly
5. **Error Handling**: Test with corrupted audio, verify graceful error handling
### Performance Benchmarks
- **YouTube Transcript**: <5 seconds processing time
- **Whisper Small**: <2 minutes for 10-minute video
- **Memory Usage**: <2GB peak during transcription
- **Model Loading**: <30 seconds first load, <5 seconds cached
- **Accuracy Improvement**: >25% fewer word errors vs YouTube captions
## Development Environment Setup
### Local Development Steps
```bash
# 1. Update backend dependencies
cd apps/youtube-summarizer/backend
pip install -r requirements.txt
# 2. Install system dependencies
# macOS
brew install ffmpeg
# Ubuntu
sudo apt-get install ffmpeg
# 3. Test Whisper installation
python -c "import whisper; model = whisper.load_model('base'); print('✅ Whisper ready')"
# 4. Run database migrations
alembic upgrade head
# 5. Start services
python main.py # Backend (port 8000)
cd ../frontend && npm run dev # Frontend (port 3002)
```
### Testing Strategy
```bash
# Unit tests
pytest backend/tests/unit/test_whisper_* -v
# Integration tests
pytest backend/tests/integration/test_dual_transcript* -v
# Frontend tests
cd frontend && npm test
# End-to-end testing
# 1. Visit http://localhost:3002/admin
# 2. Test YouTube transcript option with: https://www.youtube.com/watch?v=dQw4w9WgXcQ
# 3. Test Whisper option with same video
# 4. Compare results and processing times
```
## Timeline and Milestones
### Week 1 (Day 1-2): Backend Foundation
- **Day 1**: Tasks 1.1-1.2 (Copy TranscriptionService, update dependencies)
- **Day 2**: Task 1.3 (Replace MockWhisperService), Task 2.1 (DualTranscriptService)
### Week 1 (Day 3): API and Database
- **Day 3**: Task 2.2 (API endpoints), Task 3.1-3.2 (Database schema)
### Week 2 (Day 4): Frontend Implementation
- **Day 4**: Task 4.1-4.2 (TranscriptSelector, form integration)
### Week 2 (Day 5): Frontend Completion and Testing
- **Day 5 Morning**: Task 4.3-4.4 (TranscriptComparison, processing UI)
- **Day 5 Afternoon**: Task 5.1-5.2 (Testing, integration validation)
### Delivery Schedule
- **Day 3 EOD**: Backend MVP ready for testing
- **Day 4 EOD**: Frontend components complete
- **Day 5 EOD**: Full Story 4.1 complete and tested
## Post-Implementation Tasks
### Monitoring and Observability
- [ ] Add metrics for transcript source usage patterns
- [ ] Monitor Whisper processing times and success rates
- [ ] Track user satisfaction with transcript quality
- [ ] Log resource usage patterns for optimization
### Documentation Updates
- [ ] Update API documentation with new endpoints
- [ ] Add user guide for transcript options
- [ ] Document deployment requirements (FFmpeg, model caching)
- [ ] Update troubleshooting guide
### Future Enhancements (Epic 4.2+)
- [ ] Support for additional Whisper models (medium, large)
- [ ] Multi-language transcription support
- [ ] Custom model fine-tuning capabilities
- [ ] Speaker identification integration
- [ ] Real-time transcription for live streams
---
**Implementation Plan Owner**: Development Team
**Reviewers**: Technical Lead, Product Owner
**Status**: Ready for Implementation
**Last Updated**: 2025-08-27
This implementation plan provides a comprehensive roadmap for implementing dual transcript options, leveraging proven Whisper integration while maintaining high code quality and user experience standards.