Story 4.1: Dual Transcript Options - Implementation Plan
Overview
This implementation plan details the step-by-step approach to implement dual transcript options (YouTube + Whisper) in the YouTube Summarizer project, leveraging existing proven Whisper integration from archived projects.
Story: 4.1 - Dual Transcript Options (YouTube + Whisper)
Estimated Effort: 22 hours
Priority: High 🔥
Target Completion: 5 development days (see Timeline and Milestones)
Prerequisites Analysis
Current Codebase Status ✅
Backend Dependencies Available:
- FastAPI, Pydantic, SQLAlchemy - Core framework ready
- Anthropic API integration - AI summarization ready
- Enhanced transcript service - Architecture foundation exists
- Video download service - Audio extraction capability ready
- WebSocket infrastructure - Real-time updates ready
Frontend Dependencies Available:
- React 18, TypeScript, Vite - Modern frontend ready
- shadcn/ui, Radix UI components - UI components available
- TanStack Query - State management ready
- Admin page infrastructure - Testing environment ready
Missing Dependencies to Add:
```text
# Backend requirements additions
openai-whisper==20231117
torch>=2.0.0
librosa==0.10.1
pydub==0.25.1
soundfile==0.12.1

# System dependency (Docker/local)
ffmpeg
```
Architecture Integration Points
Existing Services to Modify:
- `backend/services/enhanced_transcript_service.py` - Replace `MockWhisperService`
- `backend/api/transcript.py` - Add dual transcript endpoints
- `frontend/src/components/forms/SummarizeForm.tsx` - Add transcript selector
- Database models - Add transcript metadata fields
New Components to Create:
- `backend/services/whisper_transcript_service.py` - Real Whisper integration
- `frontend/src/components/TranscriptSelector.tsx` - Source selection UI
- `frontend/src/components/TranscriptComparison.tsx` - Quality comparison UI
Task Breakdown with Time Estimates
Phase 1: Backend Foundation (8 hours)
Task 1.1: Copy and Adapt TranscriptionService (3 hours)
Objective: Integrate proven Whisper service from archived project
Subtasks:
- Copy source code (30 min)

  ```bash
  cp archived_projects/personal-ai-assistant-v1.1.0/src/services/transcription_service.py \
     apps/youtube-summarizer/backend/services/whisper_transcript_service.py
  ```
- Remove podcast dependencies (45 min)
  - Remove `PodcastEpisode`, `PodcastTranscript` imports
  - Remove repository dependency injection
  - Simplify constructor to not require database repository
- Adapt for YouTube context (60 min)
  - Update `transcribe_episode()` → `transcribe_audio_file()`
  - Modify segment storage to return data instead of database writes
  - Update error handling for YouTube-specific scenarios
- Add async compatibility (45 min)
  - Wrap synchronous Whisper calls via the event loop's `run_in_executor()` (a sketch follows this list)
  - Update method signatures to the async/await pattern
  - Test async integration with existing services
Deliverable: Working WhisperTranscriptService class
Task 1.2: Update Dependencies and Environment (2 hours)
Objective: Add Whisper dependencies and test environment setup
Subtasks:
- Update requirements.txt (15 min)

  ```text
  # Add to backend/requirements.txt
  openai-whisper==20231117
  torch>=2.0.0
  librosa==0.10.1
  pydub==0.25.1
  soundfile==0.12.1
  ```
- Update Docker configuration (45 min)

  ```dockerfile
  # Add to backend/Dockerfile
  RUN apt-get update && apt-get install -y ffmpeg
  RUN pip install openai-whisper torch librosa pydub soundfile
  ```
- Test Whisper model download (30 min)
  - Test "small" model download (~244 MB)
  - Verify CUDA detection works (if available)
  - Add model caching directory configuration
- Environment configuration (30 min)

  ```bash
  # Add to .env
  WHISPER_MODEL_SIZE=small
  WHISPER_DEVICE=auto
  WHISPER_MODEL_CACHE=/tmp/whisper_models
  ```
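How the service consumes these settings is left open here; one possible reading, as a sketch (`auto` is this plan's convention rather than a Whisper flag, so it is mapped to CUDA when available):

```python
import os

import torch
import whisper


def load_whisper_model():
    size = os.getenv("WHISPER_MODEL_SIZE", "small")
    device = os.getenv("WHISPER_DEVICE", "auto")
    if device == "auto":
        # Auto-detect: prefer GPU when torch sees one, else fall back to CPU
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # download_root points Whisper's weight cache at the configured directory
    cache_dir = os.getenv("WHISPER_MODEL_CACHE", "/tmp/whisper_models")
    return whisper.load_model(size, device=device, download_root=cache_dir)
```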
Deliverable: Environment ready for Whisper integration
Task 1.3: Replace MockWhisperService (3 hours)
Objective: Integrate real Whisper service into existing enhanced transcript service
Subtasks:
- Update EnhancedTranscriptService (90 min)

  ```python
  # In enhanced_transcript_service.py
  from .whisper_transcript_service import WhisperTranscriptService

  # Replace MockWhisperService instantiation
  self.whisper_service = WhisperTranscriptService(
      model_size=os.getenv('WHISPER_MODEL_SIZE', 'small')
  )
  ```
- Update dependency injection (30 min); a wiring sketch follows this list
  - Modify `main.py` service initialization
  - Update FastAPI dependency functions
  - Ensure proper service lifecycle management
- Test integration (60 min)
  - Unit test with sample audio file
  - Integration test with video download service
  - Verify transcript quality and timing
Deliverable: Working Whisper integration in existing service
Phase 2: API Enhancement (4 hours)
Task 2.1: Create Dual Transcript Service (2 hours)
Objective: Implement service for handling dual transcript extraction
Subtasks:
- Create DualTranscriptService class (60 min)

  ```python
  class DualTranscriptService(EnhancedTranscriptService):
      async def extract_dual_transcripts(self, video_id: str) -> Dict[str, TranscriptResult]:
          # Parallel processing of YouTube and Whisper
          youtube_task = self._extract_youtube_transcript(video_id)
          whisper_task = self._extract_whisper_transcript(video_id)
          results = await asyncio.gather(
              youtube_task, whisper_task, return_exceptions=True
          )
          return {'youtube': results[0], 'whisper': results[1]}
  ```
- Implement quality comparison (45 min); see the sketch after this list
  - Word-by-word accuracy comparison algorithm
  - Confidence score calculation
  - Timing precision analysis
- Add caching for dual results (15 min)
  - Cache YouTube and Whisper results separately
  - Extended TTL for Whisper results (more expensive to regenerate)
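For the word-by-word comparison, the standard library's difflib is one assumed approach (this plan does not prescribe a specific algorithm); a sketch:

```python
from difflib import SequenceMatcher


def compare_transcripts(youtube_text: str, whisper_text: str) -> dict:
    yt_words = youtube_text.lower().split()
    wh_words = whisper_text.lower().split()
    matcher = SequenceMatcher(None, yt_words, wh_words)
    return {
        # Fraction of words on which the two transcripts agree (0.0-1.0)
        "word_similarity": matcher.ratio(),
        "youtube_word_count": len(yt_words),
        "whisper_word_count": len(wh_words),
        # Differing spans; these can also drive the frontend diff highlighting
        "diff_opcodes": [op for op in matcher.get_opcodes() if op[0] != "equal"],
    }
```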
Deliverable: DualTranscriptService with parallel processing
Task 2.2: Add New API Endpoints (2 hours)
Objective: Create API endpoints for transcript source selection
Subtasks:
- Create transcript selection models (30 min)

  ```python
  class TranscriptOptionsRequest(BaseModel):
      source: Literal['youtube', 'whisper', 'both'] = 'youtube'
      whisper_model: Literal['tiny', 'base', 'small', 'medium'] = 'small'
      language: str = 'en'
      include_timestamps: bool = True
  ```
- Add dual transcript endpoint (60 min)

  ```python
  @router.post("/api/transcripts/dual/{video_id}")
  async def get_dual_transcripts(
      video_id: str,
      options: TranscriptOptionsRequest,
      current_user: User = Depends(get_current_user)
  ) -> TranscriptComparisonResponse:
      ...  # implementation
  ```
- Update existing pipeline to use transcript options (30 min)
  - Modify `SummaryPipeline` to accept transcript source preference
  - Update processing status to show transcript method
  - Add transcript quality metrics to summary result
Deliverable: New API endpoints for transcript selection
Phase 3: Database Schema Updates (2 hours)
Task 3.1: Extend Summary Model (1 hour)
Objective: Add fields for transcript metadata and quality tracking
Subtasks:
- Create database migration (30 min)

  ```sql
  ALTER TABLE summaries
    ADD COLUMN transcript_source VARCHAR(20),
    ADD COLUMN transcript_quality_score FLOAT,
    ADD COLUMN youtube_transcript TEXT,
    ADD COLUMN whisper_transcript TEXT,
    ADD COLUMN whisper_processing_time FLOAT,
    ADD COLUMN transcript_comparison_data JSON;
  ```
- Update Summary model (20 min)

  ```python
  # Add to backend/models/summary.py
  transcript_source = Column(String(20))  # 'youtube', 'whisper', 'both'
  transcript_quality_score = Column(Float)
  youtube_transcript = Column(Text)
  whisper_transcript = Column(Text)
  whisper_processing_time = Column(Float)
  transcript_comparison_data = Column(JSON)
  ```
- Update repository methods (10 min)
  - Add methods for storing dual transcript data
  - Add queries for transcript source filtering
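An illustrative helper for the source filtering (the function name is hypothetical, and `created_at` is assumed to exist on the model):

```python
from sqlalchemy.orm import Session

from models.summary import Summary


def list_summaries_by_transcript_source(db: Session, source: str, limit: int = 50):
    # Filter stored summaries by how their transcript was produced
    return (
        db.query(Summary)
        .filter(Summary.transcript_source == source)
        .order_by(Summary.created_at.desc())  # created_at assumed on the model
        .limit(limit)
        .all()
    )
```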
Deliverable: Database schema ready for dual transcripts
Task 3.2: Add Performance Indexes (1 hour)
Objective: Optimize database queries for transcript operations
Subtasks:
- Create performance indexes (30 min)

  ```sql
  CREATE INDEX idx_summaries_transcript_source ON summaries(transcript_source);
  CREATE INDEX idx_summaries_quality_score ON summaries(transcript_quality_score);
  CREATE INDEX idx_summaries_processing_time ON summaries(whisper_processing_time);
  ```
- Test query performance (20 min)
  - Verify index usage with EXPLAIN queries
  - Test filtering by transcript source
  - Benchmark query times with sample data
- Run migration and test (10 min)
  - Apply migration to development database
  - Verify all fields accessible
  - Test with sample data insertion
Deliverable: Optimized database schema
Phase 4: Frontend Implementation (6 hours)
Task 4.1: Create TranscriptSelector Component (2 hours)
Objective: UI component for transcript source selection
Subtasks:
- Create base component (45 min)

  ```tsx
  interface TranscriptSelectorProps {
    value: TranscriptSource
    onChange: (source: TranscriptSource) => void
    estimatedDuration?: number
    disabled?: boolean
  }

  export function TranscriptSelector(props: TranscriptSelectorProps) {
    // Radio group implementation with visual indicators
  }
  ```
- Add processing time estimation (30 min); an estimate sketch follows this list
  - Calculate Whisper processing time based on video duration
  - Show cost/time comparison for each option
  - Display clear indicators (Fast/Free vs Accurate/Slower)
- Style and accessibility (45 min)
  - Implement with Radix UI RadioGroup
  - Add proper ARIA labels and descriptions
  - Visual icons and quality indicators
  - Responsive design for mobile/desktop
Deliverable: TranscriptSelector component ready for integration
Task 4.2: Add to SummarizeForm (1 hour)
Objective: Integrate transcript selection into existing form
Subtasks:
- Update SummarizeForm component (30 min)

  ```tsx
  // Add to existing form state
  const [transcriptSource, setTranscriptSource] = useState<TranscriptSource>('youtube')

  // Add to form submission
  const handleSubmit = async (data) => {
    await processVideo({
      ...data,
      transcript_options: {
        source: transcriptSource,
        // other options
      }
    })
  }
  ```
- Update form validation (15 min)
  - Add transcript options to form schema
  - Validate transcript source selection
  - Handle form submission with new fields
- Test integration (15 min)
  - Verify form works with new component
  - Test all transcript source options
  - Ensure admin page compatibility
Deliverable: Updated form with transcript selection
Task 4.3: Create TranscriptComparison Component (2 hours)
Objective: Side-by-side transcript quality comparison
Subtasks:
- Create comparison UI (75 min)

  ```tsx
  interface TranscriptComparisonProps {
    youtubeTranscript: TranscriptResult
    whisperTranscript: TranscriptResult
    onSelectTranscript: (source: TranscriptSource) => void
  }

  export function TranscriptComparison(props: TranscriptComparisonProps) {
    // Side-by-side comparison with difference highlighting
  }
  ```
- Implement difference highlighting (30 min)
  - Word-level diff algorithm
  - Visual indicators for additions/changes
  - Quality metric displays
- Add selection controls (15 min)
  - Buttons to choose which transcript to use for the summary
  - Quality score badges
  - Processing time comparison
Deliverable: TranscriptComparison component
Task 4.4: Update Processing UI (1 hour)
Objective: Show transcript processing status and method
Subtasks:
- Update ProgressTracker (30 min)
  - Add transcript source indicator
  - Show different messages for Whisper vs YouTube processing
  - Add estimated time remaining for Whisper
- Update result display (20 min)
  - Show which transcript source was used
  - Display quality metrics
  - Add transcript comparison link if both available
- Error handling (10 min)
  - Handle Whisper processing failures
  - Show fallback notifications
  - Provide retry options
Deliverable: Updated processing UI
Phase 5: Testing and Integration (2 hours)
Task 5.1: Unit Tests (1 hour)
Objective: Comprehensive test coverage for new components
Subtasks:
- Backend unit tests (30 min)

  ```python
  # backend/tests/unit/test_whisper_transcript_service.py
  def test_whisper_transcription_accuracy(): ...
  def test_dual_transcript_comparison(): ...
  def test_automatic_fallback(): ...
  ```
- Frontend unit tests (20 min)

  ```tsx
  // frontend/src/components/__tests__/TranscriptSelector.test.tsx
  describe('TranscriptSelector', () => {
    test.todo('shows processing time estimates')
    test.todo('handles source selection')
    test.todo('displays quality indicators')
  })
  ```
- API endpoint tests (10 min)
  - Test dual transcript endpoint
  - Test transcript option validation
  - Test error handling scenarios
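A sketch of one such endpoint test with FastAPI's TestClient; the import paths and the `get_current_user` override are assumptions about this project's layout:

```python
from fastapi.testclient import TestClient

from main import app
from api.transcript import get_current_user  # assumed location

# Stub out auth so the test exercises validation rather than login
app.dependency_overrides[get_current_user] = lambda: {"id": "test-user"}
client = TestClient(app)


def test_rejects_invalid_transcript_source():
    # 'invalid' is outside Literal['youtube', 'whisper', 'both'], so the
    # Pydantic request model should reject the body with a 422
    resp = client.post("/api/transcripts/dual/dQw4w9WgXcQ", json={"source": "invalid"})
    assert resp.status_code == 422
```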
Deliverable: >80% test coverage for new code
Task 5.2: Integration Testing (1 hour)
Objective: End-to-end workflow validation
Subtasks:
- YouTube vs Whisper comparison test (20 min)
  - Process the same video with both methods
  - Verify quality differences
  - Confirm timing accuracy
- Admin page testing (15 min)
  - Test transcript selector in admin interface
  - Verify no authentication required
  - Test all transcript source options
- Error scenario testing (15 min)
  - Test unavailable YouTube captions (fallback to Whisper)
  - Test Whisper processing failure
  - Test long video processing (chunking)
- Performance testing (10 min)
  - Benchmark Whisper processing times
  - Test parallel processing performance
  - Verify cache effectiveness
Deliverable: All integration scenarios passing
Risk Mitigation Strategies
High Risk Items and Solutions
1. Whisper Processing Time (HIGH)
Risk: Users abandon due to slow Whisper processing
Mitigation:
- Clear time estimates before processing starts
- Real-time progress updates during Whisper transcription
- Option to cancel long-running operations
- Default to "small" model for speed/accuracy balance
2. Resource Consumption (MEDIUM)
Risk: High CPU/memory usage affects system performance
Mitigation:
- Implement a processing queue to limit concurrent Whisper jobs (see the sketch below)
- Add resource monitoring and automatic throttling
- Use model caching to avoid reloading
- Provide CPU/GPU auto-detection
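A minimal sketch of that queueing with an asyncio semaphore; the limit of 2 concurrent jobs is illustrative, not a measured value:

```python
import asyncio

MAX_CONCURRENT_WHISPER_JOBS = 2  # illustrative default, tune from load testing
_whisper_slots = asyncio.Semaphore(MAX_CONCURRENT_WHISPER_JOBS)


async def transcribe_with_limit(service, audio_path: str):
    # Additional requests wait here until a transcription slot frees up
    async with _whisper_slots:
        return await service.transcribe_audio_file(audio_path)
```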
3. Model Download Size (MEDIUM)
Risk: First-time model download delays (244MB for "small")
Mitigation:
- Pre-download model in Docker image
- Show download progress to user
- Graceful handling of network issues during download
- Fallback to smaller model if download fails
4. Audio Quality Issues (LOW)
Risk: Poor audio quality reduces Whisper accuracy
Mitigation:
- Audio preprocessing (noise reduction, normalization)
- Quality assessment before transcription
- Clear messaging about audio quality limitations
- Fallback to YouTube captions for poor audio
Technical Debt Management
Dependency Management
- Pin specific Whisper version for reproducibility
- Test compatibility with torch versions
- Document system requirements (FFmpeg)
- Provide clear installation instructions
Code Quality
- Maintain consistent async/await patterns
- Add comprehensive logging for debugging
- Document Whisper-specific configuration
- Follow existing project patterns and conventions
Success Criteria and Validation
Definition of Done Checklist
- Users can select between YouTube/Whisper/Both transcript options
- Real Whisper transcription integrated from archived codebase
- Processing time estimates accurate within 20%
- Quality comparison shows meaningful differences
- Automatic fallback works when YouTube captions unavailable
- All tests pass with >80% code coverage
- Performance acceptable (<2 minutes for 10-minute video with "small" model)
- UI provides clear feedback during processing
- Database properly stores transcript metadata and quality scores
- Admin page supports new transcript options without authentication
Acceptance Testing Scenarios
- Standard Use Case: Select Whisper for technical video, confirm accuracy improvement
- Comparison Mode: Use "Compare Both" option, review side-by-side differences
- Fallback Scenario: Process video without YouTube captions, verify Whisper fallback
- Long Video: Process 30+ minute video, confirm chunking works properly
- Error Handling: Test with corrupted audio, verify graceful error handling
Performance Benchmarks
- YouTube Transcript: <5 seconds processing time
- Whisper Small: <2 minutes for 10-minute video
- Memory Usage: <2GB peak during transcription
- Model Loading: <30 seconds first load, <5 seconds cached
- Accuracy Improvement: >25% fewer word errors vs YouTube captions
Development Environment Setup
Local Development Steps
```bash
# 1. Update backend dependencies
cd apps/youtube-summarizer/backend
pip install -r requirements.txt

# 2. Install system dependencies
# macOS
brew install ffmpeg
# Ubuntu
sudo apt-get install ffmpeg

# 3. Test Whisper installation
python -c "import whisper; model = whisper.load_model('base'); print('✅ Whisper ready')"

# 4. Run database migrations
alembic upgrade head

# 5. Start services
python main.py                 # Backend (port 8000)
cd ../frontend && npm run dev  # Frontend (port 3002)
```
Testing Strategy
```bash
# Unit tests
pytest backend/tests/unit/test_whisper_* -v

# Integration tests
pytest backend/tests/integration/test_dual_transcript* -v

# Frontend tests
cd frontend && npm test

# End-to-end testing
# 1. Visit http://localhost:3002/admin
# 2. Test YouTube transcript option with: https://www.youtube.com/watch?v=dQw4w9WgXcQ
# 3. Test Whisper option with same video
# 4. Compare results and processing times
```
Timeline and Milestones
Week 1 (Day 1-2): Backend Foundation
- Day 1: Tasks 1.1-1.2 (Copy TranscriptionService, update dependencies)
- Day 2: Task 1.3 (Replace MockWhisperService), Task 2.1 (DualTranscriptService)
Week 1 (Day 3): API and Database
- Day 3: Task 2.2 (API endpoints), Task 3.1-3.2 (Database schema)
Week 2 (Day 4): Frontend Implementation
- Day 4: Task 4.1-4.2 (TranscriptSelector, form integration)
Week 2 (Day 5): Frontend Completion and Testing
- Day 5 Morning: Task 4.3-4.4 (TranscriptComparison, processing UI)
- Day 5 Afternoon: Task 5.1-5.2 (Testing, integration validation)
Delivery Schedule
- Day 3 EOD: Backend MVP ready for testing
- Day 4 EOD: Frontend components complete
- Day 5 EOD: Full Story 4.1 complete and tested
Post-Implementation Tasks
Monitoring and Observability
- Add metrics for transcript source usage patterns
- Monitor Whisper processing times and success rates
- Track user satisfaction with transcript quality
- Log resource usage patterns for optimization
Documentation Updates
- Update API documentation with new endpoints
- Add user guide for transcript options
- Document deployment requirements (FFmpeg, model caching)
- Update troubleshooting guide
Future Enhancements (Epic 4.2+)
- Support for additional Whisper models (medium, large)
- Multi-language transcription support
- Custom model fine-tuning capabilities
- Speaker identification integration
- Real-time transcription for live streams
Implementation Plan Owner: Development Team
Reviewers: Technical Lead, Product Owner
Status: Ready for Implementation
Last Updated: 2025-08-27
This implementation plan provides a comprehensive roadmap for implementing dual transcript options, leveraging proven Whisper integration while maintaining high code quality and user experience standards.