
Story 4.1: Dual Transcript Options - Implementation Plan

Overview

This implementation plan details the step-by-step approach to implement dual transcript options (YouTube + Whisper) in the YouTube Summarizer project, leveraging existing proven Whisper integration from archived projects.

Story: 4.1 - Dual Transcript Options (YouTube + Whisper)
Estimated Effort: 22 hours
Priority: High 🔥
Target Completion: 5 development days (see Timeline and Milestones)

Prerequisites Analysis

Current Codebase Status

Backend Dependencies Available:

  • FastAPI, Pydantic, SQLAlchemy - Core framework ready
  • Anthropic API integration - AI summarization ready
  • Enhanced transcript service - Architecture foundation exists
  • Video download service - Audio extraction capability ready
  • WebSocket infrastructure - Real-time updates ready

Frontend Dependencies Available:

  • React 18, TypeScript, Vite - Modern frontend ready
  • shadcn/ui, Radix UI components - UI components available
  • TanStack Query - State management ready
  • Admin page infrastructure - Testing environment ready

Missing Dependencies to Add:

# Backend requirements additions
openai-whisper==20231117
torch>=2.0.0
librosa==0.10.1
pydub==0.25.1
soundfile==0.12.1

# System dependency (Docker/local)
ffmpeg

Architecture Integration Points

Existing Services to Modify:

  1. backend/services/enhanced_transcript_service.py - Replace MockWhisperService
  2. backend/api/transcript.py - Add dual transcript endpoints
  3. frontend/src/components/forms/SummarizeForm.tsx - Add transcript selector
  4. Database models - Add transcript metadata fields

New Components to Create:

  1. backend/services/whisper_transcript_service.py - Real Whisper integration
  2. frontend/src/components/TranscriptSelector.tsx - Source selection UI
  3. frontend/src/components/TranscriptComparison.tsx - Quality comparison UI

Task Breakdown with Time Estimates

Phase 1: Backend Foundation (8 hours)

Task 1.1: Copy and Adapt TranscriptionService (3 hours)

Objective: Integrate proven Whisper service from archived project

Subtasks:

  • Copy source code (30 min)

    cp archived_projects/personal-ai-assistant-v1.1.0/src/services/transcription_service.py \
       apps/youtube-summarizer/backend/services/whisper_transcript_service.py
    
  • Remove podcast dependencies (45 min)

    • Remove PodcastEpisode, PodcastTranscript imports
    • Remove repository dependency injection
    • Simplify constructor to not require database repository
  • Adapt for YouTube context (60 min)

    • Rename transcribe_episode() → transcribe_audio_file()
    • Modify segment storage to return data instead of database writes
    • Update error handling for YouTube-specific scenarios
  • Add async compatibility (45 min)

    • Wrap synchronous Whisper calls in asyncio.run_in_executor() (see the sketch after this list)
    • Update method signatures to async/await pattern
    • Test async integration with existing services
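
For reference, a minimal sketch of the async wrapper, assuming the adapted service lazy-loads the model and keeps blocking Whisper calls off the event loop (the class and method names follow this plan; _transcribe_sync is a hypothetical internal helper):

import asyncio
from functools import partial

import whisper


class WhisperTranscriptService:
    def __init__(self, model_size: str = "small"):
        self.model_size = model_size
        self._model = None  # lazy-loaded on first call

    def _transcribe_sync(self, audio_path: str) -> dict:
        # whisper.load_model() and model.transcribe() block; keep them in a worker thread.
        if self._model is None:
            self._model = whisper.load_model(self.model_size)
        return self._model.transcribe(audio_path)

    async def transcribe_audio_file(self, audio_path: str) -> dict:
        loop = asyncio.get_running_loop()
        # Offload the CPU-bound call to the default thread pool executor.
        return await loop.run_in_executor(None, partial(self._transcribe_sync, audio_path))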

Deliverable: Working WhisperTranscriptService class

Task 1.2: Update Dependencies and Environment (2 hours)

Objective: Add Whisper dependencies and test environment setup

Subtasks:

  • Update requirements.txt (15 min)

    # Add to backend/requirements.txt
    openai-whisper==20231117
    torch>=2.0.0
    librosa==0.10.1
    pydub==0.25.1
    soundfile==0.12.1
    
  • Update Docker configuration (45 min)

    # Add to backend/Dockerfile
    RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*
    RUN pip install --no-cache-dir openai-whisper torch librosa pydub soundfile
    
  • Test Whisper model download (30 min)

    • Test "small" model download (~244MB)
    • Verify CUDA detection works (if available)
    • Add model caching directory configuration
  • Environment configuration (30 min)

    # Add to .env
    WHISPER_MODEL_SIZE=small
    WHISPER_DEVICE=auto
    WHISPER_MODEL_CACHE=/tmp/whisper_models
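
A sketch of how these settings might be consumed at startup; whisper.load_model accepts device and download_root parameters, and torch.cuda.is_available() drives the "auto" device choice:

import os

import torch
import whisper

# Resolve WHISPER_DEVICE=auto to a concrete device at startup.
device = os.getenv("WHISPER_DEVICE", "auto")
if device == "auto":
    device = "cuda" if torch.cuda.is_available() else "cpu"

# download_root points model downloads at the configured cache directory,
# so container restarts reuse the cached weights instead of re-downloading.
model = whisper.load_model(
    os.getenv("WHISPER_MODEL_SIZE", "small"),
    device=device,
    download_root=os.getenv("WHISPER_MODEL_CACHE", "/tmp/whisper_models"),
)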
    

Deliverable: Environment ready for Whisper integration

Task 1.3: Replace MockWhisperService (3 hours)

Objective: Integrate real Whisper service into existing enhanced transcript service

Subtasks:

  • Update EnhancedTranscriptService (90 min)

    # In enhanced_transcript_service.py
    import os

    from .whisper_transcript_service import WhisperTranscriptService
    
    # Replace the MockWhisperService instantiation
    self.whisper_service = WhisperTranscriptService(
        model_size=os.getenv('WHISPER_MODEL_SIZE', 'small')
    )
    
  • Update dependency injection (30 min)

    • Modify main.py service initialization
    • Update FastAPI dependency functions
    • Ensure proper service lifecycle management
  • Test integration (60 min)

    • Unit test with sample audio file
    • Integration test with video download service
    • Verify transcript quality and timing

Deliverable: Working Whisper integration in existing service

Phase 2: API Enhancement (4 hours)

Task 2.1: Create Dual Transcript Service (2 hours)

Objective: Implement service for handling dual transcript extraction

Subtasks:

  • Create DualTranscriptService class (60 min)

    import asyncio
    from typing import Dict

    class DualTranscriptService(EnhancedTranscriptService):
        async def extract_dual_transcripts(self, video_id: str) -> Dict[str, TranscriptResult]:
            # Run YouTube and Whisper extraction in parallel
            youtube_task = self._extract_youtube_transcript(video_id)
            whisper_task = self._extract_whisper_transcript(video_id)
    
            # With return_exceptions=True, a failed source comes back as an
            # Exception object instead of cancelling the other task.
            results = await asyncio.gather(
                youtube_task, whisper_task, return_exceptions=True
            )
            return {'youtube': results[0], 'whisper': results[1]}
    
  • Implement quality comparison (45 min; see the sketch after this list)

    • Word-by-word accuracy comparison algorithm
    • Confidence score calculation
    • Timing precision analysis
  • Add caching for dual results (15 min)

    • Cache YouTube and Whisper results separately
    • Extended TTL for Whisper (more expensive to regenerate)
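
As a starting point for the word-level comparison, a minimal similarity sketch using difflib from the standard library (the production algorithm may be more sophisticated):

from difflib import SequenceMatcher


def word_similarity(youtube_text: str, whisper_text: str) -> float:
    """Rough word-level agreement between the two transcripts (0.0 to 1.0)."""
    yt_words = youtube_text.lower().split()
    wh_words = whisper_text.lower().split()
    if not yt_words and not wh_words:
        return 1.0
    # ratio() counts matching words relative to the combined length.
    return SequenceMatcher(None, yt_words, wh_words).ratio()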

Deliverable: DualTranscriptService with parallel processing

Task 2.2: Add New API Endpoints (2 hours)

Objective: Create API endpoints for transcript source selection

Subtasks:

  • Create transcript selection models (30 min)

    from typing import Literal

    from pydantic import BaseModel

    class TranscriptOptionsRequest(BaseModel):
        source: Literal['youtube', 'whisper', 'both'] = 'youtube'
        whisper_model: Literal['tiny', 'base', 'small', 'medium'] = 'small'
        language: str = 'en'
        include_timestamps: bool = True
    
  • Add dual transcript endpoint (60 min; response model sketched after this list)

    @router.post("/api/transcripts/dual/{video_id}")
    async def get_dual_transcripts(
        video_id: str,
        options: TranscriptOptionsRequest,
        current_user: User = Depends(get_current_user)
    ) -> TranscriptComparisonResponse:
        # Implementation
        ...
    
  • Update existing pipeline to use transcript options (30 min)

    • Modify SummaryPipeline to accept transcript source preference
    • Update processing status to show transcript method
    • Add transcript quality metrics to summary result
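
The endpoint above returns a TranscriptComparisonResponse, which this plan does not define elsewhere; one possible shape, with illustrative field names:

from typing import Optional

from pydantic import BaseModel


class TranscriptResult(BaseModel):
    source: str                            # 'youtube' or 'whisper'
    text: str
    processing_time: float                 # seconds
    quality_score: Optional[float] = None


class TranscriptComparisonResponse(BaseModel):
    video_id: str
    youtube: Optional[TranscriptResult] = None   # None if captions unavailable
    whisper: Optional[TranscriptResult] = None   # None if not requested or failed
    similarity: Optional[float] = None           # word-level agreement, 0.0 to 1.0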

Deliverable: New API endpoints for transcript selection

Phase 3: Database Schema Updates (2 hours)

Task 3.1: Extend Summary Model (1 hour)

Objective: Add fields for transcript metadata and quality tracking

Subtasks:

  • Create database migration (30 min)

    ALTER TABLE summaries 
    ADD COLUMN transcript_source VARCHAR(20),
    ADD COLUMN transcript_quality_score FLOAT,
    ADD COLUMN youtube_transcript TEXT,
    ADD COLUMN whisper_transcript TEXT,
    ADD COLUMN whisper_processing_time FLOAT,
    ADD COLUMN transcript_comparison_data JSON;
    
  • Update Summary model (20 min)

    # Add to backend/models/summary.py
    transcript_source = Column(String(20))  # 'youtube', 'whisper', 'both'
    transcript_quality_score = Column(Float)
    youtube_transcript = Column(Text)
    whisper_transcript = Column(Text)
    whisper_processing_time = Column(Float)
    transcript_comparison_data = Column(JSON)
    
  • Update repository methods (10 min; sketch after this list)

    • Add methods for storing dual transcript data
    • Add queries for transcript source filtering
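
A sketch of the storage method, assuming the Summary model fields added above and a SQLAlchemy session (the function name and data-dict keys are illustrative):

from sqlalchemy.orm import Session


def save_dual_transcripts(session: Session, summary_id: int, data: dict) -> None:
    """Persist dual transcript results onto an existing Summary row."""
    summary = session.get(Summary, summary_id)
    summary.transcript_source = data["source"]            # 'youtube', 'whisper', 'both'
    summary.youtube_transcript = data.get("youtube_text")
    summary.whisper_transcript = data.get("whisper_text")
    summary.transcript_quality_score = data.get("quality_score")
    summary.whisper_processing_time = data.get("whisper_time")
    summary.transcript_comparison_data = data.get("comparison")
    session.commit()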

Deliverable: Database schema ready for dual transcripts

Task 3.2: Add Performance Indexes (1 hour)

Objective: Optimize database queries for transcript operations

Subtasks:

  • Create performance indexes (30 min)

    CREATE INDEX idx_summaries_transcript_source ON summaries(transcript_source);
    CREATE INDEX idx_summaries_quality_score ON summaries(transcript_quality_score);
    CREATE INDEX idx_summaries_processing_time ON summaries(whisper_processing_time);
    
  • Test query performance (20 min)

    • Verify index usage with EXPLAIN queries
    • Test filtering by transcript source
    • Benchmark query times with sample data
  • Run migration and test (10 min)

    • Apply migration to development database
    • Verify all fields accessible
    • Test with sample data insertion

Deliverable: Optimized database schema

Phase 4: Frontend Implementation (6 hours)

Task 4.1: Create TranscriptSelector Component (2 hours)

Objective: UI component for transcript source selection

Subtasks:

  • Create base component (45 min)

    interface TranscriptSelectorProps {
      value: TranscriptSource
      onChange: (source: TranscriptSource) => void
      estimatedDuration?: number
      disabled?: boolean
    }
    
    export function TranscriptSelector({ value, onChange, estimatedDuration, disabled }: TranscriptSelectorProps) {
      // Radio group implementation with visual indicators
    }
    
  • Add processing time estimation (30 min)

    • Calculate Whisper processing time based on video duration (see the sketch after this list)
    • Show cost/time comparison for each option
    • Display clear indicators (Fast/Free vs Accurate/Slower)
  • Style and accessibility (45 min)

    • Implement with Radix UI RadioGroup
    • Add proper ARIA labels and descriptions
    • Visual icons and quality indicators
    • Responsive design for mobile/desktop
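
The time estimate itself could be computed server-side and fed to this component; a rough sketch, where the real-time factors are placeholders to calibrate against actual benchmarks, not measured values:

# Placeholder real-time factors: transcription seconds per second of audio.
# Calibrate these against real benchmarks before relying on the estimates.
RTF_BY_MODEL = {"tiny": 0.05, "base": 0.08, "small": 0.15, "medium": 0.4}

MODEL_LOAD_OVERHEAD_S = 30  # first-load cost; near zero once the model is cached


def estimate_whisper_seconds(video_duration_s: float, model: str = "small") -> float:
    """Estimate Whisper processing time for display in the TranscriptSelector."""
    return MODEL_LOAD_OVERHEAD_S + video_duration_s * RTF_BY_MODEL.get(model, 0.15)

With these placeholder factors, a 10-minute video on "small" estimates at about 2 minutes, in line with the performance benchmarks below.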

Deliverable: TranscriptSelector component ready for integration

Task 4.2: Add to SummarizeForm (1 hour)

Objective: Integrate transcript selection into existing form

Subtasks:

  • Update SummarizeForm component (30 min)

    // Add to existing form state
    const [transcriptSource, setTranscriptSource] = useState<TranscriptSource>('youtube')
    
    // Add to form submission
    const handleSubmit = async (data) => {
      await processVideo({
        ...data,
        transcript_options: {
          source: transcriptSource,
          // other options
        }
      })
    }
    
  • Update form validation (15 min)

    • Add transcript options to form schema
    • Validate transcript source selection
    • Handle form submission with new fields
  • Test integration (15 min)

    • Verify form works with new component
    • Test all transcript source options
    • Ensure admin page compatibility

Deliverable: Updated form with transcript selection

Task 4.3: Create TranscriptComparison Component (2 hours)

Objective: Side-by-side transcript quality comparison

Subtasks:

  • Create comparison UI (75 min)

    interface TranscriptComparisonProps {
      youtubeTranscript: TranscriptResult
      whisperTranscript: TranscriptResult
      onSelectTranscript: (source: TranscriptSource) => void
    }
    
    export function TranscriptComparison({ youtubeTranscript, whisperTranscript, onSelectTranscript }: TranscriptComparisonProps) {
      // Side-by-side comparison with difference highlighting
    }
    
  • Implement difference highlighting (30 min)

    • Word-level diff algorithm
    • Visual indicators for additions/changes
    • Quality metric displays
  • Add selection controls (15 min)

    • Buttons to choose which transcript to use for summary
    • Quality score badges
    • Processing time comparison

Deliverable: TranscriptComparison component

Task 4.4: Update Processing UI (1 hour)

Objective: Show transcript processing status and method

Subtasks:

  • Update ProgressTracker (30 min)

    • Add transcript source indicator
    • Show different messages for Whisper vs YouTube processing
    • Add estimated time remaining for Whisper
  • Update result display (20 min)

    • Show which transcript source was used
    • Display quality metrics
    • Add transcript comparison link if both available
  • Error handling (10 min)

    • Handle Whisper processing failures
    • Show fallback notifications
    • Provide retry options

Deliverable: Updated processing UI

Phase 5: Testing and Integration (2 hours)

Task 5.1: Unit Tests (1 hour)

Objective: Comprehensive test coverage for new components

Subtasks:

  • Backend unit tests (30 min)

    # backend/tests/unit/test_whisper_transcript_service.py
    def test_whisper_transcription_accuracy(): ...
    def test_dual_transcript_comparison(): ...
    def test_automatic_fallback(): ...
    
  • Frontend unit tests (20 min)

    // frontend/src/components/__tests__/TranscriptSelector.test.tsx
    describe('TranscriptSelector', () => {
      test.todo('shows processing time estimates')
      test.todo('handles source selection')
      test.todo('displays quality indicators')
    })
    
  • API endpoint tests (10 min)

    • Test dual transcript endpoint
    • Test transcript option validation
    • Test error handling scenarios

Deliverable: >80% test coverage for new code

Task 5.2: Integration Testing (1 hour)

Objective: End-to-end workflow validation

Subtasks:

  • YouTube vs Whisper comparison test (20 min)

    • Process same video with both methods
    • Verify quality differences
    • Confirm timing accuracy
  • Admin page testing (15 min)

    • Test transcript selector in admin interface
    • Verify no authentication required
    • Test all transcript source options
  • Error scenario testing (15 min)

    • Test unavailable YouTube captions (fallback to Whisper)
    • Test Whisper processing failure
    • Test long video processing (chunking)
  • Performance testing (10 min)

    • Benchmark Whisper processing times
    • Test parallel processing performance
    • Verify cache effectiveness

Deliverable: All integration scenarios passing

Risk Mitigation Strategies

High Risk Items and Solutions

1. Whisper Processing Time (HIGH)

Risk: Users abandon due to slow Whisper processing
Mitigation:

  • Clear time estimates before processing starts
  • Real-time progress updates during Whisper transcription
  • Option to cancel long-running operations
  • Default to "small" model for speed/accuracy balance

2. Resource Consumption (MEDIUM)

Risk: High CPU/memory usage affects system performance
Mitigation:

  • Implement a processing queue to limit concurrent Whisper jobs (see the sketch after this list)
  • Add resource monitoring and automatic throttling
  • Use model caching to avoid reloading
  • Provide CPU/GPU auto-detection
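
A minimal sketch of that queue using an asyncio.Semaphore shared across requests (the limit of two concurrent jobs is an assumption to tune):

import asyncio

# At most two Whisper transcriptions run at once; further requests wait here.
WHISPER_SEMAPHORE = asyncio.Semaphore(2)


async def transcribe_with_limit(service, audio_path: str) -> dict:
    async with WHISPER_SEMAPHORE:
        return await service.transcribe_audio_file(audio_path)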

3. Model Download Size (MEDIUM)

Risk: First-time model download delays (~461 MB for "small")
Mitigation:

  • Pre-download model in Docker image
  • Show download progress to user
  • Graceful handling of network issues during download
  • Fallback to smaller model if download fails

4. Audio Quality Issues (LOW)

Risk: Poor audio quality reduces Whisper accuracy
Mitigation:

  • Audio preprocessing (noise reduction, normalization)
  • Quality assessment before transcription
  • Clear messaging about audio quality limitations
  • Fallback to YouTube captions for poor audio

Technical Debt Management

Dependency Management

  • Pin specific Whisper version for reproducibility
  • Test compatibility with torch versions
  • Document system requirements (FFmpeg)
  • Provide clear installation instructions

Code Quality

  • Maintain consistent async/await patterns
  • Add comprehensive logging for debugging
  • Document Whisper-specific configuration
  • Follow existing project patterns and conventions

Success Criteria and Validation

Definition of Done Checklist

  • Users can select between YouTube/Whisper/Both transcript options
  • Real Whisper transcription integrated from archived codebase
  • Processing time estimates accurate within 20%
  • Quality comparison shows meaningful differences
  • Automatic fallback works when YouTube captions unavailable
  • All tests pass with >80% code coverage
  • Performance acceptable (<2 minutes for 10-minute video with "small" model)
  • UI provides clear feedback during processing
  • Database properly stores transcript metadata and quality scores
  • Admin page supports new transcript options without authentication

Acceptance Testing Scenarios

  1. Standard Use Case: Select Whisper for technical video, confirm accuracy improvement
  2. Comparison Mode: Use "Compare Both" option, review side-by-side differences
  3. Fallback Scenario: Process video without YouTube captions, verify Whisper fallback
  4. Long Video: Process 30+ minute video, confirm chunking works properly
  5. Error Handling: Test with corrupted audio, verify graceful error handling

Performance Benchmarks

  • YouTube Transcript: <5 seconds processing time
  • Whisper Small: <2 minutes for 10-minute video
  • Memory Usage: <2GB peak during transcription
  • Model Loading: <30 seconds first load, <5 seconds cached
  • Accuracy Improvement: >25% fewer word errors vs YouTube captions

Development Environment Setup

Local Development Steps

# 1. Update backend dependencies
cd apps/youtube-summarizer/backend
pip install -r requirements.txt

# 2. Install system dependencies  
# macOS
brew install ffmpeg
# Ubuntu
sudo apt-get install ffmpeg

# 3. Test Whisper installation
python -c "import whisper; model = whisper.load_model('base'); print('✅ Whisper ready')"

# 4. Run database migrations
alembic upgrade head

# 5. Start services
python main.py  # Backend (port 8000)
cd ../frontend && npm run dev  # Frontend (port 3002)

Testing Strategy

# Unit tests
pytest backend/tests/unit/test_whisper_* -v

# Integration tests  
pytest backend/tests/integration/test_dual_transcript* -v

# Frontend tests
cd frontend && npm test

# End-to-end testing
# 1. Visit http://localhost:3002/admin
# 2. Test YouTube transcript option with: https://www.youtube.com/watch?v=dQw4w9WgXcQ
# 3. Test Whisper option with same video
# 4. Compare results and processing times

Timeline and Milestones

Week 1 (Day 1-2): Backend Foundation

  • Day 1: Tasks 1.1-1.2 (Copy TranscriptionService, update dependencies)
  • Day 2: Task 1.3 (Replace MockWhisperService), Task 2.1 (DualTranscriptService)

Week 1 (Day 3): API and Database

  • Day 3: Task 2.2 (API endpoints), Task 3.1-3.2 (Database schema)

Week 2 (Day 4): Frontend Implementation

  • Day 4: Task 4.1-4.2 (TranscriptSelector, form integration)

Week 2 (Day 5): Frontend Completion and Testing

  • Day 5 Morning: Task 4.3-4.4 (TranscriptComparison, processing UI)
  • Day 5 Afternoon: Task 5.1-5.2 (Testing, integration validation)

Delivery Schedule

  • Day 3 EOD: Backend MVP ready for testing
  • Day 4 EOD: Frontend components complete
  • Day 5 EOD: Full Story 4.1 complete and tested

Post-Implementation Tasks

Monitoring and Observability

  • Add metrics for transcript source usage patterns
  • Monitor Whisper processing times and success rates
  • Track user satisfaction with transcript quality
  • Log resource usage patterns for optimization

Documentation Updates

  • Update API documentation with new endpoints
  • Add user guide for transcript options
  • Document deployment requirements (FFmpeg, model caching)
  • Update troubleshooting guide

Future Enhancements (Epic 4.2+)

  • Support for additional Whisper models (medium, large)
  • Multi-language transcription support
  • Custom model fine-tuning capabilities
  • Speaker identification integration
  • Real-time transcription for live streams

Implementation Plan Owner: Development Team
Reviewers: Technical Lead, Product Owner
Status: Ready for Implementation
Last Updated: 2025-08-27

This implementation plan provides a comprehensive roadmap for implementing dual transcript options, leveraging proven Whisper integration while maintaining high code quality and user experience standards.