CLAUDE.md - YouTube Summarizer Backend

This file provides guidance to Claude Code when working with the YouTube Summarizer backend services.

Backend Architecture Overview

The backend is built with FastAPI and follows a clean architecture pattern with clear separation of concerns:

backend/
├── api/                    # API endpoints and request/response models
├── services/              # Business logic and external integrations
├── models/                # Data models and database schemas
├── core/                  # Core utilities, exceptions, and configurations
└── tests/                 # Unit and integration tests

Key Services and Components

Authentication System (Story 3.1 - COMPLETE)

Architecture: Production-ready JWT-based authentication with Database Registry singleton pattern

AuthService (services/auth_service.py)

  • JWT token generation and validation (access + refresh tokens)
  • Password hashing with bcrypt and strength validation
  • User registration with email verification workflow
  • Password reset with secure token generation
  • Session management and token refresh logic
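To make the token flow concrete, here is a minimal stdlib sketch of HS256 JWT creation. This is illustrative only — the real AuthService uses its own library and signature; `create_access_token`'s parameters here are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    # JWT uses unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_access_token(user_id: str, secret: str, expires_in: int = 900) -> str:
    """Build a signed HS256 JWT; the real AuthService API may differ."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": user_id, "exp": int(time.time()) + expires_in}).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = _b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"

token = create_access_token("user-123", "dev-secret")
```

Refresh tokens follow the same shape with a longer `exp` and are persisted (see the RefreshToken model below) so they can be revoked.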

Database Registry Pattern (core/database_registry.py)

  • CRITICAL FIX: Resolves SQLAlchemy "Multiple classes found for path" errors
  • Singleton pattern ensuring single Base instance across application
  • Automatic model registration preventing table redefinition conflicts
  • Thread-safe model management with registry cleanup for testing
  • Production-ready architecture preventing relationship resolver issues

Authentication Models (models/user.py)

  • User, RefreshToken, APIKey, EmailVerificationToken, PasswordResetToken
  • Fully qualified relationship paths preventing SQLAlchemy conflicts
  • String UUID fields for SQLite compatibility
  • Proper model inheritance using Database Registry Base

Authentication API (api/auth.py)

  • Complete endpoint coverage: register, login, logout, refresh, verify email, reset password
  • Comprehensive input validation and error handling
  • Protected route dependencies and middleware
  • Async/await patterns throughout

Dual Transcript Services NEW

DualTranscriptService (services/dual_transcript_service.py)

  • Orchestrates between YouTube captions and Whisper AI transcription
  • Supports three extraction modes: youtube, whisper, both
  • Parallel processing for comparison mode with real-time progress updates
  • Advanced quality comparison with punctuation/capitalization analysis
  • Processing time estimation and intelligent recommendation engine
  • Seamless integration with existing TranscriptService
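The quality-comparison idea can be sketched with a simple heuristic: reward punctuation density and sentence capitalization, then recommend the higher-scoring source. The real DualTranscriptService scoring is more involved; this is a hypothetical reduction.

```python
def quality_score(text: str) -> float:
    """Score 0..1 rewarding punctuation density and sentence capitalization."""
    if not text:
        return 0.0
    words = text.split()
    punct = sum(text.count(c) for c in ".,!?;:")
    punct_ratio = min(punct / max(len(words), 1) * 5, 1.0)  # ~1 mark per 5 words saturates
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    capitalized = sum(1 for s in sentences if s[0].isupper())
    cap_ratio = capitalized / max(len(sentences), 1)
    return round(0.5 * punct_ratio + 0.5 * cap_ratio, 3)

def recommend(youtube_text: str, whisper_text: str) -> str:
    # Prefer Whisper only when its formatting quality is strictly better
    return "whisper" if quality_score(whisper_text) > quality_score(youtube_text) else "youtube"
```

Auto-generated captions typically lack punctuation and capitalization, which is why this kind of comparison tends to favor Whisper output for readability.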

FasterWhisperTranscriptService (services/faster_whisper_transcript_service.py) UPGRADED

  • 20-32x Speed Improvement: Powered by faster-whisper (CTranslate2 optimization engine)
  • Large-v3-Turbo Model: Best accuracy/speed balance with advanced AI capabilities
  • Intelligent Optimizations: Voice Activity Detection (VAD), int8 quantization, GPU acceleration
  • Native MP3 Support: No audio conversion needed, direct processing
  • Advanced Configuration: Fully configurable via VideoDownloadConfig with environment variables
  • Production Features: Async processing, intelligent chunking, comprehensive metadata
  • Performance Metrics: Real-time speed ratios, processing time tracking, quality scoring

Core Pipeline Services

IntelligentVideoDownloader (services/intelligent_video_downloader.py) NEW

  • 9-Tier Transcript Extraction Fallback Chain:
    1. YouTube Transcript API - Primary method using official API
    2. Auto-generated Captions - YouTube's automatic captions fallback
    3. Whisper AI Transcription - OpenAI Whisper for high-quality audio transcription
    4. PyTubeFix Downloader - Alternative YouTube library
    5. YT-DLP Downloader - Robust video/audio extraction tool
    6. Playwright Browser - Browser automation for JavaScript-rendered content
    7. External Tools - 4K Video Downloader CLI integration
    8. Web Services - Third-party transcript API services
    9. Transcript-Only - Metadata without full transcript as final fallback
  • Audio Retention System for re-transcription capability
  • Intelligent method selection based on success rates
  • Comprehensive error handling with detailed logging
  • Performance telemetry and health monitoring
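The tiered fallback can be sketched as a loop that tries each extraction method in order, logs failures, and falls through to the next tier. Method names below are placeholders, not the downloader's real internals.

```python
import logging

logger = logging.getLogger("backend")

def extract_with_fallback(video_url: str, methods: list) -> dict:
    """Try each tier in order; return the first non-empty result."""
    errors = []
    for tier, method in enumerate(methods, start=1):
        try:
            result = method(video_url)
            if result:
                return {"tier": tier, "method": method.__name__, "result": result}
        except Exception as exc:
            errors.append((tier, method.__name__, str(exc)))
            logger.warning("Tier %d (%s) failed: %s", tier, method.__name__, exc)
    raise RuntimeError(f"All {len(methods)} tiers failed: {errors}")
```

Tracking which tier succeeded per video is what enables the success-rate-based method selection mentioned above.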

SummaryPipeline (services/summary_pipeline.py)

  • Main orchestration service for end-to-end video processing
  • 7-stage async pipeline: URL validation → metadata extraction → transcript → analysis → summarization → quality validation → completion
  • Integrates with IntelligentVideoDownloader for robust transcript extraction
  • Intelligent content analysis and configuration optimization
  • Real-time progress tracking via WebSocket
  • Automatic retry logic with exponential backoff
  • Quality scoring and validation system
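A toy version of the 7-stage flow: run stages in order and report fractional progress after each. The real SummaryPipeline adds caching, retries with backoff, and WebSocket broadcasting; this sketch only shows the orchestration shape.

```python
import asyncio

STAGES = ["url_validation", "metadata", "transcript", "analysis",
          "summarization", "quality_validation", "completion"]

async def run_pipeline(video_url: str, on_progress=None) -> dict:
    """Run each stage in order, invoking on_progress(stage, fraction) after each."""
    state = {"video_url": video_url}
    for i, stage in enumerate(STAGES, start=1):
        await asyncio.sleep(0)  # stand-in for the real async work per stage
        state[stage] = "done"
        if on_progress:
            on_progress(stage, i / len(STAGES))
    return state
```

In the real service the `on_progress` hook is where WebSocket updates are broadcast to clients watching the job.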

AnthropicSummarizer (services/anthropic_summarizer.py)

  • AI service integration using Claude 3.5 Haiku for cost efficiency
  • Structured JSON output with fallback text parsing
  • Token counting and cost estimation
  • Intelligent chunking for long transcripts (up to 200k context)
  • Comprehensive error handling and retry logic
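The chunking idea can be sketched by splitting on sentence boundaries under an approximate token budget. The ~4 characters-per-token ratio is a common rule of thumb, not Anthropic's tokenizer, and the budget below is an assumed default.

```python
def chunk_transcript(text: str, max_tokens: int = 180_000, chars_per_token: int = 4) -> list[str]:
    """Split a transcript into chunks that each fit an approximate token budget."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return [text]
    chunks, current, size = [], [], 0
    for sentence in text.split(". "):
        piece = sentence + ". "
        if size + len(piece) > budget and current:
            chunks.append("".join(current).strip())
            current, size = [], 0
        current.append(piece)
        size += len(piece)
    if current:
        chunks.append("".join(current).strip())
    return chunks
```

Each chunk would be summarized independently and the partial summaries merged in a final pass; exact token counting should use the provider's own counting API when available.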

CacheManager (services/cache_manager.py)

  • Multi-level caching for pipeline results, transcripts, and metadata
  • TTL-based expiration with automatic cleanup
  • Redis-ready architecture for production scaling
  • Configurable cache keys with collision prevention

WebSocketManager (core/websocket_manager.py)

  • Singleton pattern for WebSocket connection management
  • Job-specific connection tracking and broadcasting
  • Real-time progress updates and completion notifications
  • Heartbeat mechanism and stale connection cleanup
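The singleton plus job-scoped broadcast pattern might look like the following sketch, where a stand-in `send_json` coroutine takes the place of real WebSocket objects (names here are hypothetical, not the actual WebSocketManager API):

```python
import asyncio

class JobConnectionManager:
    _instance = None  # process-wide singleton, mirroring the pattern above

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.connections = {}  # job_id -> set of websockets
        return cls._instance

    def connect(self, job_id: str, ws):
        self.connections.setdefault(job_id, set()).add(ws)

    def disconnect(self, job_id: str, ws):
        self.connections.get(job_id, set()).discard(ws)

    async def broadcast(self, job_id: str, message: dict):
        # Copy the set so a disconnect during iteration is safe
        for ws in list(self.connections.get(job_id, [])):
            await ws.send_json(message)
```

Scoping connections by `job_id` means a progress update only reaches clients watching that job, not every open socket.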

NotificationService (services/notification_service.py)

  • Multi-type notifications (completion, error, progress, system)
  • Notification history and statistics tracking
  • Email/webhook integration ready architecture
  • Configurable filtering and management

API Layer

Pipeline API (api/pipeline.py)

  • Complete pipeline management endpoints
  • Process video with configuration options
  • Status monitoring and job history
  • Pipeline cancellation and cleanup
  • Health checks and system statistics

Summarization API (api/summarization.py)

  • Direct AI summarization endpoints
  • Sync and async processing options
  • Cost estimation and validation
  • Background job management

Dual Transcript API (api/transcripts.py) NEW

  • POST /api/transcripts/dual/extract - Start dual transcript extraction
  • GET /api/transcripts/dual/jobs/{job_id} - Monitor extraction progress
  • POST /api/transcripts/dual/estimate - Get processing time estimates
  • GET /api/transcripts/dual/compare/{video_id} - Force comparison analysis
  • Background job processing with real-time progress updates
  • Extracts from YouTube captions, Whisper AI, or both sources simultaneously

Development Patterns

Service Dependency Injection

from fastapi import Depends

def get_summary_pipeline(
    video_service: VideoService = Depends(get_video_service),
    transcript_service: TranscriptService = Depends(get_transcript_service),
    ai_service: AnthropicSummarizer = Depends(get_ai_service),
    cache_manager: CacheManager = Depends(get_cache_manager),
    notification_service: NotificationService = Depends(get_notification_service)
) -> SummaryPipeline:
    return SummaryPipeline(...)

Database Registry Pattern (CRITICAL ARCHITECTURE)

Problem Solved: SQLAlchemy "Multiple classes found for path" relationship resolver errors

# Always use the registry for model creation
from backend.core.database_registry import registry
from backend.models.base import Model

# Models inherit from Model (which uses registry.Base)
class User(Model):
    __tablename__ = "users"
    # Use fully qualified relationship paths to prevent conflicts
    summaries = relationship("backend.models.summary.Summary", back_populates="user")

# Registry ensures single Base instance and safe model registration
registry.create_all_tables(engine)  # For table creation
registry.register_model(ModelClass)  # Automatic via BaseModel mixin

Key Benefits:

  • Prevents SQLAlchemy table redefinition conflicts
  • Thread-safe singleton pattern
  • Automatic model registration and deduplication
  • Production-ready architecture
  • Clean testing with registry reset capabilities

Authentication Pattern

# Protected endpoint with user dependency
@router.post("/api/protected")
async def protected_endpoint(
    current_user: User = Depends(get_current_user),
    db: Session = Depends(get_db)
):
    return {"user_id": current_user.id}

# JWT token validation and refresh
from backend.services.auth_service import AuthService
auth_service = AuthService()
user = await auth_service.authenticate_user(email, password)
tokens = auth_service.create_access_token(user)

Async Pipeline Pattern

async def process_video(self, video_url: str, config: PipelineConfig = None) -> str:
    job_id = str(uuid.uuid4())
    result = PipelineResult(job_id=job_id, video_url=video_url, ...)
    self.active_jobs[job_id] = result
    
    # Start background processing
    asyncio.create_task(self._execute_pipeline(job_id, config))
    return job_id

Error Handling Pattern

try:
    result = await self.ai_service.generate_summary(request)
except AIServiceError as e:
    raise HTTPException(status_code=500, detail={
        "error": "AI service error",
        "message": e.message,
        "code": e.error_code
    })

Configuration and Environment

Required Environment Variables

# Core Services
ANTHROPIC_API_KEY=sk-ant-...           # Required for AI summarization
YOUTUBE_API_KEY=AIza...                # YouTube Data API v3 key
GOOGLE_API_KEY=AIza...                 # Google/Gemini API key

# Feature Flags
USE_MOCK_SERVICES=false                # Disable mock services
ENABLE_REAL_TRANSCRIPT_EXTRACTION=true # Enable real transcript extraction

# Video Download & Storage Configuration
VIDEO_DOWNLOAD_STORAGE_PATH=./video_storage     # Base storage directory
VIDEO_DOWNLOAD_KEEP_AUDIO_FILES=true           # Save audio for re-transcription
VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS=30           # Audio retention period
VIDEO_DOWNLOAD_MAX_STORAGE_GB=10               # Storage limit

# Faster-Whisper Configuration (20-32x Speed Improvement)
VIDEO_DOWNLOAD_WHISPER_MODEL=large-v3-turbo    # Model: 'large-v3-turbo', 'large-v3', 'medium', 'small', 'base'
VIDEO_DOWNLOAD_WHISPER_DEVICE=auto             # Device: 'auto', 'cpu', 'cuda'
VIDEO_DOWNLOAD_WHISPER_COMPUTE_TYPE=auto       # Compute: 'auto', 'int8', 'float16', 'float32'
VIDEO_DOWNLOAD_WHISPER_BEAM_SIZE=5             # Beam search size (1-10, higher = better quality)
VIDEO_DOWNLOAD_WHISPER_VAD_FILTER=true         # Voice Activity Detection (efficiency)
VIDEO_DOWNLOAD_WHISPER_WORD_TIMESTAMPS=true    # Word-level timestamps
VIDEO_DOWNLOAD_WHISPER_TEMPERATURE=0.0         # Sampling temperature (0 = deterministic)
VIDEO_DOWNLOAD_WHISPER_BEST_OF=5               # Number of candidates when sampling

# Dependencies: faster-whisper automatically handles dependencies
# pip install faster-whisper torch pydub yt-dlp pytubefix
# GPU acceleration: CUDA automatically detected and used when available

# Optional Configuration
DATABASE_URL=sqlite:///./data/app.db   # Database connection
REDIS_URL=redis://localhost:6379/0    # Cache backend (optional)
LOG_LEVEL=INFO                         # Logging level
CORS_ORIGINS=http://localhost:3000     # Frontend origins
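An illustrative shape for how these variables might be modeled in `core/config.py` (field names are assumptions, not the project's actual Settings class); `extra = "ignore"` is what prevents the "Extra inputs are not permitted" validation error covered in Troubleshooting:

```python
from pydantic_settings import BaseSettings  # pydantic v1: from pydantic import BaseSettings

class Settings(BaseSettings):
    anthropic_api_key: str
    youtube_api_key: str = ""
    database_url: str = "sqlite:///./data/app.db"
    log_level: str = "INFO"
    use_mock_services: bool = False

    class Config:
        env_file = ".env"
        extra = "ignore"  # tolerate unmodeled VIDEO_DOWNLOAD_* variables
```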

Service Configuration

Services are configured through dependency injection with sensible defaults:

# Cost-optimized AI model
ai_service = AnthropicSummarizer(
    api_key=api_key, 
    model="claude-3-5-haiku-20241022"  # Cost-effective choice
)

# Cache with TTL
cache_manager = CacheManager(default_ttl=3600)  # 1 hour default

# Pipeline with retry logic
config = PipelineConfig(
    summary_length="standard",
    quality_threshold=0.7,
    max_retries=2,
    enable_notifications=True
)

Testing Strategy

Unit Tests

  • Location: tests/unit/
  • Coverage: 17+ tests for pipeline orchestration
  • Mocking: All external services mocked
  • Patterns: Async test patterns with proper fixtures
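The mocked-service async pattern boils down to `unittest.mock.AsyncMock` plus awaited-call assertions; a pytest-asyncio test looks the same minus the manual `asyncio.run`. Service and function names below are placeholders.

```python
import asyncio
from unittest.mock import AsyncMock

async def _test_pipeline_calls_ai_service():
    # Mock the external AI service so no network call happens
    ai_service = AsyncMock()
    ai_service.generate_summary.return_value = {"summary": "short"}

    async def run_stage(service, transcript):
        return await service.generate_summary(transcript)

    result = await run_stage(ai_service, "transcript text")
    ai_service.generate_summary.assert_awaited_once_with("transcript text")
    return result

result = asyncio.run(_test_pipeline_calls_ai_service())
```

`AsyncMock` records awaits, so `assert_awaited_once_with` verifies both that the call happened and that it was actually awaited.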

Integration Tests

  • Location: tests/integration/
  • Coverage: 20+ API endpoint scenarios
  • Testing: Full FastAPI integration with TestClient
  • Validation: Request/response validation and error handling

Running Tests

# From backend directory
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/unit/ -v
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/integration/ -v

# With coverage
python3 -m pytest tests/ --cov=backend --cov-report=html

Common Development Tasks

Adding New API Endpoints

  1. Create endpoint in appropriate api/ module
  2. Add business logic to services/ layer
  3. Update main.py to include router
  4. Add unit and integration tests
  5. Update API documentation

Adding New Services

  1. Create service class in services/
  2. Implement proper async patterns
  3. Add error handling with custom exceptions
  4. Create dependency injection function
  5. Add comprehensive unit tests

Debugging Pipeline Issues

# Enable detailed logging
import logging
logging.getLogger("backend").setLevel(logging.DEBUG)

# Check pipeline status
pipeline = get_summary_pipeline()
result = await pipeline.get_pipeline_result(job_id)
print(f"Status: {result.status}, Error: {result.error}")

# Monitor active jobs
active_jobs = pipeline.get_active_jobs()
print(f"Active jobs: {len(active_jobs)}")

Performance Optimization

Faster-Whisper Performance (MAJOR UPGRADE)

  • 20-32x Speed Improvement: CTranslate2 optimization engine provides massive speed gains
  • Large-v3-Turbo Model: Combines best accuracy with 5-8x additional speed over large-v3
  • Intelligent Processing: Voice Activity Detection reduces processing time by filtering silence
  • CPU Optimization: int8 quantization provides excellent performance even without GPU
  • GPU Acceleration: Automatic CUDA detection and utilization when available
  • Native MP3: Direct processing without audio conversion overhead
  • Real-time Performance: Typical 2-3x faster than realtime processing speeds

Benchmark Results (3.6 minute video):

  • Processing Time: 94 seconds (vs ~30+ minutes with OpenAI Whisper)
  • Quality Score: 1.000 (perfect transcription accuracy)
  • Confidence Score: 0.962 (very high confidence)
  • Speed Ratio: 2.3x faster than realtime

Async Patterns

  • All I/O operations use async/await
  • Background tasks for long-running operations
  • Connection pooling for external services
  • Proper exception handling to prevent blocking

Caching Strategy

  • Pipeline results cached for 1 hour
  • Transcript and metadata cached separately
  • Cache invalidation on video updates
  • Redis-ready for distributed caching

Cost Optimization

  • Claude 3.5 Haiku for 80% cost savings vs GPT-4
  • Intelligent chunking prevents token waste
  • Cost estimation and limits
  • Quality scoring to avoid unnecessary retries
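A back-of-envelope cost estimator illustrates the estimation-and-limits idea. The per-million-token rates below are placeholders (check Anthropic's current pricing page), and ~4 chars/token is a rough heuristic, not the real tokenizer.

```python
def estimate_cost_usd(transcript: str, expected_output_tokens: int = 1_000,
                      input_rate_per_m: float = 0.80,
                      output_rate_per_m: float = 4.00) -> float:
    """Approximate USD cost of summarizing a transcript at given per-MTok rates."""
    input_tokens = len(transcript) / 4  # rough chars-to-tokens heuristic
    cost = (input_tokens * input_rate_per_m
            + expected_output_tokens * output_rate_per_m) / 1_000_000
    return round(cost, 4)
```

An endpoint can reject or downgrade a request (e.g. force a brief summary) when the estimate exceeds a configured ceiling, rather than discovering the cost after the fact.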

Security Considerations

API Security

  • Environment variable for API keys
  • Input validation on all endpoints
  • Rate limiting (implement with Redis)
  • CORS configuration for frontend origins

Error Sanitization

# Never expose internal errors to clients
try:
    result = await process_request()  # hypothetical operation
except Exception as e:
    logger.error(f"Internal error: {e}")
    raise HTTPException(status_code=500, detail="Internal server error")

Content Validation

# Validate transcript length
if len(request.transcript.strip()) < 50:
    raise HTTPException(status_code=400, detail="Transcript too short")

Monitoring and Observability

Health Checks

  • /api/health - Service health status
  • /api/stats - Pipeline processing statistics
  • WebSocket connection monitoring
  • Background job tracking

Logging

  • Structured logging with JSON format
  • Error tracking with context
  • Performance metrics logging
  • Request/response logging (without sensitive data)

Metrics

# Built-in metrics
stats = {
    "active_jobs": len(pipeline.get_active_jobs()),
    "cache_stats": await cache_manager.get_cache_stats(),
    "notification_stats": notification_service.get_notification_stats(),
    "websocket_connections": websocket_manager.get_stats()
}

Deployment Considerations

Production Configuration

  • Use Redis for caching and session storage
  • Configure proper logging (structured JSON)
  • Set up health checks and monitoring
  • Use environment-specific configuration
  • Enable HTTPS and security headers

Scaling Patterns

  • Stateless design enables horizontal scaling
  • Background job processing via task queue
  • Database connection pooling
  • Load balancer health checks

Database Migrations & Epic 4 Features

Current Status: Epic 4 migration complete (add_epic_4_features)

Database Schema: 21 tables including Epic 4 features:

  • Multi-Agent Tables: agent_summaries, prompt_templates
  • Enhanced Export Tables: export_metadata, summary_sections
  • RAG Chat Tables: chat_sessions, chat_messages, video_chunks
  • Analytics Tables: playlist_analysis, rag_analytics, prompt_experiments

Migration Commands:

# Check migration status
python3 ../../scripts/utilities/migration_manager.py status

# Apply migrations (from backend directory)
PYTHONPATH=/Users/enias/projects/my-ai-projects/apps/youtube-summarizer \
  ../venv/bin/python3 -m alembic upgrade head

# Create new migration
python3 -m alembic revision --autogenerate -m "Add new feature"

Python 3.11 Requirement: Epic 4 requires Python 3.11+ for:

  • chromadb: Vector database for RAG functionality
  • sentence-transformers: Embedding generation for semantic search
  • aiohttp: Async HTTP client for DeepSeek API integration

Environment Setup:

# Remove old environment if needed
rm -rf venv

# Create Python 3.11 virtual environment
/opt/homebrew/bin/python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Install Epic 4 dependencies
pip install chromadb sentence-transformers aiohttp

# Verify installation
python --version  # Should show Python 3.11.x

Troubleshooting

Common Issues

"Pydantic validation error: Extra inputs are not permitted"

  • Issue: Environment variables not defined in Settings model
  • Solution: Add extra = "ignore" to Config class in core/config.py

"Table already exists" during migration

  • Issue: Database already has tables that migration tries to create
  • Solution: Use alembic stamp existing_revision then alembic upgrade head

"Multiple head revisions present"

  • Issue: Multiple migration branches need merging
  • Solution: Use alembic merge head1 head2 -m "Merge branches"

"Python 3.9 compatibility issues with Epic 4"

  • Issue: ChromaDB and modern AI libraries require Python 3.11+
  • Solution: Recreate virtual environment with Python 3.11 (see Environment Setup above)

"Anthropic API key not configured"

  • Solution: Set ANTHROPIC_API_KEY environment variable

"Mock data returned instead of real transcripts"

  • Check: USE_MOCK_SERVICES=false in .env
  • Solution: Set ENABLE_REAL_TRANSCRIPT_EXTRACTION=true

"404 Not Found for /api/transcripts/extract"

  • Check: Import statements in main.py
  • Solution: Use from backend.api.transcripts import router (not transcripts_stub)

"Radio button selection not working"

  • Issue: Circular state updates in React
  • Solution: Use ref tracking in useTranscriptSelector hook

"VAD filter removes all audio / 0 segments generated"

  • Issue: Voice Activity Detection too aggressive for music/instrumental content
  • Solution: Set VIDEO_DOWNLOAD_WHISPER_VAD_FILTER=false for music videos
  • Alternative: Use whisper_vad_filter=False in service configuration

"Faster-whisper model download fails"

  • Issue: Network issues downloading large-v3-turbo model from HuggingFace
  • Solution: The model will automatically fall back to standard large-v3
  • Check: Ensure internet connection for initial model download

"CPU transcription too slow"

  • Issue: CPU-only processing on large models
  • Solution: Use smaller model (base or small) or enable GPU acceleration
  • Config: VIDEO_DOWNLOAD_WHISPER_MODEL=base for faster CPU processing

Pipeline jobs stuck in "processing" state

  • Check: pipeline.get_active_jobs() for zombie jobs
  • Solution: Restart service or call cleanup endpoint

WebSocket connections not receiving updates

  • Check: WebSocket connection in browser dev tools
  • Solution: Verify WebSocket manager singleton initialization

High AI costs

  • Check: Summary length configuration and transcript sizes
  • Solution: Implement cost limits and brief summary defaults

Transcript extraction failures

  • Check: IntelligentVideoDownloader fallback chain logs
  • Solution: Review which tier failed and check API keys/dependencies

Debug Commands

# Pipeline debugging
from backend.services.summary_pipeline import SummaryPipeline
pipeline = SummaryPipeline(...)
result = await pipeline.get_pipeline_result("job_id")

# Cache debugging
from backend.services.cache_manager import CacheManager
cache = CacheManager()
stats = await cache.get_cache_stats()

# WebSocket debugging
from backend.core.websocket_manager import websocket_manager
connections = websocket_manager.get_stats()

This backend is designed for production use with comprehensive error handling, monitoring, and scalability patterns. All services follow async patterns and clean architecture principles.