# CLAUDE.md - YouTube Summarizer Backend

This file provides guidance to Claude Code when working with the YouTube Summarizer backend services.

## Backend Architecture Overview

The backend is built with FastAPI and follows a clean architecture pattern with clear separation of concerns:

```
backend/
├── api/         # API endpoints and request/response models
├── services/    # Business logic and external integrations
├── models/      # Data models and database schemas
├── core/        # Core utilities, exceptions, and configurations
└── tests/       # Unit and integration tests
```

## Key Services and Components

### Authentication System (Story 3.1 - COMPLETE ✅)

**Architecture**: Production-ready JWT-based authentication with Database Registry singleton pattern

**AuthService** (`services/auth_service.py`)
- JWT token generation and validation (access + refresh tokens)
- Password hashing with bcrypt and strength validation
- User registration with email verification workflow
- Password reset with secure token generation
- Session management and token refresh logic

**Database Registry Pattern** (`core/database_registry.py`)
- **CRITICAL FIX**: Resolves SQLAlchemy "Multiple classes found for path" errors
- Singleton pattern ensuring a single Base instance across the application
- Automatic model registration preventing table redefinition conflicts
- Thread-safe model management with registry cleanup for testing
- Production-ready architecture preventing relationship resolver issues

**Authentication Models** (`models/user.py`)
- User, RefreshToken, APIKey, EmailVerificationToken, PasswordResetToken
- Fully qualified relationship paths preventing SQLAlchemy conflicts
- String UUID fields for SQLite compatibility
- Proper model inheritance using the Database Registry Base

**Authentication API** (`api/auth.py`)
- Complete endpoint coverage: register, login, logout, refresh, verify email, reset password
- Comprehensive input validation and error handling
- Protected route dependencies and middleware
- Async/await patterns throughout

### Dual Transcript Services ✅ **NEW**

**DualTranscriptService** (`services/dual_transcript_service.py`)
- Orchestrates between YouTube captions and Whisper AI transcription
- Supports three extraction modes: `youtube`, `whisper`, `both`
- Parallel processing for comparison mode with real-time progress updates
- Advanced quality comparison with punctuation/capitalization analysis
- Processing time estimation and intelligent recommendation engine
- Seamless integration with existing TranscriptService

**FasterWhisperTranscriptService** (`services/faster_whisper_transcript_service.py`) ✅ **UPGRADED**
- **20-32x Speed Improvement**: Powered by faster-whisper (CTranslate2 optimization engine)
- **Large-v3-Turbo Model**: Best accuracy/speed balance with advanced AI capabilities
- **Intelligent Optimizations**: Voice Activity Detection (VAD), int8 quantization, GPU acceleration
- **Native MP3 Support**: No audio conversion needed, direct processing
- **Advanced Configuration**: Fully configurable via VideoDownloadConfig with environment variables
- **Production Features**: Async processing, intelligent chunking, comprehensive metadata
- **Performance Metrics**: Real-time speed ratios, processing time tracking, quality scoring

### Core Pipeline Services

**IntelligentVideoDownloader** (`services/intelligent_video_downloader.py`) ✅ **NEW**
- **9-Tier Transcript Extraction Fallback Chain**:
  1. YouTube Transcript API - Primary method using official API
  2. Auto-generated Captions - YouTube's automatic captions fallback
  3. Whisper AI Transcription - OpenAI Whisper for high-quality audio transcription
  4. PyTubeFix Downloader - Alternative YouTube library
  5. YT-DLP Downloader - Robust video/audio extraction tool
  6. Playwright Browser - Browser automation for JavaScript-rendered content
  7. External Tools - 4K Video Downloader CLI integration
  8. Web Services - Third-party transcript API services
  9. Transcript-Only - Metadata without full transcript as final fallback
- **Audio Retention System** for re-transcription capability
- **Intelligent method selection** based on success rates
- **Comprehensive error handling** with detailed logging
- **Performance telemetry** and health monitoring

**SummaryPipeline** (`services/summary_pipeline.py`)
- Main orchestration service for end-to-end video processing
- 7-stage async pipeline: URL validation → metadata extraction → transcript → analysis → summarization → quality validation → completion
- Integrates with IntelligentVideoDownloader for robust transcript extraction
- Intelligent content analysis and configuration optimization
- Real-time progress tracking via WebSocket
- Automatic retry logic with exponential backoff
- Quality scoring and validation system

**AnthropicSummarizer** (`services/anthropic_summarizer.py`)
- AI service integration using Claude 3.5 Haiku for cost efficiency
- Structured JSON output with fallback text parsing
- Token counting and cost estimation
- Intelligent chunking for long transcripts (up to 200k context)
- Comprehensive error handling and retry logic

**CacheManager** (`services/cache_manager.py`)
- Multi-level caching for pipeline results, transcripts, and metadata
- TTL-based expiration with automatic cleanup
- Redis-ready architecture for production scaling
- Configurable cache keys with collision prevention

**WebSocketManager** (`core/websocket_manager.py`)
- Singleton pattern for WebSocket connection management
- Job-specific connection tracking and broadcasting
- Real-time progress updates and completion notifications
- Heartbeat mechanism and stale connection cleanup

**NotificationService** (`services/notification_service.py`)
- Multi-type notifications (completion, error, progress, system)
- Notification history and statistics tracking
- Email/webhook integration ready architecture
- Configurable filtering and management

### API Layer

**Pipeline API** (`api/pipeline.py`)
- Complete pipeline management endpoints
- Process video with configuration options
- Status monitoring and job history
- Pipeline cancellation and cleanup
- Health checks and system statistics

**Summarization API** (`api/summarization.py`)
- Direct AI summarization endpoints
- Sync and async processing options
- Cost estimation and validation
- Background job management

**Dual Transcript API** (`api/transcripts.py`) ✅ **NEW**
- `POST /api/transcripts/dual/extract` - Start dual transcript extraction
- `GET /api/transcripts/dual/jobs/{job_id}` - Monitor extraction progress
- `POST /api/transcripts/dual/estimate` - Get processing time estimates
- `GET /api/transcripts/dual/compare/{video_id}` - Force comparison analysis
- Background job processing with real-time progress updates
- YouTube captions, Whisper AI, or both sources simultaneously

## Development Patterns

### Service Dependency Injection

```python
def get_summary_pipeline(
    video_service: VideoService = Depends(get_video_service),
    transcript_service: TranscriptService = Depends(get_transcript_service),
    ai_service: AnthropicSummarizer = Depends(get_ai_service),
    cache_manager: CacheManager = Depends(get_cache_manager),
    notification_service: NotificationService = Depends(get_notification_service)
) -> SummaryPipeline:
    return SummaryPipeline(...)
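
# The upstream providers referenced above (get_video_service, get_ai_service,
# etc.) are assumed to follow the same Depends pattern; a hedged sketch of one
# such factory, caching the instance so all requests share it (lru_cache is an
# illustrative choice here, not necessarily the project's actual approach):
from functools import lru_cache

@lru_cache()
def get_cache_manager() -> CacheManager:
    return CacheManager(default_ttl=3600)  # 1-hour default, as configured elsewhere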
```

### Database Registry Pattern (CRITICAL ARCHITECTURE)

**Problem Solved**: SQLAlchemy "Multiple classes found for path" relationship resolver errors

```python
# Always use the registry for model creation
from sqlalchemy.orm import relationship

from backend.core.database_registry import registry
from backend.models.base import Model

# Models inherit from Model (which uses registry.Base)
class User(Model):
    __tablename__ = "users"

    # Use fully qualified relationship paths to prevent conflicts
    summaries = relationship("backend.models.summary.Summary", back_populates="user")

# Registry ensures a single Base instance and safe model registration
registry.create_all_tables(engine)   # For table creation
registry.register_model(ModelClass)  # Automatic via BaseModel mixin
```

**Key Benefits**:
- Prevents SQLAlchemy table redefinition conflicts
- Thread-safe singleton pattern
- Automatic model registration and deduplication
- Production-ready architecture
- Clean testing with registry reset capabilities

### Authentication Pattern

```python
# Protected endpoint with user dependency
@router.post("/api/protected")
async def protected_endpoint(
    current_user: User = Depends(get_current_user),
    db: Session = Depends(get_db)
):
    return {"user_id": current_user.id}

# JWT token validation and refresh
from backend.services.auth_service import AuthService

auth_service = AuthService()
user = await auth_service.authenticate_user(email, password)
tokens = auth_service.create_access_token(user)
```

### Async Pipeline Pattern

```python
async def process_video(self, video_url: str, config: PipelineConfig = None) -> str:
    job_id = str(uuid.uuid4())
    result = PipelineResult(job_id=job_id, video_url=video_url, ...)
    self.active_jobs[job_id] = result

    # Start background processing
    asyncio.create_task(self._execute_pipeline(job_id, config))
    return job_id
```

### Error Handling Pattern

```python
try:
    result = await self.ai_service.generate_summary(request)
except AIServiceError as e:
    raise HTTPException(status_code=500, detail={
        "error": "AI service error",
        "message": e.message,
        "code": e.error_code
    })
```

## Configuration and Environment

### Required Environment Variables

```bash
# Core Services
ANTHROPIC_API_KEY=sk-ant-...                 # Required for AI summarization
YOUTUBE_API_KEY=AIza...                      # YouTube Data API v3 key
GOOGLE_API_KEY=AIza...                       # Google/Gemini API key

# Feature Flags
USE_MOCK_SERVICES=false                      # Disable mock services
ENABLE_REAL_TRANSCRIPT_EXTRACTION=true       # Enable real transcript extraction

# Video Download & Storage Configuration
VIDEO_DOWNLOAD_STORAGE_PATH=./video_storage  # Base storage directory
VIDEO_DOWNLOAD_KEEP_AUDIO_FILES=true         # Save audio for re-transcription
VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS=30         # Audio retention period
VIDEO_DOWNLOAD_MAX_STORAGE_GB=10             # Storage limit

# Faster-Whisper Configuration (20-32x Speed Improvement)
VIDEO_DOWNLOAD_WHISPER_MODEL=large-v3-turbo  # Model: 'large-v3-turbo', 'large-v3', 'medium', 'small', 'base'
VIDEO_DOWNLOAD_WHISPER_DEVICE=auto           # Device: 'auto', 'cpu', 'cuda'
VIDEO_DOWNLOAD_WHISPER_COMPUTE_TYPE=auto     # Compute: 'auto', 'int8', 'float16', 'float32'
VIDEO_DOWNLOAD_WHISPER_BEAM_SIZE=5           # Beam search size (1-10, higher = better quality)
VIDEO_DOWNLOAD_WHISPER_VAD_FILTER=true       # Voice Activity Detection (efficiency)
VIDEO_DOWNLOAD_WHISPER_WORD_TIMESTAMPS=true  # Word-level timestamps
VIDEO_DOWNLOAD_WHISPER_TEMPERATURE=0.0       # Sampling temperature (0 = deterministic)
VIDEO_DOWNLOAD_WHISPER_BEST_OF=5             # Number of candidates when sampling

# Dependencies: faster-whisper automatically handles dependencies
# pip install faster-whisper torch pydub yt-dlp pytubefix
# GPU acceleration: CUDA automatically detected and used when available
# Optional Configuration
DATABASE_URL=sqlite:///./data/app.db         # Database connection
REDIS_URL=redis://localhost:6379/0           # Cache backend (optional)
LOG_LEVEL=INFO                               # Logging level
CORS_ORIGINS=http://localhost:3000           # Frontend origins
```

### Service Configuration

Services are configured through dependency injection with sensible defaults:

```python
# Cost-optimized AI model
ai_service = AnthropicSummarizer(
    api_key=api_key,
    model="claude-3-5-haiku-20241022"  # Cost-effective choice
)

# Cache with TTL
cache_manager = CacheManager(default_ttl=3600)  # 1 hour default

# Pipeline with retry logic
config = PipelineConfig(
    summary_length="standard",
    quality_threshold=0.7,
    max_retries=2,
    enable_notifications=True
)
```

## Testing Strategy

### Unit Tests
- **Location**: `tests/unit/`
- **Coverage**: 17+ tests for pipeline orchestration
- **Mocking**: All external services mocked
- **Patterns**: Async test patterns with proper fixtures

### Integration Tests
- **Location**: `tests/integration/`
- **Coverage**: 20+ API endpoint scenarios
- **Testing**: Full FastAPI integration with TestClient
- **Validation**: Request/response validation and error handling

### Running Tests

```bash
# From backend directory
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/unit/ -v
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/integration/ -v

# With coverage
python3 -m pytest tests/ --cov=backend --cov-report=html
```

## Common Development Tasks

### Adding New API Endpoints
1. Create endpoint in appropriate `api/` module
2. Add business logic to `services/` layer
3. Update `main.py` to include router
4. Add unit and integration tests
5. Update API documentation

### Adding New Services
1. Create service class in `services/`
2. Implement proper async patterns
3. Add error handling with custom exceptions
4. Create dependency injection function
5. Add comprehensive unit tests

### Debugging Pipeline Issues

```python
# Enable detailed logging
import logging
logging.getLogger("backend").setLevel(logging.DEBUG)

# Check pipeline status
pipeline = get_summary_pipeline()
result = await pipeline.get_pipeline_result(job_id)
print(f"Status: {result.status}, Error: {result.error}")

# Monitor active jobs
active_jobs = pipeline.get_active_jobs()
print(f"Active jobs: {len(active_jobs)}")
```

## Performance Optimization

### Faster-Whisper Performance (✅ MAJOR UPGRADE)
- **20-32x Speed Improvement**: CTranslate2 optimization engine provides massive speed gains
- **Large-v3-Turbo Model**: Combines best accuracy with 5-8x additional speed over large-v3
- **Intelligent Processing**: Voice Activity Detection reduces processing time by filtering silence
- **CPU Optimization**: int8 quantization provides excellent performance even without GPU
- **GPU Acceleration**: Automatic CUDA detection and utilization when available
- **Native MP3**: Direct processing without audio conversion overhead
- **Real-time Performance**: Typically 2-3x faster than realtime processing speeds

**Benchmark Results** (3.6 minute video):
- **Processing Time**: 94 seconds (vs ~30+ minutes with OpenAI Whisper)
- **Quality Score**: 1.000 (perfect transcription accuracy)
- **Confidence Score**: 0.962 (very high confidence)
- **Speed Ratio**: 2.3x faster than realtime

### Async Patterns
- All I/O operations use async/await
- Background tasks for long-running operations
- Connection pooling for external services
- Proper exception handling to prevent blocking

### Caching Strategy
- Pipeline results cached for 1 hour
- Transcript and metadata cached separately
- Cache invalidation on video updates
- Redis-ready for distributed caching

### Cost Optimization
- Claude 3.5 Haiku for 80% cost savings vs GPT-4
- Intelligent chunking prevents token waste
- Cost estimation and limits
- Quality scoring to avoid unnecessary retries

## Security Considerations
### API Security
- Environment variables for API keys
- Input validation on all endpoints
- Rate limiting (implement with Redis)
- CORS configuration for frontend origins

### Error Sanitization

```python
# Never expose internal errors to clients
try:
    ...
except Exception as e:
    logger.error(f"Internal error: {e}")
    raise HTTPException(status_code=500, detail="Internal server error")
```

### Content Validation

```python
# Validate transcript length
if len(request.transcript.strip()) < 50:
    raise HTTPException(status_code=400, detail="Transcript too short")
```

## Monitoring and Observability

### Health Checks
- `/api/health` - Service health status
- `/api/stats` - Pipeline processing statistics
- WebSocket connection monitoring
- Background job tracking

### Logging
- Structured logging with JSON format
- Error tracking with context
- Performance metrics logging
- Request/response logging (without sensitive data)

### Metrics

```python
# Built-in metrics
stats = {
    "active_jobs": len(pipeline.get_active_jobs()),
    "cache_stats": await cache_manager.get_cache_stats(),
    "notification_stats": notification_service.get_notification_stats(),
    "websocket_connections": websocket_manager.get_stats()
}
```

## Deployment Considerations

### Production Configuration
- Use Redis for caching and session storage
- Configure proper logging (structured JSON)
- Set up health checks and monitoring
- Use environment-specific configuration
- Enable HTTPS and security headers

### Scaling Patterns
- Stateless design enables horizontal scaling
- Background job processing via task queue
- Database connection pooling
- Load balancer health checks

### Database Migrations & Epic 4 Features

**Current Status:** ✅ Epic 4 migration complete (add_epic_4_features)

**Database Schema:** 21 tables including Epic 4 features:
- **Multi-Agent Tables:** `agent_summaries`, `prompt_templates`
- **Enhanced Export Tables:** `export_metadata`, `summary_sections`
- **RAG Chat Tables:** `chat_sessions`, `chat_messages`, `video_chunks`
- **Analytics Tables:** `playlist_analysis`, `rag_analytics`, `prompt_experiments`

**Migration Commands:**

```bash
# Check migration status
python3 ../../scripts/utilities/migration_manager.py status

# Apply migrations (from backend directory)
PYTHONPATH=/Users/enias/projects/my-ai-projects/apps/youtube-summarizer \
  ../venv/bin/python3 -m alembic upgrade head

# Create new migration
python3 -m alembic revision --autogenerate -m "Add new feature"
```

**Python 3.11 Requirement:** Epic 4 requires Python 3.11+ for:
- `chromadb`: Vector database for RAG functionality
- `sentence-transformers`: Embedding generation for semantic search
- `aiohttp`: Async HTTP client for DeepSeek API integration

**Environment Setup:**

```bash
# Remove old environment if needed
rm -rf venv

# Create Python 3.11 virtual environment
/opt/homebrew/bin/python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Install Epic 4 dependencies
pip install chromadb sentence-transformers aiohttp

# Verify installation
python --version  # Should show Python 3.11.x
```

## Troubleshooting

### Common Issues

**"Pydantic validation error: Extra inputs are not permitted"**
- Issue: Environment variables not defined in Settings model
- Solution: Add `extra = "ignore"` to Config class in `core/config.py`

**"Table already exists" during migration**
- Issue: Database already has tables that migration tries to create
- Solution: Use `alembic stamp existing_revision` then `alembic upgrade head`

**"Multiple head revisions present"**
- Issue: Multiple migration branches need merging
- Solution: Use `alembic merge head1 head2 -m "Merge branches"`

**"Python 3.9 compatibility issues with Epic 4"**
- Issue: ChromaDB and modern AI libraries require Python 3.11+
- Solution: Recreate virtual environment with Python 3.11 (see Environment Setup above)

**"Anthropic API key not configured"**
- Solution: Set `ANTHROPIC_API_KEY` environment variable

**"Mock data returned instead of real transcripts"**
- Check: `USE_MOCK_SERVICES=false` in .env
- Solution: Set `ENABLE_REAL_TRANSCRIPT_EXTRACTION=true`

**"404 Not Found for /api/transcripts/extract"**
- Check: Import statements in main.py
- Solution: Use `from backend.api.transcripts import router` (not transcripts_stub)

**"Radio button selection not working"**
- Issue: Circular state updates in React
- Solution: Use ref tracking in useTranscriptSelector hook

**"VAD filter removes all audio / 0 segments generated"**
- Issue: Voice Activity Detection too aggressive for music/instrumental content
- Solution: Set `VIDEO_DOWNLOAD_WHISPER_VAD_FILTER=false` for music videos
- Alternative: Use `whisper_vad_filter=False` in service configuration

**"Faster-whisper model download fails"**
- Issue: Network issues downloading large-v3-turbo model from HuggingFace
- Solution: Model will automatically fall back to standard large-v3
- Check: Ensure internet connection for initial model download

**"CPU transcription too slow"**
- Issue: CPU-only processing on large models
- Solution: Use smaller model (`base` or `small`) or enable GPU acceleration
- Config: `VIDEO_DOWNLOAD_WHISPER_MODEL=base` for faster CPU processing

**Pipeline jobs stuck in "processing" state**
- Check: `pipeline.get_active_jobs()` for zombie jobs
- Solution: Restart service or call cleanup endpoint

**WebSocket connections not receiving updates**
- Check: WebSocket connection in browser dev tools
- Solution: Verify WebSocket manager singleton initialization

**High AI costs**
- Check: Summary length configuration and transcript sizes
- Solution: Implement cost limits and brief summary defaults

**Transcript extraction failures**
- Check: IntelligentVideoDownloader fallback chain logs
- Solution: Review which tier failed and check API keys/dependencies

### Debug Commands

```python
# Pipeline debugging
from backend.services.summary_pipeline import SummaryPipeline
pipeline = SummaryPipeline(...)
result = await pipeline.get_pipeline_result("job_id")

# Cache debugging
from backend.services.cache_manager import CacheManager
cache = CacheManager()
stats = await cache.get_cache_stats()

# WebSocket debugging
from backend.core.websocket_manager import websocket_manager
connections = websocket_manager.get_stats()
```

This backend is designed for production use with comprehensive error handling, monitoring, and scalability patterns. All services follow async patterns and clean architecture principles.
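As a closing illustration of the caching strategy described above, here is a minimal, hedged sketch of TTL-based expiration with lazy cleanup. It is illustrative only: the project's actual `CacheManager` adds multi-level caching, collision-safe keys, and Redis readiness, and the `TTLCache` name below is an assumption, not a class from this codebase.

```python
import time
from typing import Any, Optional


class TTLCache:
    """Toy in-memory sketch of TTL-based caching (not the real CacheManager)."""

    def __init__(self, default_ttl: int = 3600) -> None:
        self.default_ttl = default_ttl
        # key -> (expiry timestamp, cached value)
        self._store: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl: Optional[int] = None) -> None:
        expires_at = time.monotonic() + (ttl if ttl is not None else self.default_ttl)
        self._store[key] = (expires_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy cleanup: expired entries are dropped on read
            return None
        return value


cache = TTLCache(default_ttl=3600)
cache.set("transcript:abc123", "cached transcript")
print(cache.get("transcript:abc123"))  # → cached transcript
```

A real implementation would also sweep expired entries periodically rather than only on read, which is what the "automatic cleanup" bullet above refers to.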