CLAUDE.md - YouTube Summarizer Backend

This file provides guidance to Claude Code when working with the YouTube Summarizer backend services.

Backend Architecture Overview

The backend is built with FastAPI and follows a clean architecture pattern with clear separation of concerns:

backend/
├── api/                    # API endpoints and request/response models
├── services/              # Business logic and external integrations
├── models/                # Data models and database schemas
├── core/                  # Core utilities, exceptions, and configurations
└── tests/                 # Unit and integration tests

Key Services and Components

Authentication System (Story 3.1 - COMPLETE)

Architecture: Production-ready JWT-based authentication with Database Registry singleton pattern

AuthService (services/auth_service.py)

  • JWT token generation and validation (access + refresh tokens)
  • Password hashing with bcrypt and strength validation
  • User registration with email verification workflow
  • Password reset with secure token generation
  • Session management and token refresh logic
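To make the token flow concrete, here is a minimal stdlib sketch of HS256 JWT creation. This is illustrative only — the real AuthService uses its own library and signature; `create_access_token`'s parameters here are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    # JWT uses unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_access_token(user_id: str, secret: str, expires_in: int = 900) -> str:
    """Build a signed HS256 JWT; the real AuthService API may differ."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": user_id, "exp": int(time.time()) + expires_in}).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = _b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"

token = create_access_token("user-123", "dev-secret")
```

Refresh tokens follow the same shape with a longer `exp` and are persisted (see the RefreshToken model below) so they can be revoked.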

Database Registry Pattern (core/database_registry.py)

  • CRITICAL FIX: Resolves SQLAlchemy "Multiple classes found for path" errors
  • Singleton pattern ensuring single Base instance across application
  • Automatic model registration preventing table redefinition conflicts
  • Thread-safe model management with registry cleanup for testing
  • Production-ready architecture preventing relationship resolver issues

Authentication Models (models/user.py)

  • User, RefreshToken, APIKey, EmailVerificationToken, PasswordResetToken
  • Fully qualified relationship paths preventing SQLAlchemy conflicts
  • String UUID fields for SQLite compatibility
  • Proper model inheritance using Database Registry Base

Authentication API (api/auth.py)

  • Complete endpoint coverage: register, login, logout, refresh, verify email, reset password
  • Comprehensive input validation and error handling
  • Protected route dependencies and middleware
  • Async/await patterns throughout

Dual Transcript Services NEW

DualTranscriptService (services/dual_transcript_service.py)

  • Orchestrates between YouTube captions and Whisper AI transcription
  • Supports three extraction modes: youtube, whisper, both
  • Parallel processing for comparison mode with real-time progress updates
  • Advanced quality comparison with punctuation/capitalization analysis
  • Processing time estimation and intelligent recommendation engine
  • Seamless integration with existing TranscriptService
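The quality-comparison idea can be sketched with a simple heuristic: reward punctuation density and sentence capitalization, then recommend the higher-scoring source. The real DualTranscriptService scoring is more involved; this is a hypothetical reduction.

```python
def quality_score(text: str) -> float:
    """Score 0..1 rewarding punctuation density and sentence capitalization."""
    if not text:
        return 0.0
    words = text.split()
    punct = sum(text.count(c) for c in ".,!?;:")
    punct_ratio = min(punct / max(len(words), 1) * 5, 1.0)  # ~1 mark per 5 words saturates
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    capitalized = sum(1 for s in sentences if s[0].isupper())
    cap_ratio = capitalized / max(len(sentences), 1)
    return round(0.5 * punct_ratio + 0.5 * cap_ratio, 3)

def recommend(youtube_text: str, whisper_text: str) -> str:
    # Prefer Whisper only when its formatting quality is strictly better
    return "whisper" if quality_score(whisper_text) > quality_score(youtube_text) else "youtube"
```

Auto-generated captions typically lack punctuation and capitalization, which is why this kind of comparison tends to favor Whisper output for readability.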

FasterWhisperTranscriptService (services/faster_whisper_transcript_service.py) UPGRADED

  • 20-32x Speed Improvement: Powered by faster-whisper (CTranslate2 optimization engine)
  • Large-v3-Turbo Model: Best accuracy/speed balance with advanced AI capabilities
  • Intelligent Optimizations: Voice Activity Detection (VAD), int8 quantization, GPU acceleration
  • Native MP3 Support: No audio conversion needed, direct processing
  • Advanced Configuration: Fully configurable via VideoDownloadConfig with environment variables
  • Production Features: Async processing, intelligent chunking, comprehensive metadata
  • Performance Metrics: Real-time speed ratios, processing time tracking, quality scoring

Core Pipeline Services

IntelligentVideoDownloader (services/intelligent_video_downloader.py) NEW

  • 9-Tier Transcript Extraction Fallback Chain:
    1. YouTube Transcript API - Primary method using official API
    2. Auto-generated Captions - YouTube's automatic captions fallback
    3. Whisper AI Transcription - OpenAI Whisper for high-quality audio transcription
    4. PyTubeFix Downloader - Alternative YouTube library
    5. YT-DLP Downloader - Robust video/audio extraction tool
    6. Playwright Browser - Browser automation for JavaScript-rendered content
    7. External Tools - 4K Video Downloader CLI integration
    8. Web Services - Third-party transcript API services
    9. Transcript-Only - Metadata without full transcript as final fallback
  • Audio Retention System for re-transcription capability
  • Intelligent method selection based on success rates
  • Comprehensive error handling with detailed logging
  • Performance telemetry and health monitoring
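The tiered fallback can be sketched as a loop that tries each extraction method in order, logs failures, and falls through to the next tier. Method names below are placeholders, not the downloader's real internals.

```python
import logging

logger = logging.getLogger("backend")

def extract_with_fallback(video_url: str, methods: list) -> dict:
    """Try each tier in order; return the first non-empty result."""
    errors = []
    for tier, method in enumerate(methods, start=1):
        try:
            result = method(video_url)
            if result:
                return {"tier": tier, "method": method.__name__, "result": result}
        except Exception as exc:
            errors.append((tier, method.__name__, str(exc)))
            logger.warning("Tier %d (%s) failed: %s", tier, method.__name__, exc)
    raise RuntimeError(f"All {len(methods)} tiers failed: {errors}")
```

Tracking which tier succeeded per video is what enables the success-rate-based method selection mentioned above.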

SummaryPipeline (services/summary_pipeline.py)

  • Main orchestration service for end-to-end video processing
  • 7-stage async pipeline: URL validation → metadata extraction → transcript → analysis → summarization → quality validation → completion
  • Integrates with IntelligentVideoDownloader for robust transcript extraction
  • Intelligent content analysis and configuration optimization
  • Real-time progress tracking via WebSocket
  • Automatic retry logic with exponential backoff
  • Quality scoring and validation system
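A toy version of the 7-stage flow: run stages in order and report fractional progress after each. The real SummaryPipeline adds caching, retries with backoff, and WebSocket broadcasting; this sketch only shows the orchestration shape.

```python
import asyncio

STAGES = ["url_validation", "metadata", "transcript", "analysis",
          "summarization", "quality_validation", "completion"]

async def run_pipeline(video_url: str, on_progress=None) -> dict:
    """Run each stage in order, invoking on_progress(stage, fraction) after each."""
    state = {"video_url": video_url}
    for i, stage in enumerate(STAGES, start=1):
        await asyncio.sleep(0)  # stand-in for the real async work per stage
        state[stage] = "done"
        if on_progress:
            on_progress(stage, i / len(STAGES))
    return state
```

In the real service the `on_progress` hook is where WebSocket updates are broadcast to clients watching the job.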

AnthropicSummarizer (services/anthropic_summarizer.py)

  • AI service integration using Claude 3.5 Haiku for cost efficiency
  • Structured JSON output with fallback text parsing
  • Token counting and cost estimation
  • Intelligent chunking for long transcripts (up to 200k context)
  • Comprehensive error handling and retry logic
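The chunking idea can be sketched by splitting on sentence boundaries under an approximate token budget. The ~4 characters-per-token ratio is a common rule of thumb, not Anthropic's tokenizer, and the budget below is an assumed default.

```python
def chunk_transcript(text: str, max_tokens: int = 180_000, chars_per_token: int = 4) -> list[str]:
    """Split a transcript into chunks that each fit an approximate token budget."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return [text]
    chunks, current, size = [], [], 0
    for sentence in text.split(". "):
        piece = sentence + ". "
        if size + len(piece) > budget and current:
            chunks.append("".join(current).strip())
            current, size = [], 0
        current.append(piece)
        size += len(piece)
    if current:
        chunks.append("".join(current).strip())
    return chunks
```

Each chunk would be summarized independently and the partial summaries merged in a final pass; exact token counting should use the provider's own counting API when available.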

CacheManager (services/cache_manager.py)

  • Multi-level caching for pipeline results, transcripts, and metadata
  • TTL-based expiration with automatic cleanup
  • Redis-ready architecture for production scaling
  • Configurable cache keys with collision prevention

WebSocketManager (core/websocket_manager.py)

  • Singleton pattern for WebSocket connection management
  • Job-specific connection tracking and broadcasting
  • Real-time progress updates and completion notifications
  • Heartbeat mechanism and stale connection cleanup
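The singleton plus job-scoped broadcast pattern might look like the following sketch, where a stand-in `send_json` coroutine takes the place of real WebSocket objects (names here are hypothetical, not the actual WebSocketManager API):

```python
import asyncio

class JobConnectionManager:
    _instance = None  # process-wide singleton, mirroring the pattern above

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.connections = {}  # job_id -> set of websockets
        return cls._instance

    def connect(self, job_id: str, ws):
        self.connections.setdefault(job_id, set()).add(ws)

    def disconnect(self, job_id: str, ws):
        self.connections.get(job_id, set()).discard(ws)

    async def broadcast(self, job_id: str, message: dict):
        # Copy the set so a disconnect during iteration is safe
        for ws in list(self.connections.get(job_id, [])):
            await ws.send_json(message)
```

Scoping connections by `job_id` means a progress update only reaches clients watching that job, not every open socket.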

NotificationService (services/notification_service.py)

  • Multi-type notifications (completion, error, progress, system)
  • Notification history and statistics tracking
  • Email/webhook integration ready architecture
  • Configurable filtering and management

API Layer

Pipeline API (api/pipeline.py)

  • Complete pipeline management endpoints
  • Process video with configuration options
  • Status monitoring and job history
  • Pipeline cancellation and cleanup
  • Health checks and system statistics

Summarization API (api/summarization.py)

  • Direct AI summarization endpoints
  • Sync and async processing options
  • Cost estimation and validation
  • Background job management

Dual Transcript API (api/transcripts.py) NEW

  • POST /api/transcripts/dual/extract - Start dual transcript extraction
  • GET /api/transcripts/dual/jobs/{job_id} - Monitor extraction progress
  • POST /api/transcripts/dual/estimate - Get processing time estimates
  • GET /api/transcripts/dual/compare/{video_id} - Force comparison analysis
  • Background job processing with real-time progress updates
  • Extracts from YouTube captions, Whisper AI, or both sources simultaneously

Development Patterns

Service Dependency Injection

from fastapi import Depends

def get_summary_pipeline(
    video_service: VideoService = Depends(get_video_service),
    transcript_service: TranscriptService = Depends(get_transcript_service),
    ai_service: AnthropicSummarizer = Depends(get_ai_service),
    cache_manager: CacheManager = Depends(get_cache_manager),
    notification_service: NotificationService = Depends(get_notification_service)
) -> SummaryPipeline:
    return SummaryPipeline(...)

Database Registry Pattern (CRITICAL ARCHITECTURE)

Problem Solved: SQLAlchemy "Multiple classes found for path" relationship resolver errors

# Always use the registry for model creation
from backend.core.database_registry import registry
from backend.models.base import Model

# Models inherit from Model (which uses registry.Base)
class User(Model):
    __tablename__ = "users"
    # Use fully qualified relationship paths to prevent conflicts
    summaries = relationship("backend.models.summary.Summary", back_populates="user")

# Registry ensures single Base instance and safe model registration
registry.create_all_tables(engine)  # For table creation
registry.register_model(ModelClass)  # Automatic via BaseModel mixin

Key Benefits:

  • Prevents SQLAlchemy table redefinition conflicts
  • Thread-safe singleton pattern
  • Automatic model registration and deduplication
  • Production-ready architecture
  • Clean testing with registry reset capabilities

Authentication Pattern

# Protected endpoint with user dependency
@router.post("/api/protected")
async def protected_endpoint(
    current_user: User = Depends(get_current_user),
    db: Session = Depends(get_db)
):
    return {"user_id": current_user.id}

# JWT token validation and refresh
from backend.services.auth_service import AuthService
auth_service = AuthService()
user = await auth_service.authenticate_user(email, password)
tokens = auth_service.create_access_token(user)

Async Pipeline Pattern

async def process_video(self, video_url: str, config: PipelineConfig = None) -> str:
    job_id = str(uuid.uuid4())
    result = PipelineResult(job_id=job_id, video_url=video_url, ...)
    self.active_jobs[job_id] = result
    
    # Start background processing
    asyncio.create_task(self._execute_pipeline(job_id, config))
    return job_id

Error Handling Pattern

try:
    result = await self.ai_service.generate_summary(request)
except AIServiceError as e:
    raise HTTPException(status_code=500, detail={
        "error": "AI service error",
        "message": e.message,
        "code": e.error_code
    })

Configuration and Environment

Required Environment Variables

# Core Services
ANTHROPIC_API_KEY=sk-ant-...           # Required for AI summarization
YOUTUBE_API_KEY=AIza...                # YouTube Data API v3 key
GOOGLE_API_KEY=AIza...                 # Google/Gemini API key

# Feature Flags
USE_MOCK_SERVICES=false                # Disable mock services
ENABLE_REAL_TRANSCRIPT_EXTRACTION=true # Enable real transcript extraction

# Video Download & Storage Configuration
VIDEO_DOWNLOAD_STORAGE_PATH=./video_storage     # Base storage directory
VIDEO_DOWNLOAD_KEEP_AUDIO_FILES=true           # Save audio for re-transcription
VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS=30           # Audio retention period
VIDEO_DOWNLOAD_MAX_STORAGE_GB=10               # Storage limit

# Faster-Whisper Configuration (20-32x Speed Improvement)
VIDEO_DOWNLOAD_WHISPER_MODEL=large-v3-turbo    # Model: 'large-v3-turbo', 'large-v3', 'medium', 'small', 'base'
VIDEO_DOWNLOAD_WHISPER_DEVICE=auto             # Device: 'auto', 'cpu', 'cuda'
VIDEO_DOWNLOAD_WHISPER_COMPUTE_TYPE=auto       # Compute: 'auto', 'int8', 'float16', 'float32'
VIDEO_DOWNLOAD_WHISPER_BEAM_SIZE=5             # Beam search size (1-10, higher = better quality)
VIDEO_DOWNLOAD_WHISPER_VAD_FILTER=true         # Voice Activity Detection (efficiency)
VIDEO_DOWNLOAD_WHISPER_WORD_TIMESTAMPS=true    # Word-level timestamps
VIDEO_DOWNLOAD_WHISPER_TEMPERATURE=0.0         # Sampling temperature (0 = deterministic)
VIDEO_DOWNLOAD_WHISPER_BEST_OF=5               # Number of candidates when sampling

# Dependencies: faster-whisper automatically handles dependencies
# pip install faster-whisper torch pydub yt-dlp pytubefix
# GPU acceleration: CUDA automatically detected and used when available

# Optional Configuration
DATABASE_URL=sqlite:///./data/app.db   # Database connection
REDIS_URL=redis://localhost:6379/0    # Cache backend (optional)
LOG_LEVEL=INFO                         # Logging level
CORS_ORIGINS=http://localhost:3000     # Frontend origins
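An illustrative shape for how these variables might be modeled in `core/config.py` (field names are assumptions, not the project's actual Settings class); `extra = "ignore"` is what prevents the "Extra inputs are not permitted" validation error covered in Troubleshooting:

```python
from pydantic_settings import BaseSettings  # pydantic v1: from pydantic import BaseSettings

class Settings(BaseSettings):
    anthropic_api_key: str
    youtube_api_key: str = ""
    database_url: str = "sqlite:///./data/app.db"
    log_level: str = "INFO"
    use_mock_services: bool = False

    class Config:
        env_file = ".env"
        extra = "ignore"  # tolerate unmodeled VIDEO_DOWNLOAD_* variables
```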

Service Configuration

Services are configured through dependency injection with sensible defaults:

# Cost-optimized AI model
ai_service = AnthropicSummarizer(
    api_key=api_key, 
    model="claude-3-5-haiku-20241022"  # Cost-effective choice
)

# Cache with TTL
cache_manager = CacheManager(default_ttl=3600)  # 1 hour default

# Pipeline with retry logic
config = PipelineConfig(
    summary_length="standard",
    quality_threshold=0.7,
    max_retries=2,
    enable_notifications=True
)

Testing Strategy

Unit Tests

  • Location: tests/unit/
  • Coverage: 17+ tests for pipeline orchestration
  • Mocking: All external services mocked
  • Patterns: Async test patterns with proper fixtures
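The mocked-service async pattern boils down to `unittest.mock.AsyncMock` plus awaited-call assertions; a pytest-asyncio test looks the same minus the manual `asyncio.run`. Service and function names below are placeholders.

```python
import asyncio
from unittest.mock import AsyncMock

async def _test_pipeline_calls_ai_service():
    # Mock the external AI service so no network call happens
    ai_service = AsyncMock()
    ai_service.generate_summary.return_value = {"summary": "short"}

    async def run_stage(service, transcript):
        return await service.generate_summary(transcript)

    result = await run_stage(ai_service, "transcript text")
    ai_service.generate_summary.assert_awaited_once_with("transcript text")
    return result

result = asyncio.run(_test_pipeline_calls_ai_service())
```

`AsyncMock` records awaits, so `assert_awaited_once_with` verifies both that the call happened and that it was actually awaited.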

Integration Tests

  • Location: tests/integration/
  • Coverage: 20+ API endpoint scenarios
  • Testing: Full FastAPI integration with TestClient
  • Validation: Request/response validation and error handling

Running Tests

# From backend directory
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/unit/ -v
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/integration/ -v

# With coverage
python3 -m pytest tests/ --cov=backend --cov-report=html

Common Development Tasks

Adding New API Endpoints

  1. Create endpoint in appropriate api/ module
  2. Add business logic to services/ layer
  3. Update main.py to include router
  4. Add unit and integration tests
  5. Update API documentation

Adding New Services

  1. Create service class in services/
  2. Implement proper async patterns
  3. Add error handling with custom exceptions
  4. Create dependency injection function
  5. Add comprehensive unit tests

Debugging Pipeline Issues

# Enable detailed logging
import logging
logging.getLogger("backend").setLevel(logging.DEBUG)

# Check pipeline status
pipeline = get_summary_pipeline()
result = await pipeline.get_pipeline_result(job_id)
print(f"Status: {result.status}, Error: {result.error}")

# Monitor active jobs
active_jobs = pipeline.get_active_jobs()
print(f"Active jobs: {len(active_jobs)}")

Performance Optimization

Faster-Whisper Performance (MAJOR UPGRADE)

  • 20-32x Speed Improvement: CTranslate2 optimization engine provides massive speed gains
  • Large-v3-Turbo Model: Combines best accuracy with 5-8x additional speed over large-v3
  • Intelligent Processing: Voice Activity Detection reduces processing time by filtering silence
  • CPU Optimization: int8 quantization provides excellent performance even without GPU
  • GPU Acceleration: Automatic CUDA detection and utilization when available
  • Native MP3: Direct processing without audio conversion overhead
  • Real-time Performance: Typical 2-3x faster than realtime processing speeds

Benchmark Results (3.6 minute video):

  • Processing Time: 94 seconds (vs ~30+ minutes with OpenAI Whisper)
  • Quality Score: 1.000 (perfect transcription accuracy)
  • Confidence Score: 0.962 (very high confidence)
  • Speed Ratio: 2.3x faster than realtime

Async Patterns

  • All I/O operations use async/await
  • Background tasks for long-running operations
  • Connection pooling for external services
  • Proper exception handling to prevent blocking

Caching Strategy

  • Pipeline results cached for 1 hour
  • Transcript and metadata cached separately
  • Cache invalidation on video updates
  • Redis-ready for distributed caching

Cost Optimization

  • Claude 3.5 Haiku for 80% cost savings vs GPT-4
  • Intelligent chunking prevents token waste
  • Cost estimation and limits
  • Quality scoring to avoid unnecessary retries
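A back-of-envelope cost estimator illustrates the estimation-and-limits idea. The per-million-token rates below are placeholders (check Anthropic's current pricing page), and ~4 chars/token is a rough heuristic, not the real tokenizer.

```python
def estimate_cost_usd(transcript: str, expected_output_tokens: int = 1_000,
                      input_rate_per_m: float = 0.80,
                      output_rate_per_m: float = 4.00) -> float:
    """Approximate USD cost of summarizing a transcript at given per-MTok rates."""
    input_tokens = len(transcript) / 4  # rough chars-to-tokens heuristic
    cost = (input_tokens * input_rate_per_m
            + expected_output_tokens * output_rate_per_m) / 1_000_000
    return round(cost, 4)
```

An endpoint can reject or downgrade a request (e.g. force a brief summary) when the estimate exceeds a configured ceiling, rather than discovering the cost after the fact.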

Security Considerations

API Security

  • Environment variable for API keys
  • Input validation on all endpoints
  • Rate limiting (implement with Redis)
  • CORS configuration for frontend origins

Error Sanitization

# Never expose internal errors to clients
try:
    result = await process_request()  # hypothetical operation
except Exception as e:
    logger.error(f"Internal error: {e}")
    raise HTTPException(status_code=500, detail="Internal server error")

Content Validation

# Validate transcript length
if len(request.transcript.strip()) < 50:
    raise HTTPException(status_code=400, detail="Transcript too short")

Monitoring and Observability

Health Checks

  • /api/health - Service health status
  • /api/stats - Pipeline processing statistics
  • WebSocket connection monitoring
  • Background job tracking

Logging

  • Structured logging with JSON format
  • Error tracking with context
  • Performance metrics logging
  • Request/response logging (without sensitive data)

Metrics

# Built-in metrics
stats = {
    "active_jobs": len(pipeline.get_active_jobs()),
    "cache_stats": await cache_manager.get_cache_stats(),
    "notification_stats": notification_service.get_notification_stats(),
    "websocket_connections": websocket_manager.get_stats()
}

Deployment Considerations

Production Configuration

  • Use Redis for caching and session storage
  • Configure proper logging (structured JSON)
  • Set up health checks and monitoring
  • Use environment-specific configuration
  • Enable HTTPS and security headers

Scaling Patterns

  • Stateless design enables horizontal scaling
  • Background job processing via task queue
  • Database connection pooling
  • Load balancer health checks

Database Migrations & Epic 4 Features

Current Status: Epic 4 migration complete (add_epic_4_features)

Database Schema: 21 tables including Epic 4 features:

  • Multi-Agent Tables: agent_summaries, prompt_templates
  • Enhanced Export Tables: export_metadata, summary_sections
  • RAG Chat Tables: chat_sessions, chat_messages, video_chunks
  • Analytics Tables: playlist_analysis, rag_analytics, prompt_experiments

Migration Commands:

# Check migration status
python3 ../../scripts/utilities/migration_manager.py status

# Apply migrations (from backend directory)
PYTHONPATH=/Users/enias/projects/my-ai-projects/apps/youtube-summarizer \
  ../venv/bin/python3 -m alembic upgrade head

# Create new migration
python3 -m alembic revision --autogenerate -m "Add new feature"

Python 3.11 Requirement: Epic 4 requires Python 3.11+ for:

  • chromadb: Vector database for RAG functionality
  • sentence-transformers: Embedding generation for semantic search
  • aiohttp: Async HTTP client for DeepSeek API integration

Environment Setup:

# Remove old environment if needed
rm -rf venv

# Create Python 3.11 virtual environment
/opt/homebrew/bin/python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Install Epic 4 dependencies
pip install chromadb sentence-transformers aiohttp

# Verify installation
python --version  # Should show Python 3.11.x

Troubleshooting

Common Issues

"Pydantic validation error: Extra inputs are not permitted"

  • Issue: Environment variables not defined in Settings model
  • Solution: Add extra = "ignore" to Config class in core/config.py

"Table already exists" during migration

  • Issue: Database already has tables that migration tries to create
  • Solution: Use alembic stamp existing_revision then alembic upgrade head

"Multiple head revisions present"

  • Issue: Multiple migration branches need merging
  • Solution: Use alembic merge head1 head2 -m "Merge branches"

"Python 3.9 compatibility issues with Epic 4"

  • Issue: ChromaDB and modern AI libraries require Python 3.11+
  • Solution: Recreate virtual environment with Python 3.11 (see Environment Setup above)

"Anthropic API key not configured"

  • Solution: Set ANTHROPIC_API_KEY environment variable

"Mock data returned instead of real transcripts"

  • Check: USE_MOCK_SERVICES=false in .env
  • Solution: Set ENABLE_REAL_TRANSCRIPT_EXTRACTION=true

"404 Not Found for /api/transcripts/extract"

  • Check: Import statements in main.py
  • Solution: Use from backend.api.transcripts import router (not transcripts_stub)

"Radio button selection not working"

  • Issue: Circular state updates in React
  • Solution: Use ref tracking in useTranscriptSelector hook

"VAD filter removes all audio / 0 segments generated"

  • Issue: Voice Activity Detection too aggressive for music/instrumental content
  • Solution: Set VIDEO_DOWNLOAD_WHISPER_VAD_FILTER=false for music videos
  • Alternative: Use whisper_vad_filter=False in service configuration

"Faster-whisper model download fails"

  • Issue: Network issues downloading large-v3-turbo model from HuggingFace
  • Solution: The model will automatically fall back to standard large-v3
  • Check: Ensure internet connection for initial model download

"CPU transcription too slow"

  • Issue: CPU-only processing on large models
  • Solution: Use smaller model (base or small) or enable GPU acceleration
  • Config: VIDEO_DOWNLOAD_WHISPER_MODEL=base for faster CPU processing

Pipeline jobs stuck in "processing" state

  • Check: pipeline.get_active_jobs() for zombie jobs
  • Solution: Restart service or call cleanup endpoint

WebSocket connections not receiving updates

  • Check: WebSocket connection in browser dev tools
  • Solution: Verify WebSocket manager singleton initialization

High AI costs

  • Check: Summary length configuration and transcript sizes
  • Solution: Implement cost limits and brief summary defaults

Transcript extraction failures

  • Check: IntelligentVideoDownloader fallback chain logs
  • Solution: Review which tier failed and check API keys/dependencies

Debug Commands

# Pipeline debugging
from backend.services.summary_pipeline import SummaryPipeline
pipeline = SummaryPipeline(...)
result = await pipeline.get_pipeline_result("job_id")

# Cache debugging
from backend.services.cache_manager import CacheManager
cache = CacheManager()
stats = await cache.get_cache_stats()

# WebSocket debugging
from backend.core.websocket_manager import websocket_manager
connections = websocket_manager.get_stats()

This backend is designed for production use with comprehensive error handling, monitoring, and scalability patterns. All services follow async patterns and clean architecture principles.