CLAUDE.md - YouTube Summarizer Backend

This file provides guidance to Claude Code when working with the YouTube Summarizer backend services.

Backend Architecture Overview

The backend is built with FastAPI and follows a clean architecture pattern with clear separation of concerns:

backend/
├── api/                    # API endpoints and request/response models
├── services/              # Business logic and external integrations
├── models/                # Data models and database schemas
├── core/                  # Core utilities, exceptions, and configurations
└── tests/                 # Unit and integration tests

Key Services and Components

Authentication System (Story 3.1 - COMPLETE ✅)

Architecture: Production-ready JWT-based authentication with Database Registry singleton pattern

AuthService (services/auth_service.py)

JWT token generation and validation (access + refresh tokens)
Password hashing with bcrypt and strength validation
User registration with email verification workflow
Password reset with secure token generation
Session management and token refresh logic
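
The password reset step can be illustrated with the standard library alone. This is a minimal sketch, not the actual AuthService code: `issue_reset_token`, `verify_reset_token`, the record layout, and the one-hour lifetime are hypothetical choices made for illustration.

```python
import hashlib
import hmac
import secrets
import time

RESET_TOKEN_TTL_SECONDS = 3600  # assumed lifetime, not taken from the project


def _digest(token: str) -> str:
    """Hash the token so the raw value never needs to be stored."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_reset_token(now=None):
    """Return (raw_token, record). Only the record would be persisted."""
    now = time.time() if now is None else now
    raw_token = secrets.token_urlsafe(32)
    record = {
        "token_hash": _digest(raw_token),
        "expires_at": now + RESET_TOKEN_TTL_SECONDS,
    }
    return raw_token, record


def verify_reset_token(raw_token, record, now=None) -> bool:
    """Reject expired tokens, then compare hashes in constant time."""
    now = time.time() if now is None else now
    if now >= record["expires_at"]:
        return False
    return hmac.compare_digest(_digest(raw_token), record["token_hash"])
```

Storing only the hash means a leaked database row cannot be replayed as a valid reset link.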

Database Registry Pattern (core/database_registry.py)

CRITICAL FIX: Resolves SQLAlchemy "Multiple classes found for path" errors
Singleton pattern ensuring single Base instance across application
Automatic model registration preventing table redefinition conflicts
Thread-safe model management with registry cleanup for testing
Production-ready architecture preventing relationship resolver issues

Authentication Models (models/user.py)

User, RefreshToken, APIKey, EmailVerificationToken, PasswordResetToken
Fully qualified relationship paths preventing SQLAlchemy conflicts
String UUID fields for SQLite compatibility
Proper model inheritance using Database Registry Base

Authentication API (api/auth.py)

Complete endpoint coverage: register, login, logout, refresh, verify email, reset password
Comprehensive input validation and error handling
Protected route dependencies and middleware
Async/await patterns throughout

Dual Transcript Services ✅ NEW

DualTranscriptService (services/dual_transcript_service.py)

Orchestrates between YouTube captions and Whisper AI transcription
Supports three extraction modes: youtube, whisper, both
Parallel processing for comparison mode with real-time progress updates
Advanced quality comparison with punctuation/capitalization analysis
Processing time estimation and intelligent recommendation engine
Seamless integration with existing TranscriptService
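
The three extraction modes can be sketched as a small dispatcher. The two source functions below are hypothetical stubs standing in for the real caption and Whisper paths; only the mode names come from this document.

```python
import asyncio


async def fetch_youtube_captions(video_id: str) -> str:
    """Stand-in for the YouTube caption path (hypothetical stub)."""
    await asyncio.sleep(0)
    return f"captions for {video_id}"


async def transcribe_with_whisper(video_id: str) -> str:
    """Stand-in for the Whisper path (hypothetical stub)."""
    await asyncio.sleep(0)
    return f"whisper transcript for {video_id}"


async def extract(video_id: str, mode: str) -> dict:
    """Dispatch on the mode; 'both' runs the two sources concurrently."""
    if mode == "youtube":
        return {"youtube": await fetch_youtube_captions(video_id)}
    if mode == "whisper":
        return {"whisper": await transcribe_with_whisper(video_id)}
    if mode == "both":
        youtube, whisper = await asyncio.gather(
            fetch_youtube_captions(video_id),
            transcribe_with_whisper(video_id),
        )
        return {"youtube": youtube, "whisper": whisper}
    raise ValueError(f"unknown mode: {mode!r}")
```

With `asyncio.gather`, comparison mode takes roughly as long as the slower source rather than the sum of both.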

WhisperTranscriptService (services/whisper_transcript_service.py)

OpenAI Whisper integration for high-quality YouTube video transcription
Async audio download via yt-dlp with automatic cleanup
Intelligent chunking for long videos (30-minute segments with overlap)
Device detection (CPU/CUDA) for optimal performance
Quality and confidence scoring algorithms
Production-ready error handling and resource management
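
Chunking with overlap comes down to boundary arithmetic. The sketch below assumes a 30-second overlap, which this document does not specify, and `chunk_boundaries` is a hypothetical helper name.

```python
def chunk_boundaries(duration_s: float, segment_s: float = 1800.0, overlap_s: float = 30.0):
    """Return (start, end) pairs in seconds that cover the whole audio.

    Each segment after the first starts overlap_s before the previous one
    ended, so words cut at a boundary appear in both neighbours.
    """
    if segment_s <= overlap_s:
        raise ValueError("segment length must exceed the overlap")
    boundaries = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        boundaries.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back so neighbours share audio
    return boundaries
```

For a 4000-second video this yields three segments: 0 to 1800, 1770 to 3570, and 3540 to 4000.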

Core Pipeline Services

IntelligentVideoDownloader (services/intelligent_video_downloader.py) ✅ NEW

9-Tier Transcript Extraction Fallback Chain:
1. YouTube Transcript API - Primary method using official API
2. Auto-generated Captions - YouTube's automatic captions fallback
3. Whisper AI Transcription - OpenAI Whisper for high-quality audio transcription
4. PyTubeFix Downloader - Alternative YouTube library
5. YT-DLP Downloader - Robust video/audio extraction tool
6. Playwright Browser - Browser automation for JavaScript-rendered content
7. External Tools - 4K Video Downloader CLI integration
8. Web Services - Third-party transcript API services
9. Transcript-Only - Metadata without full transcript as final fallback
Audio Retention System for re-transcription capability
Intelligent method selection based on success rates
Comprehensive error handling with detailed logging
Performance telemetry and health monitoring
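
The fallback chain above reduces to trying each tier in order and stopping at the first usable result. This is a sketch under that assumption: `run_fallback_chain` and the `(name, extractor)` tier shape are hypothetical, and the real downloader may reorder tiers by success rate.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# An extractor takes a video id and returns a transcript, or None if it has nothing
Extractor = Callable[[str], Awaitable[Optional[str]]]


async def run_fallback_chain(video_id: str, tiers: list):
    """Try each (name, extractor) tier in order; return the first success.

    Returns (tier_name, transcript). Raises RuntimeError listing every
    failure reason when no tier produces a transcript.
    """
    errors = {}
    for name, extractor in tiers:
        try:
            transcript = await extractor(video_id)
        except Exception as exc:  # a failing tier must not stop the chain
            errors[name] = repr(exc)
            continue
        if transcript:
            return name, transcript
        errors[name] = "empty result"
    raise RuntimeError(f"all tiers failed: {errors}")
```

Collecting the per-tier error makes it possible to log which tier failed and why, which is what the troubleshooting steps for extraction failures rely on.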

SummaryPipeline (services/summary_pipeline.py)

Main orchestration service for end-to-end video processing
7-stage async pipeline: URL validation → metadata extraction → transcript → analysis → summarization → quality validation → completion
Integrates with IntelligentVideoDownloader for robust transcript extraction
Intelligent content analysis and configuration optimization
Real-time progress tracking via WebSocket
Automatic retry logic with exponential backoff
Quality scoring and validation system
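
Retry with exponential backoff can be sketched as follows. `retry_with_backoff`, the base delay, and the jitter range are illustrative assumptions. The default of two retries mirrors the `max_retries=2` setting shown in the Service Configuration section.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_retries=2, base_delay=1.0, sleep=asyncio.sleep):
    """Await operation() until it succeeds or the retries are exhausted.

    The wait doubles after every failure (base, 2*base, 4*base, ...) with a
    little random jitter so concurrent jobs do not retry in lockstep.
    """
    attempt = 0
    while True:
        try:
            return await operation()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay * 0.1)
            await sleep(delay)
            attempt += 1
```

The `sleep` parameter exists only so tests can replace the real delay with a recorder.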

AnthropicSummarizer (services/anthropic_summarizer.py)

AI service integration using Claude 3.5 Haiku for cost efficiency
Structured JSON output with fallback text parsing
Token counting and cost estimation
Intelligent chunking for long transcripts (up to a 200k-token context window)
Comprehensive error handling and retry logic
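
Structured output with a text fallback might look like the sketch below. The `summary` and `key_points` field names and the `parse_summary_response` helper are assumptions for illustration; the real response schema is not shown in this document.

```python
import json
import re

BULLET_CHARS = "-*•"


def parse_summary_response(raw: str) -> dict:
    """Prefer structured JSON; otherwise treat the reply as plain text."""
    # Replies sometimes wrap JSON in prose, so take the outermost braces
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict) and "summary" in data:
            return {
                "summary": data["summary"],
                "key_points": data.get("key_points", []),
                "structured": True,
            }
    # Fallback: first non-bullet line is the summary, bullet lines are key points
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    key_points = [l.lstrip(BULLET_CHARS + " ") for l in lines if l[0] in BULLET_CHARS]
    summary = next((l for l in lines if l[0] not in BULLET_CHARS), "")
    return {"summary": summary, "key_points": key_points, "structured": False}
```

The `structured` flag lets callers lower a quality score when the fallback path was taken.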

CacheManager (services/cache_manager.py)

Multi-level caching for pipeline results, transcripts, and metadata
TTL-based expiration with automatic cleanup
Redis-ready architecture for production scaling
Configurable cache keys with collision prevention
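
TTL expiry and collision-resistant keys can be sketched in memory as below. `TTLCache` is a hypothetical stand-in, not the CacheManager class; hashing the sorted parameters under a namespace prefix is one plausible way to prevent key collisions.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory cache with per-entry expiry (illustrative only)."""

    def __init__(self, default_ttl=3600, clock=time.monotonic):
        self._default_ttl = default_ttl
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    @staticmethod
    def make_key(namespace: str, **params) -> str:
        """Hash sorted parameters so equal inputs map to one key per namespace."""
        payload = json.dumps(params, sort_keys=True, default=str)
        return f"{namespace}:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"

    def set(self, key, value, ttl=None):
        ttl = self._default_ttl if ttl is None else ttl
        self._entries[key] = (self._clock() + ttl, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazy cleanup on read
            return default
        return value

    def cleanup(self) -> int:
        """Drop every expired entry and return how many were removed."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._entries.items() if now >= exp]
        for k in expired:
            del self._entries[k]
        return len(expired)
```

Swapping the dictionary for a Redis client would keep the same `get`/`set` surface, since Redis supports per-key expiry natively.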

WebSocketManager (core/websocket_manager.py)

Singleton pattern for WebSocket connection management
Job-specific connection tracking and broadcasting
Real-time progress updates and completion notifications
Heartbeat mechanism and stale connection cleanup
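
Job-specific tracking and broadcasting can be sketched as a mapping from job id to connections. `JobConnectionManager` is a hypothetical name, and the only thing assumed of a connection is an async `send_json` method, which Starlette's `WebSocket` provides.

```python
import asyncio
from collections import defaultdict


class JobConnectionManager:
    """Track connections per job and broadcast updates (illustrative only)."""

    def __init__(self):
        self._connections = defaultdict(set)  # job_id -> set of connections

    def connect(self, job_id, connection):
        self._connections[job_id].add(connection)

    def disconnect(self, job_id, connection):
        self._connections[job_id].discard(connection)
        if not self._connections[job_id]:
            del self._connections[job_id]  # no listeners left for this job

    async def broadcast(self, job_id, message: dict) -> int:
        """Send to every connection for one job; drop connections that fail."""
        delivered = 0
        # Iterate over a copy because failed sends mutate the set
        for connection in list(self._connections.get(job_id, ())):
            try:
                await connection.send_json(message)
                delivered += 1
            except Exception:
                self.disconnect(job_id, connection)  # treat as stale
        return delivered
```

Removing a connection on the first failed send is a simple form of stale cleanup; a heartbeat would catch idle connections that never receive a broadcast.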

NotificationService (services/notification_service.py)

Multi-type notifications (completion, error, progress, system)
Notification history and statistics tracking
Architecture ready for email/webhook integration
Configurable filtering and management

API Layer

Pipeline API (api/pipeline.py)

Complete pipeline management endpoints
Process video with configuration options
Status monitoring and job history
Pipeline cancellation and cleanup
Health checks and system statistics

Summarization API (api/summarization.py)

Direct AI summarization endpoints
Sync and async processing options
Cost estimation and validation
Background job management

Dual Transcript API (api/transcripts.py) ✅ NEW

POST /api/transcripts/dual/extract - Start dual transcript extraction
GET /api/transcripts/dual/jobs/{job_id} - Monitor extraction progress
POST /api/transcripts/dual/estimate - Get processing time estimates
GET /api/transcripts/dual/compare/{video_id} - Force comparison analysis
Background job processing with real-time progress updates
YouTube captions, Whisper AI, or both sources simultaneously

Development Patterns

Service Dependency Injection

def get_summary_pipeline(
    video_service: VideoService = Depends(get_video_service),
    transcript_service: TranscriptService = Depends(get_transcript_service),
    ai_service: AnthropicSummarizer = Depends(get_ai_service),
    cache_manager: CacheManager = Depends(get_cache_manager),
    notification_service: NotificationService = Depends(get_notification_service)
) -> SummaryPipeline:
    return SummaryPipeline(...)

Database Registry Pattern (CRITICAL ARCHITECTURE)

Problem Solved: SQLAlchemy "Multiple classes found for path" relationship resolver errors

# Always use the registry for model creation
from backend.core.database_registry import registry
from backend.models.base import Model

# Models inherit from Model (which uses registry.Base)
class User(Model):
    __tablename__ = "users"
    # Use fully qualified relationship paths to prevent conflicts
    summaries = relationship("backend.models.summary.Summary", back_populates="user")

# Registry ensures single Base instance and safe model registration
registry.create_all_tables(engine)  # For table creation
registry.register_model(ModelClass)  # Automatic via BaseModel mixin

Key Benefits:

Prevents SQLAlchemy table redefinition conflicts
Thread-safe singleton pattern
Automatic model registration and deduplication
Production-ready architecture
Clean testing with registry reset capabilities
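
The thread-safe singleton and deduplication ideas can be shown without SQLAlchemy. This is a sketch of the concept only, not the contents of core/database_registry.py: `ModelRegistry` and its `reset` method are hypothetical names.

```python
import threading


class ModelRegistry:
    """Process-wide registry keyed by table name (illustrative only)."""

    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so concurrent first calls build one instance
        if cls._instance is None:
            with cls._instance_lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._models = {}
                    instance._lock = threading.RLock()
                    cls._instance = instance
        return cls._instance

    def register_model(self, model_cls):
        """Register by table name; a second class for the same table is ignored."""
        with self._lock:
            return self._models.setdefault(model_cls.__tablename__, model_cls)

    def reset(self):
        """Clear registrations, for example between tests."""
        with self._lock:
            self._models.clear()
```

Returning the already registered class from `register_model` means repeated imports of a model module resolve to one class per table.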

Authentication Pattern

# Protected endpoint with user dependency
@router.post("/api/protected")
async def protected_endpoint(
    current_user: User = Depends(get_current_user),
    db: Session = Depends(get_db)
):
    return {"user_id": current_user.id}

# Authenticate a user and issue an access token (run inside an async function)
from backend.services.auth_service import AuthService
auth_service = AuthService()
user = await auth_service.authenticate_user(email, password)
tokens = auth_service.create_access_token(user)

Async Pipeline Pattern

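import asyncio
import uuid
from typing import Optional

# Method of SummaryPipeline; config falls back to defaults when None
async def process_video(self, video_url: str, config: Optional[PipelineConfig] = None) -> str: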
    job_id = str(uuid.uuid4())
    result = PipelineResult(job_id=job_id, video_url=video_url, ...)
    self.active_jobs[job_id] = result
    
    # Start background processing
    asyncio.create_task(self._execute_pipeline(job_id, config))
    return job_id
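
One detail worth keeping in mind with this pattern: the event loop holds only weak references to tasks, so the asyncio documentation recommends keeping a reference to the result of `asyncio.create_task`. The sketch below shows one way to do that. `JobRunner`, `_tasks`, and the string statuses are hypothetical and much simpler than the real pipeline.

```python
import asyncio
import uuid


class JobRunner:
    """Fire-and-forget job launcher that keeps its tasks alive (illustrative)."""

    def __init__(self):
        self.active_jobs = {}
        self._tasks = set()

    async def _execute(self, job_id: str):
        await asyncio.sleep(0)  # stands in for the real pipeline stages
        self.active_jobs[job_id] = "completed"

    def start(self) -> str:
        """Must be called from within a running event loop."""
        job_id = str(uuid.uuid4())
        self.active_jobs[job_id] = "processing"
        task = asyncio.create_task(self._execute(job_id))
        self._tasks.add(task)                        # keep a strong reference
        task.add_done_callback(self._tasks.discard)  # drop it once finished
        return job_id
```

The caller still gets the job id back immediately, and the task set doubles as a handle for cancelling outstanding work at shutdown.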

Error Handling Pattern

try:
    result = await self.ai_service.generate_summary(request)
except AIServiceError as e:
    raise HTTPException(status_code=500, detail={
        "error": "AI service error",
        "message": e.message,
        "code": e.error_code
    })

Configuration and Environment

Required Environment Variables

# Core Services
ANTHROPIC_API_KEY=sk-ant-...           # Required for AI summarization
YOUTUBE_API_KEY=AIza...                # YouTube Data API v3 key
GOOGLE_API_KEY=AIza...                 # Google/Gemini API key

# Feature Flags
USE_MOCK_SERVICES=false                # Disable mock services
ENABLE_REAL_TRANSCRIPT_EXTRACTION=true # Enable real transcript extraction

# Video Download & Storage Configuration
VIDEO_DOWNLOAD_STORAGE_PATH=./video_storage     # Base storage directory
VIDEO_DOWNLOAD_KEEP_AUDIO_FILES=true           # Save audio for re-transcription
VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS=30           # Audio retention period
VIDEO_DOWNLOAD_MAX_STORAGE_GB=10               # Storage limit

# Dual Transcript Configuration
# Whisper AI transcription requires additional dependencies:
# pip install torch openai-whisper pydub yt-dlp pytubefix
# Optional: CUDA for GPU acceleration

# Optional Configuration
DATABASE_URL=sqlite:///./data/app.db   # Database connection
REDIS_URL=redis://localhost:6379/0    # Cache backend (optional)
LOG_LEVEL=INFO                         # Logging level
CORS_ORIGINS=http://localhost:3000     # Frontend origins
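
A loader for a few of these variables might look like the sketch below. `Settings` and `load_settings` are hypothetical names, and treating `CORS_ORIGINS` as a comma-separated list is an assumption; the defaults are the values listed above.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings used in .env files."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    use_mock_services: bool
    database_url: str
    log_level: str
    cors_origins: tuple


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default) and fail fast on gaps."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is required for AI summarization")
    origins = env.get("CORS_ORIGINS", "http://localhost:3000")
    return Settings(
        anthropic_api_key=api_key,
        use_mock_services=_as_bool(env.get("USE_MOCK_SERVICES", "false")),
        database_url=env.get("DATABASE_URL", "sqlite:///./data/app.db"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        cors_origins=tuple(o.strip() for o in origins.split(",") if o.strip()),
    )
```

Failing at startup when the key is missing surfaces the "Anthropic API key not configured" problem before the first request arrives.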

Service Configuration

Services are configured through dependency injection with sensible defaults:

# Cost-optimized AI model
ai_service = AnthropicSummarizer(
    api_key=api_key, 
    model="claude-3-5-haiku-20241022"  # Cost-effective choice
)

# Cache with TTL
cache_manager = CacheManager(default_ttl=3600)  # 1 hour default

# Pipeline with retry logic
config = PipelineConfig(
    summary_length="standard",
    quality_threshold=0.7,
    max_retries=2,
    enable_notifications=True
)

Testing Strategy

Unit Tests

Location: tests/unit/
Coverage: 17+ tests for pipeline orchestration
Mocking: All external services mocked
Patterns: Async test patterns with proper fixtures
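
A mocked async test might look like the sketch below. `MiniPipeline` is a hypothetical stand-in for a real service, and the test drives the coroutine with `asyncio.run` so it needs no plugins; the project's own tests may use an async pytest plugin and fixtures instead.

```python
import asyncio
from unittest.mock import AsyncMock


class MiniPipeline:
    """Tiny stand-in for a service that depends on an AI client."""

    def __init__(self, ai_service):
        self.ai_service = ai_service

    async def summarize(self, transcript: str) -> str:
        if len(transcript.strip()) < 50:
            raise ValueError("Transcript too short")
        return await self.ai_service.generate_summary(transcript)


def test_summarize_uses_mocked_ai_service():
    # AsyncMock makes generate_summary awaitable without any network call
    ai_service = AsyncMock()
    ai_service.generate_summary.return_value = "short summary"
    pipeline = MiniPipeline(ai_service)

    result = asyncio.run(pipeline.summarize("x" * 60))

    assert result == "short summary"
    ai_service.generate_summary.assert_awaited_once_with("x" * 60)


def test_summarize_rejects_short_transcripts():
    ai_service = AsyncMock()
    pipeline = MiniPipeline(ai_service)
    try:
        asyncio.run(pipeline.summarize("too short"))
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    # The external service must not be called for invalid input
    ai_service.generate_summary.assert_not_awaited()
```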

Integration Tests

Location: tests/integration/
Coverage: 20+ API endpoint scenarios
Testing: Full FastAPI integration with TestClient
Validation: Request/response validation and error handling

Running Tests

# From backend directory
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/unit/ -v
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/integration/ -v

# With coverage
python3 -m pytest tests/ --cov=backend --cov-report=html

Common Development Tasks

Adding New API Endpoints

Create endpoint in appropriate api/ module
Add business logic to services/ layer
Update main.py to include router
Add unit and integration tests
Update API documentation

Adding New Services

Create service class in services/
Implement proper async patterns
Add error handling with custom exceptions
Create dependency injection function
Add comprehensive unit tests

Debugging Pipeline Issues

# Enable detailed logging
import logging
logging.getLogger("backend").setLevel(logging.DEBUG)

# Check pipeline status
pipeline = get_summary_pipeline()
result = await pipeline.get_pipeline_result(job_id)
print(f"Status: {result.status}, Error: {result.error}")

# Monitor active jobs
active_jobs = pipeline.get_active_jobs()
print(f"Active jobs: {len(active_jobs)}")

Performance Optimization

Async Patterns

All I/O operations use async/await
Background tasks for long-running operations
Connection pooling for external services
Proper exception handling to prevent blocking

Caching Strategy

Pipeline results cached for 1 hour
Transcript and metadata cached separately
Cache invalidation on video updates
Redis-ready for distributed caching

Cost Optimization

Claude 3.5 Haiku for 80% cost savings vs GPT-4
Intelligent chunking prevents token waste
Cost estimation and limits
Quality scoring to avoid unnecessary retries
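
A pre-flight cost estimate is simple arithmetic over token counts. In the sketch below, `estimate_cost_usd` is a hypothetical helper, the four-characters-per-token ratio is a rough heuristic, and prices are passed in as parameters because current rates are not stated in this document.

```python
import math


def estimate_cost_usd(
    transcript: str,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    expected_output_tokens: int = 500,
    chars_per_token: float = 4.0,
) -> dict:
    """Rough estimate only; use the provider's token counter for exact figures.

    Prices are in USD per million tokens and must be supplied by the caller.
    """
    input_tokens = math.ceil(len(transcript) / chars_per_token)
    cost = (
        input_tokens * input_price_per_mtok
        + expected_output_tokens * output_price_per_mtok
    ) / 1_000_000
    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "estimated_cost_usd": round(cost, 6),
    }
```

Comparing `estimated_cost_usd` against a configured ceiling before calling the model is one way to enforce the cost limits mentioned above.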

Security Considerations

API Security

API keys supplied via environment variables
Input validation on all endpoints
Rate limiting (implement with Redis)
CORS configuration for frontend origins
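
Rate limiting can be prototyped in memory before moving the counters to Redis. The sliding-window limiter below is a sketch: `SlidingWindowRateLimiter` is a hypothetical name, and a single-process dictionary does not share state across workers the way a Redis-backed version would.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Discard timestamps that have left the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller would typically respond with HTTP 429
        hits.append(now)
        return True
```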

Error Sanitization

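# Never expose internal errors to clients
try:
    result = await pipeline.process_video(video_url)  # illustrative call
except Exception:
    # Log the full traceback server-side; send only a generic message
    logger.exception("Internal error")
    raise HTTPException(status_code=500, detail="Internal server error")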

Content Validation

# Validate transcript length
if len(request.transcript.strip()) < 50:
    raise HTTPException(status_code=400, detail="Transcript too short")

Monitoring and Observability

Health Checks

/api/health - Service health status
/api/stats - Pipeline processing statistics
WebSocket connection monitoring
Background job tracking

Logging

Structured logging with JSON format
Error tracking with context
Performance metrics logging
Request/response logging (without sensitive data)
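
Structured JSON logging with redaction can be built on the standard `logging` module. The formatter below is a sketch: `JsonFormatter`, the `context` attribute, and the list of sensitive keys are assumptions, not the project's actual logging setup.

```python
import json
import logging

SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}  # assumed list


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, redacting sensitive context keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context arrives via logger.info(..., extra={"context": {...}})
        context = getattr(record, "context", {})
        payload["context"] = {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
            for key, value in context.items()
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())` and pass per-request data through the `extra` argument of the logging call.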

Metrics

# Built-in metrics
stats = {
    "active_jobs": len(pipeline.get_active_jobs()),
    "cache_stats": await cache_manager.get_cache_stats(),
    "notification_stats": notification_service.get_notification_stats(),
    "websocket_connections": websocket_manager.get_stats()
}

Deployment Considerations

Production Configuration

Use Redis for caching and session storage
Configure proper logging (structured JSON)
Set up health checks and monitoring
Use environment-specific configuration
Enable HTTPS and security headers

Scaling Patterns

Stateless design enables horizontal scaling
Background job processing via task queue
Database connection pooling
Load balancer health checks

Database Migrations

# When adding database models
alembic revision --autogenerate -m "Add pipeline models"
alembic upgrade head

Troubleshooting

Common Issues

"Anthropic API key not configured"

Solution: Set ANTHROPIC_API_KEY environment variable

"Mock data returned instead of real transcripts"

Check: USE_MOCK_SERVICES=false in .env
Solution: Set ENABLE_REAL_TRANSCRIPT_EXTRACTION=true

"404 Not Found for /api/transcripts/extract"

Check: Import statements in main.py
Solution: Use from backend.api.transcripts import router (not transcripts_stub)

"Radio button selection not working"

Issue: Circular state updates in React
Solution: Use ref tracking in useTranscriptSelector hook

Pipeline jobs stuck in "processing" state

Check: pipeline.get_active_jobs() for zombie jobs
Solution: Restart service or call cleanup endpoint

WebSocket connections not receiving updates

Check: WebSocket connection in browser dev tools
Solution: Verify WebSocket manager singleton initialization

High AI costs

Check: Summary length configuration and transcript sizes
Solution: Implement cost limits and brief summary defaults

Transcript extraction failures

Check: IntelligentVideoDownloader fallback chain logs
Solution: Review which tier failed and check API keys/dependencies
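
For the stuck-job case above, a periodic check for zombie jobs could be as small as the sketch below. The job dictionary shape (`status`, `started_at`), the 30-minute threshold, and the `find_stale_jobs` helper are all hypothetical; the real `PipelineResult` fields may differ.

```python
import time


def find_stale_jobs(active_jobs: dict, max_age_seconds: float = 1800.0, now=None) -> list:
    """Return ids of jobs still 'processing' after max_age_seconds.

    active_jobs maps job_id -> {"status": str, "started_at": epoch seconds}.
    """
    now = time.time() if now is None else now
    return [
        job_id
        for job_id, job in active_jobs.items()
        if job["status"] == "processing" and now - job["started_at"] > max_age_seconds
    ]
```

The returned ids can then be marked as failed or passed to whatever cleanup routine the service exposes.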

Debug Commands

# Pipeline debugging
from backend.services.summary_pipeline import SummaryPipeline
pipeline = SummaryPipeline(...)
result = await pipeline.get_pipeline_result("job_id")

# Cache debugging
from backend.services.cache_manager import CacheManager
cache = CacheManager()
stats = await cache.get_cache_stats()

# WebSocket debugging
from backend.core.websocket_manager import websocket_manager
connections = websocket_manager.get_stats()

This backend is designed for production use with comprehensive error handling, monitoring, and scalability patterns. All services follow async patterns and clean architecture principles.
