youtube-summarizer/backend/CLAUDE.md

# CLAUDE.md - YouTube Summarizer Backend

This file provides guidance to Claude Code when working with the YouTube Summarizer backend services.

## Backend Architecture Overview

The backend is built with FastAPI and follows a clean architecture pattern with clear separation of concerns:

```
backend/
├── api/                    # API endpoints and request/response models
├── services/              # Business logic and external integrations
├── models/                # Data models and database schemas
├── core/                  # Core utilities, exceptions, and configurations
└── tests/                 # Unit and integration tests
```

## Key Services and Components

### Authentication System (Story 3.1 - COMPLETE ✅)

**Architecture**: Production-ready JWT-based authentication with Database Registry singleton pattern

**AuthService** (`services/auth_service.py`)
- JWT token generation and validation (access + refresh tokens)
- Password hashing with bcrypt and strength validation
- User registration with email verification workflow
- Password reset with secure token generation
- Session management and token refresh logic

**Database Registry Pattern** (`core/database_registry.py`)
- **CRITICAL FIX**: Resolves SQLAlchemy "Multiple classes found for path" errors
- Singleton pattern ensuring single Base instance across application
- Automatic model registration preventing table redefinition conflicts
- Thread-safe model management with registry cleanup for testing
- Production-ready architecture preventing relationship resolver issues

**Authentication Models** (`models/user.py`)
- User, RefreshToken, APIKey, EmailVerificationToken, PasswordResetToken
- Fully qualified relationship paths preventing SQLAlchemy conflicts
- String UUID fields for SQLite compatibility
- Proper model inheritance using Database Registry Base

**Authentication API** (`api/auth.py`)
- Complete endpoint coverage: register, login, logout, refresh, verify email, reset password
- Comprehensive input validation and error handling
- Protected route dependencies and middleware
- Async/await patterns throughout

### Dual Transcript Services ✅ **NEW**

**DualTranscriptService** (`services/dual_transcript_service.py`)
- Orchestrates between YouTube captions and Whisper AI transcription
- Supports three extraction modes: `youtube`, `whisper`, `both`
- Parallel processing for comparison mode with real-time progress updates
- Advanced quality comparison with punctuation/capitalization analysis
- Processing time estimation and intelligent recommendation engine
- Seamless integration with existing TranscriptService
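
The three extraction modes can be sketched as follows. This is a minimal illustration, not the contents of `dual_transcript_service.py`: `fetch_youtube_captions`, `transcribe_with_whisper`, and `extract` are hypothetical stand-ins, and the only claim is that `both` mode can run the two sources concurrently with `asyncio.gather`.

```python
import asyncio


async def fetch_youtube_captions(video_id: str) -> str:
    # Hypothetical stand-in for the YouTube captions lookup.
    await asyncio.sleep(0.01)
    return f"captions for {video_id}"


async def transcribe_with_whisper(video_id: str) -> str:
    # Hypothetical stand-in for Whisper transcription.
    await asyncio.sleep(0.02)
    return f"whisper transcript for {video_id}"


async def extract(video_id: str, mode: str) -> dict:
    """Return transcripts keyed by source for the requested mode."""
    if mode == "youtube":
        return {"youtube": await fetch_youtube_captions(video_id)}
    if mode == "whisper":
        return {"whisper": await transcribe_with_whisper(video_id)}
    if mode == "both":
        # Run both sources concurrently. With return_exceptions=True a failure
        # in one source is returned as an exception object instead of
        # discarding the other source's result.
        youtube, whisper = await asyncio.gather(
            fetch_youtube_captions(video_id),
            transcribe_with_whisper(video_id),
            return_exceptions=True,
        )
        return {"youtube": youtube, "whisper": whisper}
    raise ValueError(f"unknown mode: {mode!r}")
```

A caller in comparison mode would check each value with `isinstance(value, Exception)` before comparing quality.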

**WhisperTranscriptService** (`services/whisper_transcript_service.py`)
- OpenAI Whisper integration for high-quality YouTube video transcription
- Async audio download via yt-dlp with automatic cleanup
- Intelligent chunking for long videos (30-minute segments with overlap)
- Device detection (CPU/CUDA) for optimal performance
- Quality and confidence scoring algorithms
- Production-ready error handling and resource management
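
As a rough sketch of the chunking described above, the function below computes overlapping segment boundaries for a long audio file. The 30-minute chunk length comes from this document; the 30-second overlap and the name `chunk_boundaries` are assumptions made for illustration.

```python
def chunk_boundaries(duration_s: float, chunk_s: float = 30 * 60, overlap_s: float = 30.0):
    """Return (start, end) pairs in seconds that cover the whole audio.

    Every chunk after the first starts `overlap_s` seconds before the previous
    chunk ended, so words cut at a boundary appear in both chunks.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    boundaries = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        boundaries.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s
    return boundaries
```

Transcripts of adjacent chunks then need to be merged with the duplicated overlap removed.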

### Core Pipeline Services

**IntelligentVideoDownloader** (`services/intelligent_video_downloader.py`) ✅ **NEW**
- **9-Tier Transcript Extraction Fallback Chain**:
  1. YouTube Transcript API - Primary method using official API
  2. Auto-generated Captions - YouTube's automatic captions fallback
  3. Whisper AI Transcription - OpenAI Whisper for high-quality audio transcription
  4. PyTubeFix Downloader - Alternative YouTube library
  5. YT-DLP Downloader - Robust video/audio extraction tool
  6. Playwright Browser - Browser automation for JavaScript-rendered content
  7. External Tools - 4K Video Downloader CLI integration
  8. Web Services - Third-party transcript API services
  9. Transcript-Only - Metadata without full transcript as final fallback
- **Audio Retention System** for re-transcription capability
- **Intelligent method selection** based on success rates
- **Comprehensive error handling** with detailed logging
- **Performance telemetry** and health monitoring
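
The fallback chain can be pictured as an ordered list of extractors tried one after another. The sketch below shows only that control flow; `extract_with_fallback`, `ExtractionError`, and the three sample extractors are hypothetical and do not reflect the real service's interfaces.

```python
from typing import Callable, Optional


class ExtractionError(Exception):
    """Raised when every extraction method has failed."""


def extract_with_fallback(
    video_id: str,
    methods: list[tuple[str, Callable[[str], Optional[str]]]],
) -> tuple[str, str]:
    """Try each (name, extractor) in order and return (name, transcript)."""
    failures = []
    for name, extractor in methods:
        try:
            transcript = extractor(video_id)
        except Exception as exc:  # one failing tier must not stop the chain
            failures.append(f"{name}: {exc}")
            continue
        if transcript:
            return name, transcript
        failures.append(f"{name}: empty result")
    raise ExtractionError("; ".join(failures))


# Hypothetical extractors standing in for the first three tiers.
def transcript_api(video_id: str) -> str:
    raise RuntimeError("transcripts disabled")


def auto_captions(video_id: str) -> Optional[str]:
    return None


def whisper(video_id: str) -> str:
    return f"transcribed {video_id}"


chain = [
    ("youtube_api", transcript_api),
    ("auto_captions", auto_captions),
    ("whisper", whisper),
]
method, transcript = extract_with_fallback("abc", chain)
```

Selection based on success rates could be layered on top by reordering `methods` before the call.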

**SummaryPipeline** (`services/summary_pipeline.py`)
- Main orchestration service for end-to-end video processing
- 7-stage async pipeline: URL validation → metadata extraction → transcript → analysis → summarization → quality validation → completion
- Integrates with IntelligentVideoDownloader for robust transcript extraction
- Intelligent content analysis and configuration optimization
- Real-time progress tracking via WebSocket
- Automatic retry logic with exponential backoff
- Quality scoring and validation system
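
Retry with exponential backoff, as mentioned above, might look like the following. This is a generic sketch under assumed names (`retry_with_backoff`, `base_delay`) and an assumed jitter policy, not the pipeline's actual retry code.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_retries: int = 2, base_delay: float = 1.0):
    """Await `operation()`, retrying failed attempts with growing delays.

    The delay before retry number `attempt + 1` is at most
    base_delay * 2**attempt seconds.
    """
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Jitter spreads out retries from jobs that failed together.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
            await asyncio.sleep(delay)
```

With `max_retries=2` the operation is attempted at most three times.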

**AnthropicSummarizer** (`services/anthropic_summarizer.py`)
- AI service integration using Claude 3.5 Haiku for cost efficiency
- Structured JSON output with fallback text parsing
- Token counting and cost estimation
- Intelligent chunking for long transcripts (up to the model's 200k-token context window)
- Comprehensive error handling and retry logic
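
"Structured JSON output with fallback text parsing" can be read as a three-step parse: strict JSON, JSON embedded in prose, then plain text. The sketch below assumes the response fields are named `summary` and `key_points`; those names and `parse_summary_response` are illustrative, not taken from the service.

```python
import json
import re


def parse_summary_response(text: str) -> dict:
    """Parse a model reply as JSON, falling back to plain-text extraction."""
    # 1. The reply is already a valid JSON object.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # 2. A JSON object is embedded in surrounding prose or a code fence.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
    # 3. Plain text: bullet lines become key points, the rest is the summary.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    key_points = [line.lstrip("-*• ").strip() for line in lines if line[0] in "-*•"]
    summary = " ".join(line for line in lines if line[0] not in "-*•")
    return {"summary": summary, "key_points": key_points}
```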

**CacheManager** (`services/cache_manager.py`)
- Multi-level caching for pipeline results, transcripts, and metadata
- TTL-based expiration with automatic cleanup
- Redis-ready architecture for production scaling
- Configurable cache keys with collision prevention
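
To make TTL expiration and collision-resistant keys concrete, here is a small in-memory sketch. `TTLCache` and `make_key` are hypothetical names; the real `CacheManager` interface may differ, and a Redis backend would delegate expiry to the server.

```python
import hashlib
import time


class TTLCache:
    """In-memory cache whose entries expire after a time-to-live in seconds."""

    def __init__(self, default_ttl: float = 3600, clock=time.monotonic):
        self._default_ttl = default_ttl
        self._clock = clock  # injectable so tests can control time
        self._entries = {}  # key -> (expires_at, value)

    @staticmethod
    def make_key(namespace: str, *parts: str) -> str:
        # Joining with a separator before hashing keeps ("a", "bc") and
        # ("ab", "c") from producing the same key.
        digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
        return f"{namespace}:{digest}"

    def set(self, key, value, ttl=None):
        lifetime = self._default_ttl if ttl is None else ttl
        self._entries[key] = (self._clock() + lifetime, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazy cleanup on read
            return default
        return value
```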

**WebSocketManager** (`core/websocket_manager.py`)
- Singleton pattern for WebSocket connection management
- Job-specific connection tracking and broadcasting
- Real-time progress updates and completion notifications
- Heartbeat mechanism and stale connection cleanup
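
Job-specific tracking and broadcasting can be sketched as a mapping from job ID to a set of connections. `JobConnections` and `FakeConnection` are illustrative names; the sketch assumes only that each connection exposes an async `send_json` method, as Starlette's `WebSocket` does.

```python
import asyncio
from collections import defaultdict


class JobConnections:
    """Tracks which connections are subscribed to which job (illustrative)."""

    def __init__(self):
        self._by_job = defaultdict(set)

    def connect(self, job_id: str, connection) -> None:
        self._by_job[job_id].add(connection)

    def disconnect(self, job_id: str, connection) -> None:
        self._by_job[job_id].discard(connection)
        if not self._by_job[job_id]:
            del self._by_job[job_id]

    async def broadcast(self, job_id: str, message: dict) -> int:
        """Send to every subscriber of a job and drop connections that fail."""
        delivered = 0
        # Iterate over a copy because failed connections are removed in the loop.
        for connection in list(self._by_job.get(job_id, ())):
            try:
                await connection.send_json(message)
                delivered += 1
            except Exception:
                self.disconnect(job_id, connection)
        return delivered


class FakeConnection:
    """Test double exposing the `send_json` coroutine the sketch relies on."""

    def __init__(self, fail: bool = False):
        self.fail = fail
        self.received = []

    async def send_json(self, message: dict) -> None:
        if self.fail:
            raise ConnectionError("socket closed")
        self.received.append(message)
```

Removing a connection on its first failed send is one simple form of stale connection cleanup; a heartbeat would catch idle connections that never receive a broadcast.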

**NotificationService** (`services/notification_service.py`)
- Multi-type notifications (completion, error, progress, system)
- Notification history and statistics tracking
- Email/webhook integration ready architecture
- Configurable filtering and management

### API Layer

**Pipeline API** (`api/pipeline.py`)
- Complete pipeline management endpoints
- Process video with configuration options
- Status monitoring and job history
- Pipeline cancellation and cleanup
- Health checks and system statistics

**Summarization API** (`api/summarization.py`)
- Direct AI summarization endpoints
- Sync and async processing options
- Cost estimation and validation
- Background job management

**Dual Transcript API** (`api/transcripts.py`) ✅ **NEW**
- `POST /api/transcripts/dual/extract` - Start dual transcript extraction
- `GET /api/transcripts/dual/jobs/{job_id}` - Monitor extraction progress
- `POST /api/transcripts/dual/estimate` - Get processing time estimates
- `GET /api/transcripts/dual/compare/{video_id}` - Force comparison analysis
- Background job processing with real-time progress updates
- YouTube captions, Whisper AI, or both sources simultaneously

## Development Patterns

### Service Dependency Injection

```python
def get_summary_pipeline(
    video_service: VideoService = Depends(get_video_service),
    transcript_service: TranscriptService = Depends(get_transcript_service),
    ai_service: AnthropicSummarizer = Depends(get_ai_service),
    cache_manager: CacheManager = Depends(get_cache_manager),
    notification_service: NotificationService = Depends(get_notification_service)
) -> SummaryPipeline:
    return SummaryPipeline(...)
```

### Database Registry Pattern (CRITICAL ARCHITECTURE)

**Problem Solved**: SQLAlchemy "Multiple classes found for path" relationship resolver errors

```python
# Always use the registry for model creation
from sqlalchemy.orm import relationship

from backend.core.database_registry import registry
from backend.models.base import Model

# Models inherit from Model (which uses registry.Base)
class User(Model):
    __tablename__ = "users"
    # Use fully qualified relationship paths to prevent conflicts
    summaries = relationship("backend.models.summary.Summary", back_populates="user")

# Registry ensures single Base instance and safe model registration
registry.create_all_tables(engine)  # For table creation
registry.register_model(ModelClass)  # Automatic via BaseModel mixin
```

**Key Benefits**:
- Prevents SQLAlchemy table redefinition conflicts
- Thread-safe singleton pattern
- Automatic model registration and deduplication
- Production-ready architecture
- Clean testing with registry reset capabilities
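
For readers unfamiliar with the pattern, the following stdlib-only sketch shows the general shape of a thread-safe singleton registry with deduplication and reset. It is not the contents of `core/database_registry.py`: `ModelRegistry`, `reset`, and the sample classes are hypothetical, and the real registry wraps a SQLAlchemy declarative `Base`.

```python
import threading


class ModelRegistry:
    """Process-wide registry holding one class per table name (illustrative)."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so concurrent callers share one instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._models = {}
                    instance._models_lock = threading.RLock()
                    cls._instance = instance
        return cls._instance

    def register_model(self, model_cls):
        """Register a model once; repeat registrations return the first class."""
        table = model_cls.__tablename__
        with self._models_lock:
            return self._models.setdefault(table, model_cls)

    def reset(self):
        """Clear registrations, for example between test cases."""
        with self._models_lock:
            self._models.clear()


class User:
    __tablename__ = "users"


class DuplicateUser:
    __tablename__ = "users"


registry = ModelRegistry()
registry.register_model(User)
```

Registering `DuplicateUser` afterwards returns `User`, which is the deduplication behaviour that avoids a second definition of the same table.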

### Authentication Pattern

```python
# Protected endpoint with user dependency
@router.post("/api/protected")
async def protected_endpoint(
    current_user: User = Depends(get_current_user),
    db: Session = Depends(get_db)
):
    return {"user_id": current_user.id}

# Credential check and access-token issuance (run inside an async function)
from backend.services.auth_service import AuthService
auth_service = AuthService()
user = await auth_service.authenticate_user(email, password)
access_token = auth_service.create_access_token(user)
```

### Async Pipeline Pattern

```python
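import asyncio
import uuid
from typing import Optional

async def process_video(self, video_url: str, config: Optional[PipelineConfig] = None) -> str:
    job_id = str(uuid.uuid4())
    result = PipelineResult(job_id=job_id, video_url=video_url, ...)
    self.active_jobs[job_id] = result

    # Start background processing. Keep a reference to the task: the event loop
    # holds only weak references, so an unreferenced task can be garbage-collected.
    task = asyncio.create_task(self._execute_pipeline(job_id, config))
    self._background_tasks.add(task)  # assumed attribute: a set created in __init__
    task.add_done_callback(self._background_tasks.discard)
    return job_id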
```

### Error Handling Pattern

```python
try:
    result = await self.ai_service.generate_summary(request)
except AIServiceError as e:
    raise HTTPException(status_code=500, detail={
        "error": "AI service error",
        "message": e.message,
        "code": e.error_code
    }) from e  # chain the original exception for server-side tracebacks
```

## Configuration and Environment

### Required Environment Variables

```bash
# Core Services
ANTHROPIC_API_KEY=sk-ant-...           # Required for AI summarization
YOUTUBE_API_KEY=AIza...                # YouTube Data API v3 key
GOOGLE_API_KEY=AIza...                 # Google/Gemini API key

# Feature Flags
USE_MOCK_SERVICES=false                # Disable mock services
ENABLE_REAL_TRANSCRIPT_EXTRACTION=true # Enable real transcript extraction

# Video Download & Storage Configuration
VIDEO_DOWNLOAD_STORAGE_PATH=./video_storage     # Base storage directory
VIDEO_DOWNLOAD_KEEP_AUDIO_FILES=true           # Save audio for re-transcription
VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS=30           # Audio retention period
VIDEO_DOWNLOAD_MAX_STORAGE_GB=10               # Storage limit

# Dual Transcript Configuration
# Whisper AI transcription requires additional dependencies:
# pip install torch openai-whisper pydub yt-dlp pytubefix
# (the PyPI package for OpenAI Whisper is "openai-whisper"; it also needs ffmpeg on PATH)
# Optional: CUDA for GPU acceleration

# Optional Configuration
DATABASE_URL=sqlite:///./data/app.db   # Database connection
REDIS_URL=redis://localhost:6379/0    # Cache backend (optional)
LOG_LEVEL=INFO                         # Logging level
CORS_ORIGINS=http://localhost:3000     # Frontend origins
```

### Service Configuration

Services are configured through dependency injection with sensible defaults:

```python
# Cost-optimized AI model
ai_service = AnthropicSummarizer(
    api_key=api_key,
    model="claude-3-5-haiku-20241022"  # Cost-effective choice
)

# Cache with TTL
cache_manager = CacheManager(default_ttl=3600)  # 1 hour default

# Pipeline with retry logic
config = PipelineConfig(
    summary_length="standard",
    quality_threshold=0.7,
    max_retries=2,
    enable_notifications=True
)
```

## Testing Strategy

### Unit Tests
- **Location**: `tests/unit/`
- **Coverage**: 17+ tests for pipeline orchestration
- **Mocking**: All external services mocked
- **Patterns**: Async test patterns with proper fixtures

### Integration Tests
- **Location**: `tests/integration/`
- **Coverage**: 20+ API endpoint scenarios
- **Testing**: Full FastAPI integration with TestClient
- **Validation**: Request/response validation and error handling

### Running Tests

```bash
# From backend directory
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/unit/ -v
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/integration/ -v

# With coverage
python3 -m pytest tests/ --cov=backend --cov-report=html
```

## Common Development Tasks

### Adding New API Endpoints

1. Create endpoint in appropriate `api/` module
2. Add business logic to `services/` layer
3. Update `main.py` to include router
4. Add unit and integration tests
5. Update API documentation

### Adding New Services

1. Create service class in `services/`
2. Implement proper async patterns
3. Add error handling with custom exceptions
4. Create dependency injection function
5. Add comprehensive unit tests

### Debugging Pipeline Issues

```python
# Enable detailed logging
import logging
logging.getLogger("backend").setLevel(logging.DEBUG)

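# Check pipeline status (run inside an async context).
# Calling get_summary_pipeline() directly leaves its Depends(...) defaults
# unresolved, so outside a request handler build the pipeline explicitly.
pipeline = SummaryPipeline(...)
result = await pipeline.get_pipeline_result(job_id)
print(f"Status: {result.status}, Error: {result.error}")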

# Monitor active jobs
active_jobs = pipeline.get_active_jobs()
print(f"Active jobs: {len(active_jobs)}")
```

## Performance Optimization

### Async Patterns
- All I/O operations use async/await
- Background tasks for long-running operations
- Connection pooling for external services
- Proper exception handling to prevent blocking

### Caching Strategy
- Pipeline results cached for 1 hour
- Transcript and metadata cached separately
- Cache invalidation on video updates
- Redis-ready for distributed caching

### Cost Optimization
- Claude 3.5 Haiku as the default model (an estimated 80% cost savings vs GPT-4)
- Intelligent chunking prevents token waste
- Cost estimation and limits
- Quality scoring to avoid unnecessary retries
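
Cost estimation and limits could be approximated as below. Everything here is an assumption for illustration: the function names, the four-characters-per-token heuristic, and the prices, which the caller must supply because this document does not state any.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count; about 4 characters per token is a common rule of thumb."""
    return max(1, round(len(text) / chars_per_token))


def estimate_cost_usd(
    transcript: str,
    expected_output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate request cost from caller-supplied prices per million tokens."""
    input_tokens = estimate_tokens(transcript)
    return (
        input_tokens * input_price_per_mtok
        + expected_output_tokens * output_price_per_mtok
    ) / 1_000_000


def within_budget(cost_usd: float, limit_usd: float) -> bool:
    """Return True when the estimated cost does not exceed the limit."""
    return cost_usd <= limit_usd
```

A pipeline could call `within_budget` before submitting a request and fall back to a shorter summary length when the estimate exceeds the limit.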

## Security Considerations

### API Security
- API keys supplied via environment variables
- Input validation on all endpoints
- Rate limiting (to be implemented with Redis)
- CORS configuration for frontend origins
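
As a starting point for rate limiting, the sketch below implements a sliding-window limiter in process memory. It is a stand-in, not a Redis implementation: `SlidingWindowRateLimiter` is a hypothetical name, and a multi-instance deployment would need the counters in shared storage such as Redis.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per `window_s` seconds per client."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic):
        self._limit = limit
        self._window_s = window_s
        self._clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self._window_s:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

An endpoint dependency would call `allow(client_id)` and raise `HTTPException(status_code=429)` when it returns `False`.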

### Error Sanitization
```python
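# Never expose internal errors to clients
try:
    job_id = await pipeline.process_video(video_url)
except HTTPException:
    raise  # deliberate HTTP errors pass through unchanged
except Exception:
    logger.exception("Internal error")  # full traceback stays in server logs
    raise HTTPException(status_code=500, detail="Internal server error")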
```

### Content Validation
```python
# Validate transcript length
if len(request.transcript.strip()) < 50:
    raise HTTPException(status_code=400, detail="Transcript too short")
```

## Monitoring and Observability

### Health Checks
- `/api/health` - Service health status
- `/api/stats` - Pipeline processing statistics
- WebSocket connection monitoring
- Background job tracking

### Logging
- Structured logging with JSON format
- Error tracking with context
- Performance metrics logging
- Request/response logging (without sensitive data)
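
Structured JSON logging can be set up with the standard library alone. The formatter below is a minimal sketch; `JsonFormatter`, `configure_json_logging`, and the convention of passing context through `extra={"context": {...}}` are assumptions, not the backend's actual logging setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context passed via extra={"context": {...}} is copied into the output.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Replace the root logger's handlers with one JSON stream handler."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=level, handlers=[handler], force=True)
```

After `configure_json_logging()`, a call such as `logger.error("job %s failed", job_id, extra={"context": {"job_id": job_id}})` emits one JSON line with the context attached.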

### Metrics
```python
# Built-in metrics
stats = {
    "active_jobs": len(pipeline.get_active_jobs()),
    "cache_stats": await cache_manager.get_cache_stats(),
    "notification_stats": notification_service.get_notification_stats(),
    "websocket_connections": websocket_manager.get_stats()
}
```

## Deployment Considerations

### Production Configuration
- Use Redis for caching and session storage
- Configure proper logging (structured JSON)
- Set up health checks and monitoring
- Use environment-specific configuration
- Enable HTTPS and security headers

### Scaling Patterns
- Stateless design enables horizontal scaling
- Background job processing via task queue
- Database connection pooling
- Load balancer health checks

### Database Migrations
```bash
# When adding database models
alembic revision --autogenerate -m "Add pipeline models"
alembic upgrade head
```

## Troubleshooting

### Common Issues

**"Anthropic API key not configured"**
- Solution: Set `ANTHROPIC_API_KEY` environment variable

**"Mock data returned instead of real transcripts"**
- Check: `USE_MOCK_SERVICES=false` in .env
- Solution: Set `ENABLE_REAL_TRANSCRIPT_EXTRACTION=true`

**"404 Not Found for /api/transcripts/extract"**
- Check: Import statements in main.py
- Solution: Use `from backend.api.transcripts import router` (not transcripts_stub)

**"Radio button selection not working"**
- Issue: Circular state updates in React
- Solution: Use ref tracking in useTranscriptSelector hook

**Pipeline jobs stuck in "processing" state**
- Check: `pipeline.get_active_jobs()` for zombie jobs
- Solution: Restart service or call cleanup endpoint

**WebSocket connections not receiving updates**
- Check: WebSocket connection in browser dev tools
- Solution: Verify WebSocket manager singleton initialization

**High AI costs**
- Check: Summary length configuration and transcript sizes
- Solution: Implement cost limits and brief summary defaults

**Transcript extraction failures**
- Check: IntelligentVideoDownloader fallback chain logs
- Solution: Review which tier failed and check API keys/dependencies

### Debug Commands

```python
# Pipeline debugging
from backend.services.summary_pipeline import SummaryPipeline
pipeline = SummaryPipeline(...)
result = await pipeline.get_pipeline_result("job_id")

# Cache debugging
from backend.services.cache_manager import CacheManager
cache = CacheManager()
stats = await cache.get_cache_stats()

# WebSocket debugging
from backend.core.websocket_manager import websocket_manager
connections = websocket_manager.get_stats()
```

This backend is designed for production use with comprehensive error handling, monitoring, and scalability patterns. All services follow async patterns and clean architecture principles.