youtube-summarizer/CLAUDE.md

# CLAUDE.md - YouTube Summarizer

This file provides guidance to Claude Code (claude.ai/code) when working with the YouTube Summarizer project.

## Project Overview

An AI-powered web application that automatically extracts, transcribes, and summarizes YouTube videos. The application supports multiple AI models (OpenAI, Anthropic, DeepSeek), provides various export formats, and includes intelligent caching for efficiency.

**Status**: Development Phase - 12 tasks (0% complete) managed via Task Master

## Quick Start Commands

```bash
# Development
cd apps/youtube-summarizer
source venv/bin/activate      # Activate virtual environment
python src/main.py            # Run the application (port 8082)

# Task Management
task-master list              # View all tasks
task-master next              # Get next task to work on
task-master show <id>         # View task details
task-master set-status --id=<id> --status=done  # Mark task complete

# Testing
pytest tests/ -v              # Run tests
pytest tests/ --cov=src       # Run with coverage

# Git Operations
git add .
git commit -m "feat: implement task X.Y"
git push origin main
```

## Architecture

```
YouTube Summarizer
├── API Layer (FastAPI)
│   ├── /api/summarize - Submit URL for summarization
│   ├── /api/summary/{id} - Retrieve summary
│   └── /api/export/{id} - Export in various formats
├── Service Layer
│   ├── YouTube Service - Transcript extraction
│   ├── AI Service - Summary generation
│   └── Cache Service - Performance optimization
└── Data Layer
    ├── SQLite/PostgreSQL - Summary storage
    └── Redis (optional) - Caching layer
```

## Development Workflow

### 1. Check Current Task
```bash
task-master next
task-master show <id>
```

### 2. Implement Feature
Follow the task details and implement in appropriate modules:
- API endpoints → `src/api/`
- Business logic → `src/services/`
- Utilities → `src/utils/`

### 3. Test Implementation
```bash
# Unit tests
pytest tests/unit/test_<module>.py -v

# Integration tests
pytest tests/integration/ -v

# Manual testing
python src/main.py
# Visit http://localhost:8082/docs for API testing
```

### 4. Update Task Status
```bash
# Log progress
task-master update-subtask --id=<id> --prompt="Implemented X, tested Y"

# Mark complete
task-master set-status --id=<id> --status=done
```

## Key Implementation Areas

### YouTube Integration (`src/services/youtube.py`)
```python
# Primary: youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

# Fallback: yt-dlp for metadata
import yt_dlp

# Extract video ID from various URL formats
# Handle multiple subtitle languages
# Implement retry logic for failures
```

### AI Summarization (`src/services/summarizer.py`)
```python
# Multi-model support
class SummarizerService:
    def __init__(self):
        self.models = {
            'openai': OpenAISummarizer(),
            'anthropic': AnthropicSummarizer(),
            'deepseek': DeepSeekSummarizer()
        }

    async def summarize(self, transcript, model='auto'):
        # Implement model selection logic
        # Handle token limits
        # Generate structured summaries
```

### Caching Strategy (`src/services/cache.py`)
```python
# Cache at multiple levels:
# 1. Transcript cache (by video_id)
# 2. Summary cache (by video_id + model + params)
# 3. Export cache (by summary_id + format)

# Use hash for cache keys
import hashlib

def get_cache_key(video_id: str, model: str, params: dict) -> str:
    key_data = f"{video_id}:{model}:{json.dumps(params, sort_keys=True)}"
    return hashlib.sha256(key_data.encode()).hexdigest()
```

## API Endpoint Patterns

### FastAPI Best Practices
```python
from fastapi import APIRouter, HTTPException, BackgroundTasks
from pydantic import BaseModel, HttpUrl

router = APIRouter(prefix="/api", tags=["summarization"])

class SummarizeRequest(BaseModel):
    url: HttpUrl
    model: str = "auto"
    options: dict = {}

@router.post("/summarize")
async def summarize_video(
    request: SummarizeRequest,
    background_tasks: BackgroundTasks
):
    # Validate URL
    # Extract video ID
    # Check cache
    # Queue for processing if needed
    # Return job ID for status checking
```

## Database Schema

```sql
-- Main summaries table
CREATE TABLE summaries (
    id UUID PRIMARY KEY,
    video_id VARCHAR(20) NOT NULL,
    video_title TEXT,
    video_url TEXT NOT NULL,
    transcript TEXT,
    summary TEXT,
    key_points JSONB,
    chapters JSONB,
    model_used VARCHAR(50),
    processing_time FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Cache for performance
CREATE INDEX idx_video_id ON summaries(video_id);
CREATE INDEX idx_created_at ON summaries(created_at);
```

## Error Handling

```python
class YouTubeError(Exception):
    """Base exception for YouTube-related errors"""
    pass

class TranscriptNotAvailable(YouTubeError):
    """Raised when transcript cannot be extracted"""
    pass

class AIServiceError(Exception):
    """Base exception for AI service errors"""
    pass

class TokenLimitExceeded(AIServiceError):
    """Raised when content exceeds model token limit"""
    pass

# Global error handler
@app.exception_handler(YouTubeError)
async def youtube_error_handler(request, exc):
    return JSONResponse(
        status_code=400,
        content={"error": str(exc), "type": "youtube_error"}
    )
```

## Environment Variables

```bash
# Required
OPENAI_API_KEY=sk-...        # At least one AI key required
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
DATABASE_URL=sqlite:///./data/youtube_summarizer.db
SECRET_KEY=your-secret-key

# Optional but recommended
YOUTUBE_API_KEY=AIza...       # For metadata and quota
REDIS_URL=redis://localhost:6379/0
RATE_LIMIT_PER_MINUTE=30
MAX_VIDEO_LENGTH_MINUTES=180
```

## Testing Guidelines

### Unit Test Structure
```python
# tests/unit/test_youtube_service.py
import pytest
from unittest.mock import Mock, patch
from src.services.youtube import YouTubeService

@pytest.fixture
def youtube_service():
    return YouTubeService()

def test_extract_video_id(youtube_service):
    urls = [
        ("https://youtube.com/watch?v=abc123", "abc123"),
        ("https://youtu.be/xyz789", "xyz789"),
        ("https://www.youtube.com/embed/qwe456", "qwe456")
    ]
    for url, expected_id in urls:
        assert youtube_service.extract_video_id(url) == expected_id
```

### Integration Test Pattern
```python
# tests/integration/test_api.py
from fastapi.testclient import TestClient
from src.main import app

client = TestClient(app)

def test_summarize_endpoint():
    response = client.post("/api/summarize", json={
        "url": "https://youtube.com/watch?v=test123",
        "model": "openai"
    })
    assert response.status_code == 200
    assert "job_id" in response.json()
```

## Performance Optimization

1. **Async Everything**: Use async/await for all I/O operations
2. **Background Tasks**: Process summaries in background
3. **Caching Layers**:
   - Memory cache for hot data
   - Database cache for persistence
   - CDN for static exports
4. **Rate Limiting**: Implement per-IP and per-user limits
5. **Token Optimization**:
   - Chunk long transcripts
   - Use map-reduce for summaries
   - Implement progressive summarization

## Security Considerations

1. **Input Validation**: Validate all YouTube URLs
2. **API Key Management**: Use environment variables, never commit keys
3. **Rate Limiting**: Prevent abuse and API exhaustion
4. **CORS Configuration**: Restrict to known domains in production
5. **SQL Injection Prevention**: Use parameterized queries
6. **XSS Protection**: Sanitize all user inputs
7. **Authentication**: Implement JWT for user sessions (Phase 3)

## Common Issues and Solutions

### Issue: Transcript Not Available
```python
# Solution: Implement fallback chain
try:
    transcript = await get_youtube_transcript(video_id)
except TranscriptNotAvailable:
    # Try auto-generated captions
    transcript = await get_auto_captions(video_id)
    if not transcript:
        # Use audio transcription as last resort
        transcript = await transcribe_audio(video_id)
```

### Issue: Token Limit Exceeded
```python
# Solution: Implement chunking
def chunk_transcript(transcript, max_tokens=3000):
    chunks = []
    current_chunk = []
    current_tokens = 0

    for segment in transcript:
        segment_tokens = count_tokens(segment)
        if current_tokens + segment_tokens > max_tokens:
            chunks.append(current_chunk)
            current_chunk = [segment]
            current_tokens = segment_tokens
        else:
            current_chunk.append(segment)
            current_tokens += segment_tokens

    if current_chunk:
        chunks.append(current_chunk)

    return chunks
```

### Issue: Rate Limiting
```python
# Solution: Implement exponential backoff
import asyncio
from typing import Optional

async def retry_with_backoff(
    func,
    max_retries: int = 3,
    initial_delay: float = 1.0
) -> Optional[Any]:
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return await func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # Exponential backoff
```

## Development Tips

1. **Start with Task 1**: Setup and environment configuration
2. **Test Early**: Write tests as you implement features
3. **Use Type Hints**: Improve code quality and IDE support
4. **Document APIs**: Use FastAPI's automatic documentation
5. **Log Everything**: Implement comprehensive logging for debugging
6. **Cache Aggressively**: Reduce API calls and improve response times
7. **Handle Errors Gracefully**: Provide helpful error messages to users

## Task Master Integration

This project uses Task Master for task management. Key commands:

```bash
# View current progress
task-master list

# Get detailed task info
task-master show 1

# Expand task into subtasks
task-master expand --id=1 --research

# Update task with progress
task-master update-task --id=1 --prompt="Completed API structure"

# Complete task
task-master set-status --id=1 --status=done
```

## Related Documentation

- [Project README](README.md) - General project information
- [AGENTS.md](AGENTS.md) - Development workflow and standards
- [Task Master Guide](.taskmaster/CLAUDE.md) - Task management details
- [API Documentation](http://localhost:8082/docs) - Interactive API docs (when running)

## Current Focus Areas (Based on Task Master)

1. **Task 1**: Setup Project Structure and Environment ⬅️ Start here
2. **Task 2**: Implement YouTube Transcript Extraction
3. **Task 3**: Develop AI Summary Generation Service
4. **Task 4**: Create Basic Frontend Interface
5. **Task 5**: Implement FastAPI Backend Endpoints

Remember to check task dependencies and complete prerequisites before moving to dependent tasks.

---

*This guide is specifically tailored for Claude Code development on the YouTube Summarizer project.*