youtube-summarizer/CLAUDE.md

11 KiB

CLAUDE.md - YouTube Summarizer

This file provides guidance to Claude Code (claude.ai/code) when working with the YouTube Summarizer project.

Project Overview

An AI-powered web application that automatically extracts, transcribes, and summarizes YouTube videos. The application supports multiple AI models (OpenAI, Anthropic, DeepSeek), provides various export formats, and includes intelligent caching for efficiency.

Status: Development Phase - 12 tasks (0% complete) managed via Task Master

Quick Start Commands

# Development
cd apps/youtube-summarizer
source venv/bin/activate      # Activate virtual environment
python src/main.py            # Run the application (port 8082)

# Task Management
task-master list              # View all tasks
task-master next              # Get next task to work on
task-master show <id>         # View task details
task-master set-status --id=<id> --status=done  # Mark task complete

# Testing
pytest tests/ -v              # Run tests
pytest tests/ --cov=src       # Run with coverage

# Git Operations
git add .
git commit -m "feat: implement task X.Y"
git push origin main

Architecture

YouTube Summarizer
├── API Layer (FastAPI)
│   ├── /api/summarize - Submit URL for summarization
│   ├── /api/summary/{id} - Retrieve summary
│   └── /api/export/{id} - Export in various formats
├── Service Layer
│   ├── YouTube Service - Transcript extraction
│   ├── AI Service - Summary generation
│   └── Cache Service - Performance optimization
└── Data Layer
    ├── SQLite/PostgreSQL - Summary storage
    └── Redis (optional) - Caching layer

Development Workflow

1. Check Current Task

task-master next
task-master show <id>

2. Implement Feature

Follow the task details and implement in appropriate modules:

  • API endpoints → src/api/
  • Business logic → src/services/
  • Utilities → src/utils/

3. Test Implementation

# Unit tests
pytest tests/unit/test_<module>.py -v

# Integration tests
pytest tests/integration/ -v

# Manual testing
python src/main.py
# Visit http://localhost:8082/docs for API testing

4. Update Task Status

# Log progress
task-master update-subtask --id=<id> --prompt="Implemented X, tested Y"

# Mark complete
task-master set-status --id=<id> --status=done

Key Implementation Areas

YouTube Integration (src/services/youtube.py)

# Primary: youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

# Fallback: yt-dlp for metadata
import yt_dlp

# Extract video ID from various URL formats
# Handle multiple subtitle languages
# Implement retry logic for failures

AI Summarization (src/services/summarizer.py)

# Multi-model support
class SummarizerService:
    def __init__(self):
        self.models = {
            'openai': OpenAISummarizer(),
            'anthropic': AnthropicSummarizer(),
            'deepseek': DeepSeekSummarizer()
        }
    
    async def summarize(self, transcript, model='auto'):
        # Implement model selection logic
        # Handle token limits
        # Generate structured summaries

Caching Strategy (src/services/cache.py)

# Cache at multiple levels:
# 1. Transcript cache (by video_id)
# 2. Summary cache (by video_id + model + params)
# 3. Export cache (by summary_id + format)

# Use hash for cache keys
import hashlib

def get_cache_key(video_id: str, model: str, params: dict) -> str:
    key_data = f"{video_id}:{model}:{json.dumps(params, sort_keys=True)}"
    return hashlib.sha256(key_data.encode()).hexdigest()

API Endpoint Patterns

FastAPI Best Practices

from fastapi import APIRouter, HTTPException, BackgroundTasks
from pydantic import BaseModel, HttpUrl

router = APIRouter(prefix="/api", tags=["summarization"])

class SummarizeRequest(BaseModel):
    url: HttpUrl
    model: str = "auto"
    options: dict = {}

@router.post("/summarize")
async def summarize_video(
    request: SummarizeRequest,
    background_tasks: BackgroundTasks
):
    # Validate URL
    # Extract video ID
    # Check cache
    # Queue for processing if needed
    # Return job ID for status checking

Database Schema

-- Main summaries table
CREATE TABLE summaries (
    id UUID PRIMARY KEY,
    video_id VARCHAR(20) NOT NULL,
    video_title TEXT,
    video_url TEXT NOT NULL,
    transcript TEXT,
    summary TEXT,
    key_points JSONB,
    chapters JSONB,
    model_used VARCHAR(50),
    processing_time FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Cache for performance
CREATE INDEX idx_video_id ON summaries(video_id);
CREATE INDEX idx_created_at ON summaries(created_at);

Error Handling

class YouTubeError(Exception):
    """Base exception for YouTube-related errors"""
    pass

class TranscriptNotAvailable(YouTubeError):
    """Raised when transcript cannot be extracted"""
    pass

class AIServiceError(Exception):
    """Base exception for AI service errors"""
    pass

class TokenLimitExceeded(AIServiceError):
    """Raised when content exceeds model token limit"""
    pass

# Global error handler
@app.exception_handler(YouTubeError)
async def youtube_error_handler(request, exc):
    return JSONResponse(
        status_code=400,
        content={"error": str(exc), "type": "youtube_error"}
    )

Environment Variables

# Required
OPENAI_API_KEY=sk-...        # At least one AI key required
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
DATABASE_URL=sqlite:///./data/youtube_summarizer.db
SECRET_KEY=your-secret-key

# Optional but recommended
YOUTUBE_API_KEY=AIza...       # For metadata and quota
REDIS_URL=redis://localhost:6379/0
RATE_LIMIT_PER_MINUTE=30
MAX_VIDEO_LENGTH_MINUTES=180

Testing Guidelines

Unit Test Structure

# tests/unit/test_youtube_service.py
import pytest
from unittest.mock import Mock, patch
from src.services.youtube import YouTubeService

@pytest.fixture
def youtube_service():
    return YouTubeService()

def test_extract_video_id(youtube_service):
    urls = [
        ("https://youtube.com/watch?v=abc123", "abc123"),
        ("https://youtu.be/xyz789", "xyz789"),
        ("https://www.youtube.com/embed/qwe456", "qwe456")
    ]
    for url, expected_id in urls:
        assert youtube_service.extract_video_id(url) == expected_id

Integration Test Pattern

# tests/integration/test_api.py
from fastapi.testclient import TestClient
from src.main import app

client = TestClient(app)

def test_summarize_endpoint():
    response = client.post("/api/summarize", json={
        "url": "https://youtube.com/watch?v=test123",
        "model": "openai"
    })
    assert response.status_code == 200
    assert "job_id" in response.json()

Performance Optimization

  1. Async Everything: Use async/await for all I/O operations
  2. Background Tasks: Process summaries in background
  3. Caching Layers:
    • Memory cache for hot data
    • Database cache for persistence
    • CDN for static exports
  4. Rate Limiting: Implement per-IP and per-user limits
  5. Token Optimization:
    • Chunk long transcripts
    • Use map-reduce for summaries
    • Implement progressive summarization

Security Considerations

  1. Input Validation: Validate all YouTube URLs
  2. API Key Management: Use environment variables, never commit keys
  3. Rate Limiting: Prevent abuse and API exhaustion
  4. CORS Configuration: Restrict to known domains in production
  5. SQL Injection Prevention: Use parameterized queries
  6. XSS Protection: Sanitize all user inputs
  7. Authentication: Implement JWT for user sessions (Phase 3)

Common Issues and Solutions

Issue: Transcript Not Available

# Solution: Implement fallback chain
try:
    transcript = await get_youtube_transcript(video_id)
except TranscriptNotAvailable:
    # Try auto-generated captions
    transcript = await get_auto_captions(video_id)
    if not transcript:
        # Use audio transcription as last resort
        transcript = await transcribe_audio(video_id)

Issue: Token Limit Exceeded

# Solution: Implement chunking
def chunk_transcript(transcript, max_tokens=3000):
    chunks = []
    current_chunk = []
    current_tokens = 0
    
    for segment in transcript:
        segment_tokens = count_tokens(segment)
        if current_tokens + segment_tokens > max_tokens:
            chunks.append(current_chunk)
            current_chunk = [segment]
            current_tokens = segment_tokens
        else:
            current_chunk.append(segment)
            current_tokens += segment_tokens
    
    if current_chunk:
        chunks.append(current_chunk)
    
    return chunks

Issue: Rate Limiting

# Solution: Implement exponential backoff
import asyncio
from typing import Optional

async def retry_with_backoff(
    func, 
    max_retries: int = 3,
    initial_delay: float = 1.0
) -> Optional[Any]:
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return await func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # Exponential backoff

Development Tips

  1. Start with Task 1: Setup and environment configuration
  2. Test Early: Write tests as you implement features
  3. Use Type Hints: Improve code quality and IDE support
  4. Document APIs: Use FastAPI's automatic documentation
  5. Log Everything: Implement comprehensive logging for debugging
  6. Cache Aggressively: Reduce API calls and improve response times
  7. Handle Errors Gracefully: Provide helpful error messages to users

Task Master Integration

This project uses Task Master for task management. Key commands:

# View current progress
task-master list

# Get detailed task info
task-master show 1

# Expand task into subtasks
task-master expand --id=1 --research

# Update task with progress
task-master update-task --id=1 --prompt="Completed API structure"

# Complete task
task-master set-status --id=1 --status=done

Current Focus Areas (Based on Task Master)

  1. Task 1: Setup Project Structure and Environment ⬅️ Start here
  2. Task 2: Implement YouTube Transcript Extraction
  3. Task 3: Develop AI Summary Generation Service
  4. Task 4: Create Basic Frontend Interface
  5. Task 5: Implement FastAPI Backend Endpoints

Remember to check task dependencies and complete prerequisites before moving to dependent tasks.


This guide is specifically tailored for Claude Code development on the YouTube Summarizer project.