# CLAUDE.md - YouTube Summarizer

This file provides guidance to Claude Code (claude.ai/code) when working with the YouTube Summarizer project.

## Project Overview

An AI-powered web application that automatically extracts, transcribes, and summarizes YouTube videos. The application supports multiple AI models (OpenAI, Anthropic, DeepSeek), provides various export formats, and includes intelligent caching for efficiency.

**Status**: Development Phase - 12 tasks (0% complete) managed via Task Master

## Quick Start Commands

```bash
# Development
cd apps/youtube-summarizer
source venv/bin/activate                          # Activate virtual environment
python src/main.py                                # Run the application (port 8082)

# Task Management
task-master list                                  # View all tasks
task-master next                                  # Get next task to work on
task-master show <id>                             # View task details
task-master set-status --id=<id> --status=done    # Mark task complete

# Testing
pytest tests/ -v                                  # Run tests
pytest tests/ --cov=src                           # Run with coverage

# Git Operations
git add .
git commit -m "feat: implement task X.Y"
git push origin main
```

## Architecture

```
YouTube Summarizer
├── API Layer (FastAPI)
│   ├── /api/summarize     - Submit URL for summarization
│   ├── /api/summary/{id}  - Retrieve summary
│   └── /api/export/{id}   - Export in various formats
├── Service Layer
│   ├── YouTube Service    - Transcript extraction
│   ├── AI Service         - Summary generation
│   └── Cache Service      - Performance optimization
└── Data Layer
    ├── SQLite/PostgreSQL  - Summary storage
    └── Redis (optional)   - Caching layer
```

## Development Workflow

### 1. Check Current Task

```bash
task-master next
task-master show <id>
```

### 2. Implement Feature

Follow the task details and implement in the appropriate modules:

- API endpoints → `src/api/`
- Business logic → `src/services/`
- Utilities → `src/utils/`

### 3. Test Implementation

```bash
# Unit tests
pytest tests/unit/test_<module>.py -v

# Integration tests
pytest tests/integration/ -v

# Manual testing
python src/main.py
# Visit http://localhost:8082/docs for API testing
```

### 4. Update Task Status

```bash
# Log progress
task-master update-subtask --id=<id> --prompt="Implemented X, tested Y"

# Mark complete
task-master set-status --id=<id> --status=done
```

## Key Implementation Areas

### YouTube Integration (`src/services/youtube.py`)

```python
# Primary: youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

# Fallback: yt-dlp for metadata
import yt_dlp

# Requirements:
# - Extract video ID from various URL formats
# - Handle multiple subtitle languages
# - Implement retry logic for failures
```

### AI Summarization (`src/services/summarizer.py`)

```python
# Multi-model support
class SummarizerService:
    def __init__(self):
        self.models = {
            'openai': OpenAISummarizer(),
            'anthropic': AnthropicSummarizer(),
            'deepseek': DeepSeekSummarizer()
        }

    async def summarize(self, transcript, model='auto'):
        # Implement model selection logic
        # Handle token limits
        # Generate structured summaries
        ...
```

### Caching Strategy (`src/services/cache.py`)

```python
# Cache at multiple levels:
# 1. Transcript cache (by video_id)
# 2. Summary cache (by video_id + model + params)
# 3. Export cache (by summary_id + format)

# Use a hash for cache keys
import hashlib
import json

def get_cache_key(video_id: str, model: str, params: dict) -> str:
    key_data = f"{video_id}:{model}:{json.dumps(params, sort_keys=True)}"
    return hashlib.sha256(key_data.encode()).hexdigest()
```
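The three cache levels can share a single lookup path. Below is a minimal two-tier sketch built on `get_cache_key()` above; the module-level dict and the `redis_client` handle are illustrative assumptions (an optional `redis.asyncio.Redis` created with `decode_responses=True`), not existing project code:

```python
from typing import Optional

_memory_cache: dict[str, str] = {}
redis_client = None  # assumed: redis.asyncio.Redis(decode_responses=True), or None

async def get_cached_summary(video_id: str, model: str, params: dict) -> Optional[str]:
    key = get_cache_key(video_id, model, params)
    # Tier 1: in-process memory (fastest, per-worker)
    if key in _memory_cache:
        return _memory_cache[key]
    # Tier 2: Redis, if configured (shared across workers)
    if redis_client is not None:
        cached = await redis_client.get(key)
        if cached is not None:
            _memory_cache[key] = cached  # promote hot entries to memory
            return cached
    return None  # miss: the caller generates the summary and writes both tiers
```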
## API Endpoint Patterns

### FastAPI Best Practices

```python
from fastapi import APIRouter, HTTPException, BackgroundTasks
from pydantic import BaseModel, HttpUrl

router = APIRouter(prefix="/api", tags=["summarization"])

class SummarizeRequest(BaseModel):
    url: HttpUrl
    model: str = "auto"
    options: dict = {}

@router.post("/summarize")
async def summarize_video(
    request: SummarizeRequest,
    background_tasks: BackgroundTasks
):
    # Validate URL
    # Extract video ID
    # Check cache
    # Queue for processing if needed
    # Return job ID for status checking
    ...
```

## Database Schema

```sql
-- Main summaries table
CREATE TABLE summaries (
    id UUID PRIMARY KEY,
    video_id VARCHAR(20) NOT NULL,
    video_title TEXT,
    video_url TEXT NOT NULL,
    transcript TEXT,
    summary TEXT,
    key_points JSONB,
    chapters JSONB,
    model_used VARCHAR(50),
    processing_time FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Indexes for performance
CREATE INDEX idx_video_id ON summaries(video_id);
CREATE INDEX idx_created_at ON summaries(created_at);
```

## Error Handling

```python
from fastapi.responses import JSONResponse

class YouTubeError(Exception):
    """Base exception for YouTube-related errors"""
    pass

class TranscriptNotAvailable(YouTubeError):
    """Raised when transcript cannot be extracted"""
    pass

class AIServiceError(Exception):
    """Base exception for AI service errors"""
    pass

class TokenLimitExceeded(AIServiceError):
    """Raised when content exceeds model token limit"""
    pass

# Global error handler (app is the FastAPI instance from src/main.py)
@app.exception_handler(YouTubeError)
async def youtube_error_handler(request, exc):
    return JSONResponse(
        status_code=400,
        content={"error": str(exc), "type": "youtube_error"}
    )
```

## Environment Variables

```bash
# Required
OPENAI_API_KEY=sk-...          # At least one AI key required
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
DATABASE_URL=sqlite:///./data/youtube_summarizer.db
SECRET_KEY=your-secret-key

# Optional but recommended
YOUTUBE_API_KEY=AIza...        # For metadata and quota
REDIS_URL=redis://localhost:6379/0
RATE_LIMIT_PER_MINUTE=30
MAX_VIDEO_LENGTH_MINUTES=180
```

## Testing Guidelines

### Unit Test Structure

```python
# tests/unit/test_youtube_service.py
import pytest
from unittest.mock import Mock, patch
from src.services.youtube import YouTubeService

@pytest.fixture
def youtube_service():
    return YouTubeService()

def test_extract_video_id(youtube_service):
    urls = [
        ("https://youtube.com/watch?v=abc123", "abc123"),
        ("https://youtu.be/xyz789", "xyz789"),
        ("https://www.youtube.com/embed/qwe456", "qwe456")
    ]
    for url, expected_id in urls:
        assert youtube_service.extract_video_id(url) == expected_id
```

### Integration Test Pattern

```python
# tests/integration/test_api.py
from fastapi.testclient import TestClient
from src.main import app

client = TestClient(app)

def test_summarize_endpoint():
    response = client.post("/api/summarize", json={
        "url": "https://youtube.com/watch?v=test123",
        "model": "openai"
    })
    assert response.status_code == 200
    assert "job_id" in response.json()
```

## Performance Optimization

1. **Async Everything**: Use async/await for all I/O operations
2. **Background Tasks**: Process summaries in the background
3. **Caching Layers**:
   - Memory cache for hot data
   - Database cache for persistence
   - CDN for static exports
4. **Rate Limiting**: Implement per-IP and per-user limits (see the dependency sketch after this list)
5. **Token Optimization** (see the map-reduce sketch after this list):
   - Chunk long transcripts
   - Use map-reduce for summaries
   - Implement progressive summarization
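For item 4, a server-side limiter can be as small as a FastAPI dependency. This is a minimal single-process sketch (a shared Redis counter would be needed across multiple workers); the names and the 30-per-60-seconds figures, mirroring `RATE_LIMIT_PER_MINUTE` above, are illustrative:

```python
import time
from collections import defaultdict, deque

from fastapi import HTTPException, Request

RATE_LIMIT = 30        # requests per window; mirrors RATE_LIMIT_PER_MINUTE
WINDOW_SECONDS = 60.0

_hits: dict[str, deque] = defaultdict(deque)

async def rate_limit(request: Request) -> None:
    """Sliding-window per-IP limit, for use as a route dependency."""
    ip = request.client.host if request.client else "unknown"
    now = time.monotonic()
    hits = _hits[ip]
    # Evict timestamps that have fallen out of the window
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) >= RATE_LIMIT:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    hits.append(now)
```

Attach it to a route with `@router.post("/summarize", dependencies=[Depends(rate_limit)])`.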
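For item 5, map-reduce summarization follows directly from the `chunk_transcript()` helper shown under Common Issues below: summarize each chunk independently, then summarize the partial summaries. A minimal sketch, with the `summarize` callable standing in for whichever model client is selected:

```python
import asyncio
from typing import Awaitable, Callable

async def map_reduce_summarize(
    transcript: list[str],
    summarize: Callable[[str], Awaitable[str]],  # one model call: text -> summary
    max_tokens: int = 3000,
) -> str:
    # Map: split the transcript and summarize each chunk concurrently
    chunks = chunk_transcript(transcript, max_tokens)  # helper defined below
    partials = await asyncio.gather(
        *(summarize("\n".join(chunk)) for chunk in chunks)
    )
    # Reduce: a final pass condenses the partial summaries into one
    return await summarize("\n\n".join(partials))
```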
## Security Considerations

1. **Input Validation**: Validate all YouTube URLs
2. **API Key Management**: Use environment variables, never commit keys
3. **Rate Limiting**: Prevent abuse and API exhaustion
4. **CORS Configuration**: Restrict to known domains in production
5. **SQL Injection Prevention**: Use parameterized queries
6. **XSS Protection**: Sanitize all user inputs
7. **Authentication**: Implement JWT for user sessions (Phase 3)

## Common Issues and Solutions

### Issue: Transcript Not Available

```python
# Solution: Implement a fallback chain
try:
    transcript = await get_youtube_transcript(video_id)
except TranscriptNotAvailable:
    # Try auto-generated captions
    transcript = await get_auto_captions(video_id)
    if not transcript:
        # Use audio transcription as a last resort
        transcript = await transcribe_audio(video_id)
```

### Issue: Token Limit Exceeded

```python
# Solution: Implement chunking
def chunk_transcript(transcript, max_tokens=3000):
    chunks = []
    current_chunk = []
    current_tokens = 0
    for segment in transcript:
        segment_tokens = count_tokens(segment)  # e.g. a tiktoken-based counter
        # Start a new chunk only if the current one is non-empty,
        # so an oversized first segment never produces an empty chunk
        if current_chunk and current_tokens + segment_tokens > max_tokens:
            chunks.append(current_chunk)
            current_chunk = [segment]
            current_tokens = segment_tokens
        else:
            current_chunk.append(segment)
            current_tokens += segment_tokens
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
```

### Issue: Rate Limiting

```python
# Solution: Implement exponential backoff
import asyncio
from typing import Any, Optional

async def retry_with_backoff(
    func,
    max_retries: int = 3,
    initial_delay: float = 1.0
) -> Optional[Any]:
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return await func()
        except RateLimitError:  # the configured provider SDK's rate-limit error
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # Exponential backoff
```

## Development Tips

1. **Start with Task 1**: Setup and environment configuration
2. **Test Early**: Write tests as you implement features
3. **Use Type Hints**: Improve code quality and IDE support
4. **Document APIs**: Use FastAPI's automatic documentation
5. **Log Everything**: Implement comprehensive logging for debugging
6. **Cache Aggressively**: Reduce API calls and improve response times
7. **Handle Errors Gracefully**: Provide helpful error messages to users

## Task Master Integration

This project uses Task Master for task management. Key commands:

```bash
# View current progress
task-master list

# Get detailed task info
task-master show 1

# Expand task into subtasks
task-master expand --id=1 --research

# Update task with progress
task-master update-task --id=1 --prompt="Completed API structure"

# Complete task
task-master set-status --id=1 --status=done
```

## Related Documentation

- [Project README](README.md) - General project information
- [AGENTS.md](AGENTS.md) - Development workflow and standards
- [Task Master Guide](.taskmaster/CLAUDE.md) - Task management details
- [API Documentation](http://localhost:8082/docs) - Interactive API docs (when running)

## Current Focus Areas (Based on Task Master)

1. **Task 1**: Setup Project Structure and Environment ⬅️ Start here
2. **Task 2**: Implement YouTube Transcript Extraction
3. **Task 3**: Develop AI Summary Generation Service
4. **Task 4**: Create Basic Frontend Interface
5. **Task 5**: Implement FastAPI Backend Endpoints

Remember to check task dependencies and complete prerequisites before moving on to dependent tasks.
---

*This guide is specifically tailored for Claude Code development on the YouTube Summarizer project.*