# CLAUDE.md - YouTube Summarizer

This file provides guidance to Claude Code (claude.ai/code) when working with the YouTube Summarizer project.

## Project Overview

An AI-powered web application that automatically extracts, transcribes, and summarizes YouTube videos. The application supports multiple AI models (OpenAI, Anthropic, DeepSeek), provides various export formats, and includes intelligent caching for efficiency.

**Status**: Development Ready - All Epic 1 & 2 stories created and ready for implementation

- **Epic 1**: Foundation & Core YouTube Integration (Story 1.1 ✅ Complete, Stories 1.2-1.4 📋 Ready)
- **Epic 2**: AI Summarization Engine (Stories 2.1-2.5 📋 All Created and Ready)
- **Epic 3**: Enhanced User Experience (Future - Ready for story creation)

## Quick Start Commands

```bash
# Development Setup
cd apps/youtube-summarizer
docker-compose up                 # Start full development environment

# BMad Method Story Management
/BMad:agents:sm                   # Activate Scrum Master agent
*draft                            # Create next story
*story-checklist                  # Validate story quality

# Development Agent Implementation
/BMad:agents:dev                  # Activate Development agent
# Follow story specifications in docs/stories/

# Direct Development (without BMad agents)
source venv/bin/activate          # Activate virtual environment
python backend/main.py            # Run backend (port 8000)
cd frontend && npm run dev        # Run frontend (port 3000)

# Testing
pytest backend/tests/ -v          # Backend tests
cd frontend && npm test           # Frontend tests

# Git Operations
git add .
git commit -m "feat: implement story 1.2 - URL validation"
git push origin main
```

## Architecture

```
YouTube Summarizer
├── API Layer (FastAPI)
│   ├── /api/summarize     - Submit URL for summarization
│   ├── /api/summary/{id}  - Retrieve summary
│   └── /api/export/{id}   - Export in various formats
├── Service Layer
│   ├── YouTube Service    - Transcript extraction
│   ├── AI Service         - Summary generation
│   └── Cache Service      - Performance optimization
└── Data Layer
    ├── SQLite/PostgreSQL  - Summary storage
    └── Redis (optional)   - Caching layer
```

## Development Workflow - BMad Method

### Story-Driven Development Process

**Current Epic**: Epic 1 - Foundation & Core YouTube Integration

**Current Stories**:

- ✅ Story 1.1: Project Setup and Infrastructure (Completed)
- 📝 Story 1.2: YouTube URL Validation and Parsing (Ready for implementation)
- ⏳ Story 1.3: Transcript Extraction Service (Pending)
- ⏳ Story 1.4: Basic Web Interface (Pending)

### 1. Story Planning (Scrum Master)

```bash
# Activate Scrum Master agent
/BMad:agents:sm
*draft              # Create next story in sequence
*story-checklist    # Validate story completeness
```

### 2. Story Implementation (Development Agent)

```bash
# Activate Development agent
/BMad:agents:dev

# Review story file: docs/stories/{epic}.{story}.{name}.md
# Follow detailed Dev Notes and architecture references
# Implement all tasks and subtasks as specified
```

### 3. Implementation Locations

Based on architecture and story specifications:

- **Backend API** → `backend/api/`
- **Backend Services** → `backend/services/` (see the sketch below)
- **Backend Models** → `backend/models/`
- **Frontend Components** → `frontend/src/components/`
- **Frontend Hooks** → `frontend/src/hooks/`
- **Frontend API Client** → `frontend/src/api/`
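As a concrete starting point for Story 1.2, here is a minimal sketch of a video ID helper that would live under `backend/services/`. The function name matches the unit tests later in this guide, but the module path and the set of supported URL shapes are assumptions, not story requirements:

```python
# backend/services/youtube.py (illustrative location)
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

def extract_video_id(url: str) -> Optional[str]:
    """Return the YouTube video ID from a URL, or None if unrecognized."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        return parsed.path.strip("/") or None
    if host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard links: https://youtube.com/watch?v=<id>
            return parse_qs(parsed.query).get("v", [None])[0]
        # Embed/shorts/legacy links: /embed/<id>, /shorts/<id>, /v/<id>
        match = re.match(r"^/(?:embed|shorts|v)/([A-Za-z0-9_-]+)", parsed.path)
        if match:
            return match.group(1)
    return None
```

Using `urllib.parse` rather than a single regex keeps each URL family explicit and easy to extend when the story's acceptance criteria enumerate more formats.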
### 4. Testing Implementation

```bash
# Backend testing (pytest)
pytest backend/tests/unit/test_*.py -v
pytest backend/tests/integration/ -v

# Frontend testing (Vitest + RTL)
cd frontend && npm test
cd frontend && npm run test:coverage

# Manual testing
docker-compose up      # Full stack
# Visit http://localhost:3000 (frontend)
# Visit http://localhost:8000/docs (API docs)
```

### 5. Story Completion

- Mark all tasks/subtasks complete in story file
- Update story status from "Draft" to "Done"
- Run story validation checklist
- Update epic progress tracking

## Key Implementation Areas

### YouTube Integration (`src/services/youtube.py`)

```python
# Primary: youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

# Fallback: yt-dlp for metadata
import yt_dlp

# Extract video ID from various URL formats
# Handle multiple subtitle languages
# Implement retry logic for failures
```

### AI Summarization (`src/services/summarizer.py`)

```python
# Multi-model support
class SummarizerService:
    def __init__(self):
        self.models = {
            'openai': OpenAISummarizer(),
            'anthropic': AnthropicSummarizer(),
            'deepseek': DeepSeekSummarizer()
        }

    async def summarize(self, transcript, model='auto'):
        # Implement model selection logic
        # Handle token limits
        # Generate structured summaries
        ...
```

### Caching Strategy (`src/services/cache.py`)

```python
# Cache at multiple levels:
# 1. Transcript cache (by video_id)
# 2. Summary cache (by video_id + model + params)
# 3. Export cache (by summary_id + format)

# Use a hash for cache keys
import hashlib
import json

def get_cache_key(video_id: str, model: str, params: dict) -> str:
    key_data = f"{video_id}:{model}:{json.dumps(params, sort_keys=True)}"
    return hashlib.sha256(key_data.encode()).hexdigest()
```

## API Endpoint Patterns

### FastAPI Best Practices

```python
from fastapi import APIRouter, HTTPException, BackgroundTasks
from pydantic import BaseModel, HttpUrl

router = APIRouter(prefix="/api", tags=["summarization"])

class SummarizeRequest(BaseModel):
    url: HttpUrl
    model: str = "auto"
    options: dict = {}

@router.post("/summarize")
async def summarize_video(
    request: SummarizeRequest,
    background_tasks: BackgroundTasks
):
    # Validate URL
    # Extract video ID
    # Check cache
    # Queue for processing if needed
    # Return job ID for status checking
    ...
```

## Database Schema

```sql
-- Main summaries table
CREATE TABLE summaries (
    id UUID PRIMARY KEY,
    video_id VARCHAR(20) NOT NULL,
    video_title TEXT,
    video_url TEXT NOT NULL,
    transcript TEXT,
    summary TEXT,
    key_points JSONB,
    chapters JSONB,
    model_used VARCHAR(50),
    processing_time FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Indexes for lookup performance
CREATE INDEX idx_video_id ON summaries(video_id);
CREATE INDEX idx_created_at ON summaries(created_at);
```

## Error Handling

```python
from fastapi.responses import JSONResponse

class YouTubeError(Exception):
    """Base exception for YouTube-related errors"""
    pass

class TranscriptNotAvailable(YouTubeError):
    """Raised when transcript cannot be extracted"""
    pass

class AIServiceError(Exception):
    """Base exception for AI service errors"""
    pass

class TokenLimitExceeded(AIServiceError):
    """Raised when content exceeds model token limit"""
    pass

# Global error handler (app is the FastAPI instance)
@app.exception_handler(YouTubeError)
async def youtube_error_handler(request, exc):
    return JSONResponse(
        status_code=400,
        content={"error": str(exc), "type": "youtube_error"}
    )
```
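The same pattern extends to the AI-side exceptions. A minimal companion handler, reusing the imports above and assuming 502 is the right signal for an upstream provider failure (the status code is a design choice, not something the PRD specifies):

```python
@app.exception_handler(AIServiceError)
async def ai_service_error_handler(request, exc):
    # 502 tells the client the failure came from an upstream AI provider,
    # not from their request
    return JSONResponse(
        status_code=502,
        content={"error": str(exc), "type": "ai_service_error"}
    )
```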
## Environment Variables

```bash
# Required
OPENAI_API_KEY=sk-...              # At least one AI key required
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
DATABASE_URL=sqlite:///./data/youtube_summarizer.db
SECRET_KEY=your-secret-key

# Optional but recommended
YOUTUBE_API_KEY=AIza...            # For metadata and quota
REDIS_URL=redis://localhost:6379/0
RATE_LIMIT_PER_MINUTE=30
MAX_VIDEO_LENGTH_MINUTES=180
```

## Testing Guidelines

### Unit Test Structure

```python
# tests/unit/test_youtube_service.py
import pytest
from unittest.mock import Mock, patch
from src.services.youtube import YouTubeService

@pytest.fixture
def youtube_service():
    return YouTubeService()

def test_extract_video_id(youtube_service):
    urls = [
        ("https://youtube.com/watch?v=abc123", "abc123"),
        ("https://youtu.be/xyz789", "xyz789"),
        ("https://www.youtube.com/embed/qwe456", "qwe456")
    ]
    for url, expected_id in urls:
        assert youtube_service.extract_video_id(url) == expected_id
```

### Integration Test Pattern

```python
# tests/integration/test_api.py
from fastapi.testclient import TestClient
from src.main import app

client = TestClient(app)

def test_summarize_endpoint():
    response = client.post("/api/summarize", json={
        "url": "https://youtube.com/watch?v=test123",
        "model": "openai"
    })
    assert response.status_code == 200
    assert "job_id" in response.json()
```

## Performance Optimization

1. **Async Everything**: Use async/await for all I/O operations
2. **Background Tasks**: Process summaries in background
3. **Caching Layers**:
   - Memory cache for hot data
   - Database cache for persistence
   - CDN for static exports
4. **Rate Limiting**: Implement per-IP and per-user limits
5. **Token Optimization**:
   - Chunk long transcripts
   - Use map-reduce for summaries
   - Implement progressive summarization

## Security Considerations

1. **Input Validation**: Validate all YouTube URLs
2. **API Key Management**: Use environment variables, never commit keys
3. **Rate Limiting**: Prevent abuse and API exhaustion
4. **CORS Configuration**: Restrict to known domains in production
5. **SQL Injection Prevention**: Use parameterized queries
6. **XSS Protection**: Sanitize all user inputs
7. **Authentication**: Implement JWT for user sessions (Phase 3)

## Common Issues and Solutions

### Issue: Transcript Not Available

```python
# Solution: Implement fallback chain
try:
    transcript = await get_youtube_transcript(video_id)
except TranscriptNotAvailable:
    # Try auto-generated captions
    transcript = await get_auto_captions(video_id)
    if not transcript:
        # Use audio transcription as last resort
        transcript = await transcribe_audio(video_id)
```

### Issue: Token Limit Exceeded

```python
# Solution: Implement chunking
def chunk_transcript(transcript, max_tokens=3000):
    # count_tokens is an assumed tokenizer helper (e.g., tiktoken-based)
    chunks = []
    current_chunk = []
    current_tokens = 0

    for segment in transcript:
        segment_tokens = count_tokens(segment)
        if current_tokens + segment_tokens > max_tokens:
            chunks.append(current_chunk)
            current_chunk = [segment]
            current_tokens = segment_tokens
        else:
            current_chunk.append(segment)
            current_tokens += segment_tokens

    if current_chunk:
        chunks.append(current_chunk)
    return chunks
```

### Issue: Rate Limiting

```python
# Solution: Implement exponential backoff
import asyncio
from typing import Any, Optional

async def retry_with_backoff(
    func,
    max_retries: int = 3,
    initial_delay: float = 1.0
) -> Optional[Any]:
    # func is a zero-argument callable returning an awaitable;
    # RateLimitError stands in for your AI provider's rate-limit exception
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return await func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # Exponential backoff
```
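Wiring this around a provider call might look like the following; `call_openai` is a placeholder for whichever async SDK call you are protecting, not a real helper in this codebase:

```python
async def summarize_chunk(chunk: str) -> str:
    # The lambda defers the call so retry_with_backoff controls when it runs
    return await retry_with_backoff(
        lambda: call_openai(prompt=f"Summarize:\n{chunk}"),
        max_retries=5,
        initial_delay=2.0,
    )
```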
## Development Tips

1. **Start with Task 1**: Setup and environment configuration
2. **Test Early**: Write tests as you implement features
3. **Use Type Hints**: Improve code quality and IDE support
4. **Document APIs**: Use FastAPI's automatic documentation
5. **Log Everything**: Implement comprehensive logging for debugging (see the sketch below)
6. **Cache Aggressively**: Reduce API calls and improve response times
7. **Handle Errors Gracefully**: Provide helpful error messages to users
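For tip 5, a minimal sketch of a startup logging configuration; the module location, format string, and logger name are all assumptions:

```python
# backend/core/logging.py (illustrative location)
import logging

def configure_logging(level: int = logging.INFO) -> None:
    """Call once at application startup, before handling requests."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    )

logger = logging.getLogger("youtube_summarizer")

def log_summary_request(video_id: str, model: str) -> None:
    # Include enough context to trace one request end to end
    logger.info("summarize requested video_id=%s model=%s", video_id, model)
```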
## Task Master Integration

This project uses Task Master for task management. Key commands:

```bash
# View current progress
task-master list

# Get detailed task info
task-master show 1

# Expand task into subtasks
task-master expand --id=1 --research

# Update task with progress
task-master update-task --id=1 --prompt="Completed API structure"

# Complete task
task-master set-status --id=1 --status=done
```

## BMad Method Documentation Structure

### Core Documentation

- **[Project README](README.md)** - General project information and setup
- **[Architecture](docs/architecture.md)** - Complete technical architecture specification
- **[Front-End Spec](docs/front-end-spec.md)** - UI/UX requirements and component specifications
- **[Original PRD](docs/prd.md)** - Complete product requirements document

### Epic and Story Management

- **[Epic Index](docs/prd/index.md)** - Epic overview and progress tracking
- **[Epic 1](docs/prd/epic-1-foundation-core-youtube-integration.md)** - Foundation epic details
- **[Epic 2](docs/prd/epic-2-ai-summarization-engine.md)** - AI engine epic details
- **[Epic 3](docs/prd/epic-3-enhanced-user-experience.md)** - Advanced features epic
- **[Stories](docs/stories/)** - Individual story implementations

### Current Story Files

**Epic 1 - Foundation (Sprint 1)**:

- **[Story 1.1](docs/stories/1.1.project-setup-infrastructure.md)** - ✅ Project setup (COMPLETED)
- **[Story 1.2](docs/stories/1.2.youtube-url-validation-parsing.md)** - 📋 URL validation (READY)
- **[Story 1.3](docs/stories/1.3.transcript-extraction-service.md)** - 📋 Transcript extraction (READY)
- **[Story 1.4](docs/stories/1.4.basic-web-interface.md)** - 📋 Web interface (READY)

**Epic 2 - AI Engine (Sprints 2-3)**:

- **[Story 2.1](docs/stories/2.1.single-ai-model-integration.md)** - 📋 OpenAI integration (READY)
- **[Story 2.2](docs/stories/2.2.summary-generation-pipeline.md)** - 📋 Pipeline orchestration (READY)
- **[Story 2.3](docs/stories/2.3.caching-system-implementation.md)** - 📋 Caching system (READY)
- **[Story 2.4](docs/stories/2.4.multi-model-support.md)** - 📋 Multi-model AI (READY)
- **[Story 2.5](docs/stories/2.5.export-functionality.md)** - 📋 Export features (READY)

### Development Workflow

1. **Check Epic Progress**: Review [Epic Index](docs/prd/index.md) for current status
2. **Review Next Story**: Read story file for implementation details
3. **Follow Dev Notes**: Use architecture references and technical specifications
4. **Implement & Test**: Follow story tasks/subtasks systematically
5. **Update Progress**: Mark story complete and update epic status

### Story-Based Implementation Priority

**Current Focus**: Epic 1 - Foundation & Core YouTube Integration

**Sprint 1 (Weeks 1-2)** - Epic 1 Implementation:

1. **Story 1.2** - YouTube URL Validation and Parsing (8-12 hours) ⬅️ **START HERE**
2. **Story 1.3** - Transcript Extraction Service (16-20 hours)
3. **Story 1.4** - Basic Web Interface (16-24 hours)

**Sprint 2 (Weeks 3-4)** - Epic 2 Core:

4. **Story 2.1** - Single AI Model Integration (12-16 hours)
5. **Story 2.2** - Summary Generation Pipeline (16-20 hours)
6. **Story 2.3** - Caching System Implementation (12-16 hours)

**Sprint 3 (Weeks 5-6)** - Epic 2 Advanced:

7. **Story 2.4** - Multi-Model Support (16-20 hours)
8. **Story 2.5** - Export Functionality (12-16 hours)

**Developer Resources**:

- [Developer Handoff Guide](docs/DEVELOPER_HANDOFF.md) - Start here for implementation
- [Sprint Planning](docs/SPRINT_PLANNING.md) - Detailed sprint breakdown
- [Story Files](docs/stories/) - All stories with complete Dev Notes

---

*This guide is specifically tailored for Claude Code development on the YouTube Summarizer project.*