# AGENTS.md - YouTube Summarizer Development Standards

This document defines development workflows, standards, and best practices for the YouTube Summarizer project. It serves as a guide for both human developers and AI agents working on this codebase.

## Table of Contents

1. [Development Workflow](#1-development-workflow)
2. [Code Standards](#2-code-standards)
3. [Testing Requirements](#3-testing-requirements)
4. [Documentation Standards](#4-documentation-standards)
5. [Git Workflow](#5-git-workflow)
6. [API Design Standards](#6-api-design-standards)
7. [Database Operations](#7-database-operations)
8. [Performance Guidelines](#8-performance-guidelines)
9. [Security Protocols](#9-security-protocols)
10. [Deployment Process](#10-deployment-process)

## 1. Development Workflow

### Task-Driven Development

All development follows the Task Master workflow:

```bash
# 1. Get next task
task-master next

# 2. Review task details
task-master show <id>

# 3. Expand if needed
task-master expand --id=<id> --research

# 4. Set to in-progress
task-master set-status --id=<id> --status=in-progress

# 5. Implement feature
# ... code implementation ...

# 6. Test implementation
pytest tests/

# 7. Update task with notes
task-master update-subtask --id=<id> --prompt="Implementation details..."

# 8. Mark complete
task-master set-status --id=<id> --status=done
```

### Feature Implementation Checklist

- [ ] Review task requirements
- [ ] Write unit tests first (TDD)
- [ ] Implement feature
- [ ] Add integration tests
- [ ] Update documentation
- [ ] Run full test suite
- [ ] Update task status
- [ ] Commit with descriptive message

## 2. Code Standards

### Python Style Guide

```python
"""
Module docstring describing purpose and usage.
"""
from typing import List, Optional, Dict, Any
import asyncio
from datetime import datetime

# Constants in UPPER_CASE
DEFAULT_TIMEOUT = 30
MAX_RETRIES = 3


class YouTubeSummarizer:
    """
    Class for summarizing YouTube videos.
    Attributes:
        model: AI model to use for summarization
        cache: Cache service instance
    """

    def __init__(self, model: str = "openai"):
        """Initialize summarizer with specified model."""
        self.model = model
        self.cache = CacheService()

    async def summarize(
        self,
        video_url: str,
        options: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        """
        Summarize a YouTube video.

        Args:
            video_url: YouTube video URL
            options: Optional summarization parameters

        Returns:
            Dictionary containing summary and metadata

        Raises:
            YouTubeError: If video cannot be accessed
            AIServiceError: If summarization fails
        """
        # Implementation here
        pass
```

### Type Hints

Always use type hints for better code quality:

```python
from typing import Union, List, Optional, Dict, Any, Tuple
from pydantic import BaseModel, HttpUrl


async def process_video(
    url: HttpUrl,
    models: List[str],
    max_length: Optional[int] = None
) -> Tuple[str, Dict[str, Any]]:
    """Process video with type safety."""
    pass
```

### Async/Await Pattern

Use async for all I/O operations:

```python
import aiohttp


async def fetch_transcript(video_id: str) -> str:
    """Fetch transcript asynchronously."""
    url = f"https://example.com/transcripts/{video_id}"  # placeholder endpoint
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


# Use asyncio.gather for parallel operations
results = await asyncio.gather(
    fetch_transcript(id1),
    fetch_transcript(id2),
    fetch_transcript(id3)
)
```
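As a worked example of the conventions above (type hints, docstrings, small pure functions), here is a minimal sketch of a video-ID extractor of the kind the test suite exercises. The standalone function name and regexes are illustrative; the real helper lives as a method on `YouTubeService` in `src/services/youtube.py`.

```python
import re
from typing import Optional

# Patterns for the common YouTube URL shapes: watch links, short links,
# and embed links. Each captures the video ID in group 1.
_VIDEO_ID_PATTERNS = [
    re.compile(r"youtube\.com/watch\?(?:.*&)?v=([\w-]{1,20})"),
    re.compile(r"youtu\.be/([\w-]{1,20})"),
    re.compile(r"youtube\.com/embed/([\w-]{1,20})"),
]


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID embedded in a YouTube URL, or None if absent."""
    for pattern in _VIDEO_ID_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None
```

Returning `None` for non-YouTube URLs (rather than raising) keeps the helper easy to use in validation paths; callers that need a hard failure can raise on a `None` result.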
## 3. Testing Requirements

### Test Structure

```
tests/
├── unit/
│   ├── test_youtube_service.py
│   ├── test_summarizer_service.py
│   └── test_cache_service.py
├── integration/
│   ├── test_api_endpoints.py
│   └── test_database.py
├── fixtures/
│   ├── sample_transcripts.json
│   └── mock_responses.py
└── conftest.py
```

### Unit Test Example

```python
# tests/unit/test_youtube_service.py
import pytest
from unittest.mock import Mock, patch, AsyncMock
from src.services.youtube import YouTubeService


class TestYouTubeService:

    @pytest.fixture
    def youtube_service(self):
        return YouTubeService()

    @pytest.fixture
    def mock_transcript(self):
        return [
            {"text": "Hello world", "start": 0.0, "duration": 2.0},
            {"text": "This is a test", "start": 2.0, "duration": 3.0}
        ]

    @pytest.mark.asyncio
    async def test_extract_transcript_success(
        self, youtube_service, mock_transcript
    ):
        with patch('youtube_transcript_api.YouTubeTranscriptApi.get_transcript') as mock_get:
            mock_get.return_value = mock_transcript

            result = await youtube_service.extract_transcript("test_id")

            assert result == mock_transcript
            mock_get.assert_called_once_with("test_id")

    def test_extract_video_id_various_formats(self, youtube_service):
        test_cases = [
            ("https://www.youtube.com/watch?v=abc123", "abc123"),
            ("https://youtu.be/xyz789", "xyz789"),
            ("https://youtube.com/embed/qwe456", "qwe456"),
            ("https://www.youtube.com/watch?v=test&t=123", "test")
        ]
        for url, expected_id in test_cases:
            assert youtube_service.extract_video_id(url) == expected_id
```

### Integration Test Example

```python
# tests/integration/test_api_endpoints.py
import pytest
from fastapi.testclient import TestClient
from src.main import app


@pytest.fixture
def client():
    return TestClient(app)


class TestSummarizationAPI:

    # TestClient is synchronous, so no asyncio marker is needed here
    def test_summarize_endpoint(self, client):
        response = client.post("/api/summarize", json={
            "url": "https://youtube.com/watch?v=test123",
            "model": "openai",
            "options": {"max_length": 500}
        })

        assert response.status_code == 200
        data = response.json()
        assert "job_id" in data
        assert data["status"] == "processing"

    def test_get_summary(self, client):
        # First create a summary
        create_response = client.post("/api/summarize", json={
            "url": "https://youtube.com/watch?v=test123"
        })
        job_id = create_response.json()["job_id"]

        # Then retrieve it
        get_response = client.get(f"/api/summary/{job_id}")
        assert get_response.status_code in [200, 202]  # 202 if still processing
```

### Test Coverage Requirements

- Minimum 80% code coverage
- 100% coverage for critical paths
- All edge cases tested
- Error conditions covered

```bash
# Run tests with coverage
pytest tests/ --cov=src --cov-report=html --cov-report=term

# Coverage report should show:
# src/services/youtube.py      95%
# src/services/summarizer.py   88%
# src/api/routes.py            92%
```

## 4. Documentation Standards

### Code Documentation

Every module, class, and function must have docstrings:

```python
"""
Module: YouTube Transcript Extractor

This module provides functionality to extract transcripts from
YouTube videos using multiple fallback methods.

Example:
    >>> extractor = TranscriptExtractor()
    >>> transcript = await extractor.extract("video_id")
"""


def extract_transcript(
    video_id: str,
    language: str = "en",
    include_auto_generated: bool = True
) -> List[Dict[str, Any]]:
    """
    Extract transcript from YouTube video.

    This function attempts to extract transcripts using the
    following priority:
    1. Manual captions in specified language
    2. Auto-generated captions if allowed
    3. Translated captions as fallback
    Args:
        video_id: YouTube video identifier
        language: ISO 639-1 language code (default: "en")
        include_auto_generated: Whether to use auto-generated captions

    Returns:
        List of transcript segments with text, start time, and duration

    Raises:
        TranscriptNotAvailable: If no transcript can be extracted

    Example:
        >>> transcript = extract_transcript("dQw4w9WgXcQ", "en")
        >>> print(transcript[0])
        {"text": "Never gonna give you up", "start": 0.0, "duration": 3.5}
    """
    pass
```

### API Documentation

Use FastAPI's automatic documentation features:

```python
from typing import Optional

from fastapi import APIRouter, HTTPException, status
from pydantic import BaseModel, Field

router = APIRouter()


class SummarizeRequest(BaseModel):
    """Request model for video summarization."""

    url: str = Field(
        ...,
        description="YouTube video URL",
        example="https://youtube.com/watch?v=dQw4w9WgXcQ"
    )
    model: str = Field(
        "auto",
        description="AI model to use (openai, anthropic, deepseek, auto)",
        example="openai"
    )
    max_length: Optional[int] = Field(
        None,
        description="Maximum summary length in words",
        ge=50,
        le=5000
    )


@router.post(
    "/summarize",
    response_model=SummarizeResponse,
    status_code=status.HTTP_200_OK,
    summary="Summarize YouTube Video",
    description="Submit a YouTube video URL for AI-powered summarization"
)
async def summarize_video(request: SummarizeRequest):
    """
    Summarize a YouTube video using AI.

    This endpoint accepts a YouTube URL and returns a job ID for
    tracking the summarization progress. Use the /summary/{job_id}
    endpoint to retrieve the completed summary.
    """
    pass
```
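The route above references a `SummarizeResponse` model that is not shown. A plausible sketch, inferred from the integration tests (which assert on `job_id` and `status`); any field beyond those two is illustrative:

```python
from typing import Optional

from pydantic import BaseModel, Field


class SummarizeResponse(BaseModel):
    """Response model for the /summarize endpoint (sketch)."""

    job_id: str = Field(..., description="Identifier for polling /summary/{job_id}")
    status: str = Field("processing", description="processing, completed, or failed")
    detail: Optional[str] = Field(None, description="Optional human-readable note")
```

Defaulting `status` to `"processing"` matches the asynchronous job flow: the endpoint returns immediately, and clients poll `/summary/{job_id}` for the result.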
## 5. Git Workflow

### Branch Naming

```bash
# Feature branches
feature/task-2-youtube-extraction
feature/task-3-ai-summarization

# Bugfix branches
bugfix/transcript-encoding-error
bugfix/rate-limit-handling

# Hotfix branches
hotfix/critical-api-error
```

### Commit Messages

Follow conventional commits:

```bash
# Format: <type>(<scope>): <description>

# Examples:
feat(youtube): add transcript extraction service
fix(api): handle rate limiting correctly
docs(readme): update installation instructions
test(youtube): add edge case tests
refactor(cache): optimize cache key generation
perf(summarizer): implement parallel processing
chore(deps): update requirements.txt
```

### Pull Request Template

```markdown
## Task Reference
- Task ID: #3
- Task Title: Develop AI Summary Generation Service

## Description
Brief description of changes made

## Changes Made
- [ ] Implemented YouTube transcript extraction
- [ ] Added multi-model AI support
- [ ] Created caching layer
- [ ] Added comprehensive tests

## Testing
- [ ] Unit tests pass
- [ ] Integration tests pass
- [ ] Manual testing completed
- [ ] Coverage > 80%

## Documentation
- [ ] Code documented
- [ ] API docs updated
- [ ] README updated if needed

## Screenshots (if applicable)
[Add screenshots here]
```
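The conventional-commit format can also be enforced mechanically. A minimal sketch of a subject-line check (the helper name is hypothetical; the accepted types are the ones listed in the examples above), which could back a `commit-msg` hook or CI step:

```python
import re

# Matches <type>(<scope>): <description>, with the scope optional.
_COMMIT_RE = re.compile(
    r"^(feat|fix|docs|test|refactor|perf|chore)"  # type
    r"(\([a-z0-9-]+\))?"                          # optional (scope)
    r": .+"                                       # description
)


def is_valid_commit_subject(subject: str) -> bool:
    """Return True if the subject follows the conventional-commit format."""
    return _COMMIT_RE.match(subject) is not None
```

A hook would read the message file passed by git, check its first line with this helper, and exit non-zero on failure.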
## 6. API Design Standards

### RESTful Principles

```python
# Good API design
GET    /api/summaries          # List all summaries
GET    /api/summaries/{id}     # Get specific summary
POST   /api/summaries          # Create new summary
PUT    /api/summaries/{id}     # Update summary
DELETE /api/summaries/{id}     # Delete summary

# Status codes
200 OK                    # Successful GET/PUT
201 Created               # Successful POST
202 Accepted              # Processing async request
204 No Content            # Successful DELETE
400 Bad Request           # Invalid input
401 Unauthorized          # Missing/invalid auth
403 Forbidden             # No permission
404 Not Found             # Resource doesn't exist
429 Too Many Requests     # Rate limited
500 Internal Server Error # Server error
```

### Response Format

Success response:

```json
{
  "success": true,
  "data": {
    "id": "uuid",
    "video_id": "abc123",
    "summary": "...",
    "metadata": {}
  },
  "timestamp": "2025-01-25T10:00:00Z"
}
```

Error response:

```json
{
  "success": false,
  "error": {
    "code": "TRANSCRIPT_NOT_AVAILABLE",
    "message": "Could not extract transcript from video",
    "details": "No captions available in requested language"
  },
  "timestamp": "2025-01-25T10:00:00Z"
}
```

### Pagination

```python
import math

from fastapi import Query


@router.get("/summaries")
async def list_summaries(
    page: int = Query(1, ge=1),
    limit: int = Query(20, ge=1, le=100),
    sort: str = Query("created_at", regex="^(created_at|updated_at|title)$"),
    order: str = Query("desc", regex="^(asc|desc)$")
):
    """List summaries with pagination."""
    return {
        "data": summaries,
        "pagination": {
            "page": page,
            "limit": limit,
            "total": total_count,
            "pages": math.ceil(total_count / limit)
        }
    }
```
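To keep every endpoint serializing the same envelope, the success/error format above can be expressed as Pydantic models. A hedged sketch; the model names (`Envelope`, `ErrorInfo`) are illustrative, not part of the project API:

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field


class ErrorInfo(BaseModel):
    """Machine-readable error block from the error-response format above."""

    code: str
    message: str
    details: Optional[str] = None


class Envelope(BaseModel):
    """Common success/error response envelope (sketch)."""

    success: bool
    data: Optional[Dict[str, Any]] = None
    error: Optional[ErrorInfo] = None
    # Timestamp is filled automatically at construction time
    timestamp: str = Field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```

Usage: `Envelope(success=False, error=ErrorInfo(code="TRANSCRIPT_NOT_AVAILABLE", message="Could not extract transcript from video"))`.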
## 7. Database Operations

### SQLAlchemy Models

```python
from datetime import datetime
import uuid

from sqlalchemy import Column, String, Text, DateTime, Float, JSON
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import UUID

Base = declarative_base()


class Summary(Base):
    __tablename__ = "summaries"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    video_id = Column(String(20), nullable=False, index=True)
    video_url = Column(Text, nullable=False)
    video_title = Column(Text)
    transcript = Column(Text)
    summary = Column(Text)
    key_points = Column(JSON)
    chapters = Column(JSON)
    model_used = Column(String(50))
    processing_time = Column(Float)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    def to_dict(self):
        """Convert to dictionary for API responses."""
        return {
            "id": str(self.id),
            "video_id": self.video_id,
            "video_title": self.video_title,
            "summary": self.summary,
            "key_points": self.key_points,
            "chapters": self.chapters,
            "model_used": self.model_used,
            "created_at": self.created_at.isoformat()
        }
```

### Database Migrations

Use Alembic for migrations:

```bash
# Create new migration
alembic revision --autogenerate -m "Add chapters column"

# Apply migrations
alembic upgrade head

# Rollback
alembic downgrade -1
```

### Query Optimization

```python
from sqlalchemy import select, and_
from sqlalchemy.orm import selectinload


# Efficient querying with eager loading. This assumes a "segments"
# relationship and a user_id column not shown in the model above.
# Note: avoid naming a relationship "metadata" -- that attribute is
# reserved by SQLAlchemy's declarative base.
async def get_summaries_with_metadata(session, user_id: str):
    stmt = (
        select(Summary)
        .options(selectinload(Summary.segments))
        .where(Summary.user_id == user_id)
        .order_by(Summary.created_at.desc())
        .limit(10)
    )
    result = await session.execute(stmt)
    return result.scalars().all()
```
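To see the mapping in action end to end, here is a self-contained sketch that round-trips a trimmed stand-in for the `Summary` model through in-memory SQLite. It deliberately diverges from the production model (sync session, `String(36)` primary key) because the PostgreSQL `UUID` type and async engine do not apply to a quick SQLite demo:

```python
import uuid
from datetime import datetime

from sqlalchemy import Column, DateTime, String, Text, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class SummaryDemo(Base):
    """Trimmed, illustrative stand-in for the Summary model above."""

    __tablename__ = "summaries_demo"

    # String(36) instead of the PostgreSQL UUID type so SQLite works
    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    video_id = Column(String(20), nullable=False, index=True)
    summary = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(SummaryDemo(video_id="abc123", summary="Short recap"))
    session.commit()
    row = session.query(SummaryDemo).filter_by(video_id="abc123").one()
```

The column defaults (`id`, `created_at`) fire on insert, so the caller only supplies the business fields.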
## 8. Performance Guidelines

### Caching Strategy

```python
import hashlib
import json
from typing import Optional

import redis


class CacheService:

    def __init__(self):
        self.redis = redis.Redis(decode_responses=True)
        self.ttl = 3600  # 1 hour default

    def get_key(self, prefix: str, **kwargs) -> str:
        """Generate cache key from parameters."""
        data = json.dumps(kwargs, sort_keys=True)
        hash_digest = hashlib.md5(data.encode()).hexdigest()
        return f"{prefix}:{hash_digest}"

    async def get_or_set(self, key: str, func, ttl: Optional[int] = None):
        """Get from cache or compute and set."""
        # Try cache first
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)

        # Compute result
        result = await func()

        # Cache result
        self.redis.setex(
            key,
            ttl or self.ttl,
            json.dumps(result)
        )
        return result
```

### Async Processing

```python
import asyncio
from typing import Dict, Any

from celery import Celery

celery_app = Celery('youtube_summarizer')


@celery_app.task
def process_video_task(video_url: str, options: Dict[str, Any]):
    """Background task for video processing.

    Celery invokes task functions synchronously, so async helpers are
    driven with asyncio.run() rather than declaring the task async.
    """
    async def _run():
        # Extract transcript
        transcript = await extract_transcript(video_url)

        # Generate summary
        summary = await generate_summary(transcript, options)

        # Save to database
        await save_summary(video_url, summary)
        return summary

    try:
        summary = asyncio.run(_run())
        return {"status": "completed", "summary_id": summary.id}
    except Exception as e:
        return {"status": "failed", "error": str(e)}
```

### Performance Monitoring

```python
import time
from functools import wraps
import logging

logger = logging.getLogger(__name__)


def measure_performance(func):
    """Decorator to measure function performance."""
    @wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = await func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            logger.info(f"{func.__name__} took {elapsed:.3f}s")
            return result
        except Exception as e:
            elapsed = time.perf_counter() - start
            logger.error(f"{func.__name__} failed after {elapsed:.3f}s: {e}")
            raise
    return wrapper
```
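Unbounded `asyncio.gather` fan-out can overwhelm external APIs and trigger the very rate limits the guidelines try to avoid. A hedged sketch of a bounded-concurrency helper (the name `gather_bounded` and default limit are illustrative, not an existing project utility):

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def gather_bounded(
    tasks: Iterable[Callable[[], Awaitable[T]]], limit: int = 5
) -> List[T]:
    """Run coroutine factories concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def _run(factory: Callable[[], Awaitable[T]]) -> T:
        # Each coroutine waits for a semaphore slot before starting
        async with semaphore:
            return await factory()

    # gather preserves input order in its result list
    return await asyncio.gather(*(_run(factory) for factory in tasks))
```

Usage with a hypothetical fetcher: `await gather_bounded([lambda v=v: fetch_transcript(v) for v in video_ids], limit=3)`. Taking factories (zero-argument callables) instead of coroutines keeps work from starting before a slot is free.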
## 9. Security Protocols

### Input Validation

```python
import re

from pydantic import BaseModel, validator, HttpUrl


class VideoURLValidator(BaseModel):
    url: HttpUrl

    @validator('url')
    def validate_youtube_url(cls, v):
        youtube_regex = re.compile(
            r'(https?://)?(www\.)?(youtube\.com|youtu\.be)/.+'
        )
        if not youtube_regex.match(str(v)):
            raise ValueError('Invalid YouTube URL')
        return v
```

### API Key Management

```python
from typing import List, Optional

from pydantic import BaseSettings


class Settings(BaseSettings):
    """Application settings with validation."""

    # API Keys (never hardcode!)
    openai_api_key: str
    anthropic_api_key: str
    youtube_api_key: Optional[str] = None

    # Security
    secret_key: str
    allowed_origins: List[str] = ["http://localhost:3000"]

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"
        case_sensitive = False


settings = Settings()
```

### Rate Limiting

```python
from fastapi import Depends, Request, HTTPException
from fastapi_limiter import FastAPILimiter
from fastapi_limiter.depends import RateLimiter
import redis.asyncio as redis


# Initialize rate limiter
async def init_rate_limiter():
    redis_client = redis.from_url(
        "redis://localhost:6379", encoding="utf-8", decode_responses=True
    )
    await FastAPILimiter.init(redis_client)


# Apply rate limiting
@router.post("/summarize", dependencies=[Depends(RateLimiter(times=10, seconds=60))])
async def summarize_video(request: SummarizeRequest):
    """Rate limited to 10 requests per minute."""
    pass
```

## 10. Deployment Process

### Docker Configuration

```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .
# Run application
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8082"]
```

### Environment Management

```bash
# .env.development
DEBUG=true
DATABASE_URL=sqlite:///./dev.db
LOG_LEVEL=DEBUG

# .env.production
DEBUG=false
DATABASE_URL=postgresql://user:pass@db:5432/youtube_summarizer
LOG_LEVEL=INFO
```

### Health Checks

```python
from datetime import datetime


@router.get("/health")
async def health_check():
    """Health check endpoint for monitoring."""
    checks = {
        "api": "healthy",
        "database": await check_database(),
        "cache": await check_cache(),
        "ai_service": await check_ai_service()
    }

    all_healthy = all(v == "healthy" for v in checks.values())

    return {
        "status": "healthy" if all_healthy else "degraded",
        "checks": checks,
        "timestamp": datetime.utcnow().isoformat()
    }
```

### Monitoring

```python
from fastapi import Response
from prometheus_client import Counter, Histogram, generate_latest

# Metrics
request_count = Counter('youtube_requests_total', 'Total requests')
request_duration = Histogram('youtube_request_duration_seconds', 'Request duration')
summary_generation_time = Histogram('summary_generation_seconds', 'Summary generation time')


@router.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint."""
    return Response(generate_latest(), media_type="text/plain")
```

## Agent-Specific Instructions

### For AI Agents

When working on this codebase:

1. **Always check Task Master first**: `task-master next`
2. **Follow TDD**: Write tests before implementation
3. **Use type hints**: All functions must have type annotations
4. **Document changes**: Update docstrings and comments
5. **Test thoroughly**: Run full test suite before marking complete
6. **Update task status**: Keep Task Master updated with progress

### Quality Checklist

Before marking any task as complete:

- [ ] All tests pass (`pytest tests/`)
- [ ] Code coverage > 80% (`pytest --cov=src`)
- [ ] No linting errors (`ruff check src/`)
- [ ] Type checking passes (`mypy src/`)
- [ ] Documentation updated
- [ ] Task Master updated
- [ ] Changes committed with proper message

## Conclusion

This guide ensures consistent, high-quality development across all contributors to the YouTube Summarizer project. Follow these standards to maintain code quality, performance, and security.

---

*Last Updated: 2025-01-25*
*Version: 1.0.0*