# AGENTS.md - YouTube Summarizer Development Standards

This document defines development workflows, standards, and best practices for the YouTube Summarizer project. It serves as a guide for both human developers and AI agents working on this codebase.

## 🚨 CRITICAL: Server Status Checking Protocol

**MANDATORY**: Check server status before ANY testing or debugging:

```bash
# 1. ALWAYS CHECK server status FIRST
lsof -i :3002 | grep LISTEN  # Check frontend (expected port)
lsof -i :8000 | grep LISTEN  # Check backend (expected port)

# 2. If servers NOT running, RESTART them
cd /Users/enias/projects/my-ai-projects/apps/youtube-summarizer
./scripts/restart-frontend.sh  # After frontend changes
./scripts/restart-backend.sh   # After backend changes
./scripts/restart-both.sh      # After changes to both

# 3. VERIFY restart was successful
lsof -i :3002 | grep LISTEN  # Should show node process
lsof -i :8000 | grep LISTEN  # Should show python process

# 4. ONLY THEN proceed with testing
```

**Server Checking Rules**:

- ✅ ALWAYS check server status before testing
- ✅ ALWAYS restart servers after code changes
- ✅ ALWAYS verify restart was successful
- ❌ NEVER assume servers are running
- ❌ NEVER test without confirming server status
- ❌ NEVER debug "errors" without checking if server is running
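For scripted checks, the same verification can be done without `lsof`. The snippet below is a minimal sketch (the `check_servers.py` filename is hypothetical); it probes only the two documented ports and exits non-zero when either server is down:

```python
# check_servers.py - hypothetical helper implementing the protocol above.
import socket
import sys

EXPECTED = {"frontend": 3002, "backend": 8000}  # ports documented in this file

def is_listening(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0

if __name__ == "__main__":
    down = [name for name, port in EXPECTED.items() if not is_listening(port)]
    if down:
        print(f"NOT RUNNING: {', '.join(down)} - run the restart scripts first")
        sys.exit(1)
    print("Both servers are listening - safe to test")
```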
## 🚨 CRITICAL: Documentation Preservation Rule

**MANDATORY**: Preserve critical documentation sections:

- ❌ **NEVER** remove critical sections from CLAUDE.md or AGENTS.md
- ❌ **NEVER** delete server checking protocols or development standards
- ❌ **NEVER** remove established workflows or troubleshooting guides
- ❌ **NEVER** delete testing procedures or quality standards
- ✅ **ONLY** remove sections when explicitly instructed by the user
- ✅ **ALWAYS** preserve and enhance existing documentation
## 🚩 CRITICAL: Directory Awareness Protocol

**MANDATORY BEFORE ANY COMMAND**: ALWAYS verify your current working directory before running any command.

```bash
# ALWAYS run this first before ANY command
pwd

# Expected result for YouTube Summarizer:
# /Users/enias/projects/my-ai-projects/apps/youtube-summarizer
```

### Critical Directory Rules

- **NEVER assume** you're in the correct directory
- **ALWAYS verify** with `pwd` before running commands
- **YouTube Summarizer development** requires being in `/Users/enias/projects/my-ai-projects/apps/youtube-summarizer`
- **Backend server** (`python3 backend/main.py`) must be run from the YouTube Summarizer root
- **Frontend development** (`npm run dev`) must be run from the YouTube Summarizer root
- **Database operations** and migrations will fail if run from the wrong directory

### YouTube Summarizer Directory Verification

```bash
# ❌ WRONG - Running from main project or apps directory
cd /Users/enias/projects/my-ai-projects
python3 backend/main.py  # Will fail - backend/ doesn't exist here

cd /Users/enias/projects/my-ai-projects/apps
python3 main.py  # Will fail - no main.py in apps/

# ✅ CORRECT - Always navigate to YouTube Summarizer
cd /Users/enias/projects/my-ai-projects/apps/youtube-summarizer
pwd  # Verify: /Users/enias/projects/my-ai-projects/apps/youtube-summarizer
python3 backend/main.py  # Backend server
# OR
python3 main.py  # Alternative entry point
```
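Project scripts can enforce this protocol programmatically. A minimal sketch, assuming only the project path documented above (the `verify_cwd.py` name is hypothetical):

```python
# verify_cwd.py - hypothetical guard; aborts if not run from the project root.
import sys
from pathlib import Path

EXPECTED_ROOT = Path("/Users/enias/projects/my-ai-projects/apps/youtube-summarizer")

if Path.cwd() != EXPECTED_ROOT:
    print(f"Wrong directory: {Path.cwd()}\nRun: cd {EXPECTED_ROOT}")
    sys.exit(1)
```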
## 🚀 Quick Start for Developers

**All stories are created and ready for implementation!**

1. **Start Here**: [Developer Handoff Guide](docs/DEVELOPER_HANDOFF.md)
2. **Sprint Plan**: [Sprint Planning Document](docs/SPRINT_PLANNING.md)
3. **First Story**: [Story 1.2 - URL Validation](docs/stories/1.2.youtube-url-validation-parsing.md)

**Total Implementation Time**: ~6 weeks (3 sprints)

- Sprint 1: Epic 1 (Foundation) - Stories 1.2-1.4
- Sprint 2: Epic 2 Core - Stories 2.1-2.3
- Sprint 3: Epic 2 Advanced - Stories 2.4-2.5
## Table of Contents

1. [Development Workflow](#1-development-workflow)
2. [Code Standards](#2-code-standards)
3. [Testing Requirements](#3-testing-requirements)
4. [Documentation Standards](#4-documentation-standards)
5. [Git Workflow](#5-git-workflow)
6. [API Design Standards](#6-api-design-standards)
7. [Database Operations](#7-database-operations)
8. [Performance Guidelines](#8-performance-guidelines)
9. [Security Protocols](#9-security-protocols)
10. [Deployment Process](#10-deployment-process)
## 🚨 CRITICAL: Documentation Update Rule

**MANDATORY**: After completing significant coding work, automatically update ALL documentation:

### Documentation Update Protocol

1. **After Feature Implementation** → Update relevant documentation files:
   - **CLAUDE.md** - Development guidance and protocols
   - **AGENTS.md** (this file) - Development standards and workflows
   - **README.md** - User-facing features and setup instructions
   - **CHANGELOG.md** - Version history and changes
   - **FILE_STRUCTURE.md** - Directory structure and file organization

### When to Update Documentation

- ✅ **After implementing new features** → Update all relevant docs
- ✅ **After fixing significant bugs** → Update troubleshooting guides
- ✅ **After changing architecture** → Update CLAUDE.md, AGENTS.md, FILE_STRUCTURE.md
- ✅ **After adding new tools/scripts** → Update CLAUDE.md, AGENTS.md, README.md
- ✅ **After configuration changes** → Update setup documentation
- ✅ **At end of development sessions** → Comprehensive doc review

### Documentation Workflow Integration

```bash
# After completing significant code changes:

# 1. Test changes work
./scripts/restart-backend.sh   # Test backend changes
./scripts/restart-frontend.sh  # Test frontend changes (if needed)

# 2. Update relevant documentation files

# 3. Commit documentation with code changes
git add CLAUDE.md AGENTS.md README.md CHANGELOG.md FILE_STRUCTURE.md
git commit -m "feat: implement feature X with documentation updates"
```

### Documentation Standards

- **Format**: Use clear headings, code blocks, and examples
- **Timeliness**: Update immediately after code changes
- **Completeness**: Cover all user-facing and developer-facing changes
- **Consistency**: Maintain the same format across all documentation files
## 1. Development Workflow

### Story-Driven Development (BMad Method)

All development follows the BMad Method epic and story workflow:

**Current Development Status: READY FOR IMPLEMENTATION**

- **Epic 1**: Foundation & Core YouTube Integration (Story 1.1 ✅ Complete, Stories 1.2-1.4 📋 Ready)
- **Epic 2**: AI Summarization Engine (Stories 2.1-2.5 📋 All Created and Ready)
- **Epic 3**: Enhanced User Experience (Future - Ready for story creation)

**Developer Handoff Complete**: All Epic 1 & 2 stories created with comprehensive Dev Notes.

- See [Developer Handoff Guide](docs/DEVELOPER_HANDOFF.md) for implementation start
- See [Sprint Planning](docs/SPRINT_PLANNING.md) for the 6-week development schedule

#### Story-Based Implementation Process

```bash
# 1. Start with Developer Handoff
cat docs/DEVELOPER_HANDOFF.md  # Complete implementation guide
cat docs/SPRINT_PLANNING.md    # Sprint breakdown

# 2. Get Your Next Story (All stories ready!)
# Sprint 1: Stories 1.2, 1.3, 1.4
# Sprint 2: Stories 2.1, 2.2, 2.3
# Sprint 3: Stories 2.4, 2.5

# 3. Review Story Implementation Requirements
# Read: docs/stories/{story-number}.{name}.md
# Example: docs/stories/1.2.youtube-url-validation-parsing.md
# Study: Dev Notes section with complete code examples
# Check: All tasks and subtasks with time estimates

# 4. Implement Story
# Option A: Use Development Agent
/BMad:agents:dev
# Follow story specifications exactly

# Option B: Direct implementation
# Use code examples from Dev Notes
# Follow file structure specified in story
# Implement tasks in order

# 5. Test Implementation (Comprehensive Test Runner)
./run_tests.sh run-unit --fail-fast             # Ultra-fast feedback (229 tests)
./run_tests.sh run-specific "test_{module}.py"  # Test specific modules
./run_tests.sh run-integration                  # Integration & API tests
./run_tests.sh run-all --coverage               # Full validation with coverage
cd frontend && npm test

# 6. Server Restart Protocol (CRITICAL FOR BACKEND CHANGES)
# ALWAYS restart backend after modifying Python files:
./scripts/restart-backend.sh   # After backend code changes
./scripts/restart-frontend.sh  # After npm installs or config changes
./scripts/restart-both.sh      # Full stack restart
# Frontend HMR handles React changes automatically - no restart needed

# 7. Update Story Progress
# In story file, mark tasks complete:
# - [x] **Task 1: Completed task**
# Update story status: Draft → In Progress → Review → Done

# 8. Move to Next Story
# Check Sprint Planning for next priority
# Repeat process with next story file
```
#### Alternative: Direct Development (Without BMad Agents)

```bash
# 1. Read current story specification
cat docs/stories/1.2.youtube-url-validation-parsing.md

# 2. Follow Dev Notes and architecture references
cat docs/architecture.md     # Technical specifications
cat docs/front-end-spec.md   # UI requirements

# 3. Implement systematically
# Follow tasks/subtasks exactly as specified
# Use provided code examples and patterns

# 4. Test and validate (Test Runner System)
./run_tests.sh run-unit --fail-fast  # Fast feedback during development
./run_tests.sh run-all --coverage    # Complete validation before story completion
cd frontend && npm test
```
### Story Implementation Checklist (BMad Method)

- [ ] **Review Story Requirements**
  - [ ] Read complete story file (`docs/stories/{epic}.{story}.{name}.md`)
  - [ ] Study Dev Notes section with architecture references
  - [ ] Understand all acceptance criteria
  - [ ] Review all tasks and subtasks

- [ ] **Follow Architecture Specifications**
  - [ ] Reference `docs/architecture.md` for technical patterns
  - [ ] Use exact file locations specified in story
  - [ ] Follow error handling patterns from architecture
  - [ ] Implement according to database schema specifications

- [ ] **Write Tests First (TDD)**
  - [ ] Create unit tests based on story testing requirements
  - [ ] Write integration tests for API endpoints
  - [ ] Add frontend component tests where specified
  - [ ] Ensure test coverage meets story requirements

- [ ] **Implement Features Systematically**
  - [ ] Complete tasks in order specified in story
  - [ ] Follow code examples and patterns from Dev Notes
  - [ ] Use exact imports and dependencies specified
  - [ ] Implement error handling as architecturally defined

- [ ] **Validate Implementation**
  - [ ] All acceptance criteria met
  - [ ] All tasks/subtasks completed
  - [ ] Full test suite passes
  - [ ] Integration testing successful

- [ ] **Update Story Progress**
  - [ ] Mark tasks complete in story markdown file
  - [ ] Update story status from "Draft" to "Done"
  - [ ] Add completion notes to Dev Agent Record section
  - [ ] Update epic progress in `docs/prd/index.md`

- [ ] **Commit Changes**
  - [ ] Use story-based commit message format
  - [ ] Reference story number in commit
  - [ ] Include brief implementation summary
## FILE LENGTH - Keep All Files Modular and Focused

### 300 Lines of Code Limit

**CRITICAL RULE**: We must keep all files under 300 LOC.

- **Current Status**: Many files in our codebase break this rule
- **Requirement**: Files must be modular & single-purpose
- **Enforcement**: Before adding any significant functionality, check file length (a helper sketch follows this section)
- **Action Required**: Refactor any file approaching or exceeding 300 lines

```bash
# Check file lengths across project
find . -name "*.py" -not -path "*/venv*/*" -not -path "*/__pycache__/*" -exec wc -l {} + | awk '$1 > 300'
find . \( -name "*.ts" -o -name "*.tsx" \) -not -path "*/node_modules/*" -exec wc -l {} + | awk '$1 > 300'
```

**Modularization Strategies**:

- Extract utility functions into separate modules
- Split large classes into focused, single-responsibility classes
- Move constants and configuration to dedicated files
- Separate concerns: logic, data models, API handlers
- Use composition over inheritance to reduce file complexity

**Examples of Files Needing Refactoring**:

- Large service files → Split into focused service modules
- Complex API routers → Extract handlers to separate modules
- Monolithic components → Break into smaller, composable components
- Combined model files → Separate by entity or domain
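The same check can also run as a pre-commit gate. A minimal Python sketch (the `check_loc.py` name and the skipped directories are assumptions):

```python
# check_loc.py - hypothetical pre-commit gate for the 300 LOC rule.
import sys
from pathlib import Path

LIMIT = 300
SKIP = {"venv", "node_modules", "__pycache__", ".git"}  # assumed skip list

def over_limit(root: Path) -> list[tuple[Path, int]]:
    """Return (path, line_count) for source files exceeding the limit."""
    offenders = []
    for pattern in ("*.py", "*.ts", "*.tsx"):
        for path in root.rglob(pattern):
            if SKIP.intersection(path.parts):
                continue
            count = sum(1 for _ in path.open(encoding="utf-8", errors="ignore"))
            if count > LIMIT:
                offenders.append((path, count))
    return offenders

if __name__ == "__main__":
    bad = over_limit(Path("."))
    for path, count in bad:
        print(f"{count:5d}  {path}")
    sys.exit(1 if bad else 0)
```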
## READING FILES - Never Make Assumptions

### Always Read Files in Full Before Changes

**CRITICAL RULE**: Always read the file in full; do not be lazy.

- **Before making ANY code changes**: Start by finding & reading ALL relevant files
- **Never make changes without reading the entire file**: Understand context, existing patterns, dependencies
- **Read related files**: Check imports, dependencies, and related modules
- **Understand existing architecture**: Follow established patterns and conventions

```bash
# Investigation checklist before any code changes:
# 1. Read the target file completely
# 2. Read all imported modules
# 3. Check related test files
# 4. Review configuration files
# 5. Understand data models and schemas
```

**File Reading Protocol**:

1. **Target File**: Read entire file to understand current implementation
2. **Dependencies**: Read all imported modules and their interfaces
3. **Tests**: Check existing test coverage and patterns
4. **Related Files**: Review files in same directory/module
5. **Configuration**: Check relevant config files and environment variables
6. **Documentation**: Read any related documentation or comments

**Common Mistakes to Avoid**:

- ❌ Making changes based on file names alone
- ❌ Assuming function behavior without reading implementation
- ❌ Not understanding existing error handling patterns
- ❌ Missing important configuration or environment dependencies
- ❌ Ignoring existing test patterns and coverage
## EGO - Engineering Humility and Best Practices

### Do Not Make Assumptions - Consider Multiple Approaches

**CRITICAL MINDSET**: Do not make assumptions. Do not jump to conclusions.

- **Reality Check**: You are just a Large Language Model; you are very limited
- **Engineering Approach**: Always consider multiple different approaches, just like a senior engineer
- **Validate Assumptions**: Test your understanding against the actual codebase
- **Seek Understanding**: When unclear, read more files and investigate thoroughly

**Senior Engineer Mindset**:

1. **Multiple Solutions**: Always consider 2-3 different approaches
2. **Trade-off Analysis**: Evaluate pros/cons of each approach
3. **Existing Patterns**: Follow established codebase patterns
4. **Future Maintenance**: Consider long-term maintainability
5. **Performance Impact**: Consider resource and performance implications
6. **Testing Strategy**: Plan testing approach before implementation

**Before Implementation, Ask**:

- What are 2-3 different ways to solve this?
- What are the trade-offs of each approach?
- How does this fit with existing architecture patterns?
- What could break if this implementation is wrong?
- How would a senior engineer approach this problem?
- What edge cases am I not considering?

**Decision Process**:

1. **Gather Information**: Read all relevant files and understand context
2. **Generate Options**: Consider multiple implementation approaches
3. **Evaluate Trade-offs**: Analyze pros/cons of each option
4. **Check Patterns**: Ensure consistency with existing codebase
5. **Plan Testing**: Design test strategy to validate approach
6. **Implement Incrementally**: Start small, verify, then expand

**Remember Your Limitations**:

- Cannot execute code to verify behavior
- Cannot access external documentation beyond what's provided
- Cannot make network requests or test integrations
- Cannot guarantee code will work without testing
- Limited understanding of complex business logic

**Compensation Strategies**:

- Read more files when uncertain
- Follow established patterns rigorously
- Provide multiple implementation options
- Document assumptions and limitations
- Suggest verification steps for humans
- Request feedback on complex architectural decisions
## Class Library Integration and Usage

### AI Assistant Class Library Reference

This project uses the shared AI Assistant Class Library (`/lib/`), which provides foundational components for AI applications. Always check the class library first before implementing common functionality.

#### Core Library Components Used:

**Service Framework** (`/lib/services/`):

```python
from ai_assistant_lib import BaseService, BaseAIService, ServiceStatus

# Backend services inherit from library base classes
class VideoService(BaseService):
    async def _initialize_impl(self) -> None:
        # Service-specific initialization with lifecycle management
        pass

class AnthropicSummarizer(BaseAIService):
    # Inherits retry logic, caching, rate limiting from library
    async def _make_prediction(self, request: AIRequest) -> AIResponse:
        pass
```
**Repository Pattern** (`/lib/data/repositories/`):

```python
from typing import Optional

from ai_assistant_lib import BaseRepository, TimestampedModel

# Database models use library base classes
class Summary(TimestampedModel):
    # Automatic created_at, updated_at fields
    __tablename__ = 'summaries'

class SummaryRepository(BaseRepository[Summary]):
    # Inherits CRUD operations, filtering, pagination
    async def find_by_video_id(self, video_id: str) -> Optional[Summary]:
        filters = {"video_id": video_id}
        results = await self.find_all(filters=filters, limit=1)
        return results[0] if results else None
```
**Error Handling** (`/lib/core/exceptions/`):

```python
from fastapi import HTTPException

from ai_assistant_lib import ServiceError, RetryableError, ValidationError

# Consistent error handling across the application
try:
    result = await summarizer.generate_summary(transcript)
except RetryableError:
    # Automatic retry handled by library
    pass
except ValidationError as e:
    raise HTTPException(status_code=400, detail=str(e))
```
**Async Utilities** (`/lib/utils/helpers/`):

```python
from ai_assistant_lib import with_retry, with_cache, MemoryCache

# Automatic retry for external API calls
@with_retry(max_attempts=3)
async def extract_youtube_transcript(video_id: str) -> str:
    # Implementation with automatic exponential backoff
    pass

# Caching for expensive operations
cache = MemoryCache(max_size=1000, default_ttl=3600)

@with_cache(cache=cache, key_prefix="transcript")
async def get_cached_transcript(video_id: str) -> str:
    # Expensive transcript extraction cached automatically
    pass
```
#### Project-Specific Usage Patterns:

**Backend API Services** (`backend/services/`):

- `summary_pipeline.py` - Uses `BaseService` for pipeline orchestration
- `anthropic_summarizer.py` - Extends `BaseAIService` for AI integration
- `cache_manager.py` - Uses library caching utilities
- `video_service.py` - Implements service framework patterns

**Data Layer** (`backend/models/`, `backend/core/`):

- `summary.py` - Uses `TimestampedModel` from library
- `user.py` - Inherits from library base models
- `database_registry.py` - Extends library database patterns

**API Layer** (`backend/api/`):

- Exception handling uses library error hierarchy
- Request/response models extend library schemas
- Dependency injection follows library patterns

#### Library Integration Checklist:

Before implementing new functionality:

- [ ] **Check Library First**: Review `/lib/` for existing solutions
- [ ] **Follow Patterns**: Use established library patterns and base classes
- [ ] **Extend, Don't Duplicate**: Extend library classes instead of creating from scratch
- [ ] **Error Handling**: Use library exception hierarchy for consistency
- [ ] **Testing**: Use library test utilities and patterns
#### Common Integration Patterns:

```python
# Service initialization with library framework
async def create_service() -> VideoService:
    service = VideoService("video_processor")
    await service.initialize()  # Lifecycle managed by BaseService
    return service

# Repository operations with library patterns
async def get_summary_data(video_id: str) -> Optional[Summary]:
    repo = SummaryRepository(session, Summary)
    return await repo.find_by_video_id(video_id)

# AI service with library retry and caching
summarizer = AnthropicSummarizer(
    api_key=settings.ANTHROPIC_API_KEY,
    cache_manager=cache_manager,              # From library
    retry_config=RetryConfig(max_attempts=3)  # From library
)
```
## 2. Code Standards

### Python Style Guide

```python
"""
Module docstring describing purpose and usage
"""

from typing import List, Optional, Dict, Any
import asyncio
from datetime import datetime

# Constants in UPPER_CASE
DEFAULT_TIMEOUT = 30
MAX_RETRIES = 3

class YouTubeSummarizer:
    """
    Class for summarizing YouTube videos.

    Attributes:
        model: AI model to use for summarization
        cache: Cache service instance
    """

    def __init__(self, model: str = "openai"):
        """Initialize summarizer with specified model."""
        self.model = model
        self.cache = CacheService()

    async def summarize(
        self,
        video_url: str,
        options: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        """
        Summarize a YouTube video.

        Args:
            video_url: YouTube video URL
            options: Optional summarization parameters

        Returns:
            Dictionary containing summary and metadata

        Raises:
            YouTubeError: If video cannot be accessed
            AIServiceError: If summarization fails
        """
        # Implementation here
        pass
```
### Type Hints

Always use type hints for better code quality:

```python
from typing import Union, List, Optional, Dict, Any, Tuple
from pydantic import BaseModel, HttpUrl

async def process_video(
    url: HttpUrl,
    models: List[str],
    max_length: Optional[int] = None
) -> Tuple[str, Dict[str, Any]]:
    """Process video with type safety."""
    pass
```
### Async/Await Pattern

Use async for all I/O operations:

```python
import asyncio
import aiohttp

async def fetch_transcript(video_id: str) -> str:
    """Fetch transcript asynchronously."""
    url = f"https://transcripts.example/{video_id}"  # placeholder endpoint
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

# Use asyncio.gather for parallel operations
results = await asyncio.gather(
    fetch_transcript(id1),
    fetch_transcript(id2),
    fetch_transcript(id3)
)
```
## 3. Testing Requirements

### Test Runner System

The project includes a production-ready test runner system with **229 discovered unit tests** and intelligent test categorization.

```bash
# Primary Testing Commands
./run_tests.sh run-unit --fail-fast  # Ultra-fast feedback (0.2s discovery)
./run_tests.sh run-all --coverage    # Complete test suite
./run_tests.sh run-integration       # Integration & API tests
cd frontend && npm test              # Frontend tests
```

### Test Coverage Requirements

- Minimum 80% code coverage
- 100% coverage for critical paths
- All edge cases tested
- Error conditions covered
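As a concrete illustration of these requirements, a unit test for URL validation might look like the sketch below (the module path and `extract_video_id` function are assumptions; the real layout comes from Story 1.2):

```python
# tests/unit/test_url_validation.py - illustrative sketch only.
import pytest

from backend.services.url_validator import extract_video_id  # assumed module

@pytest.mark.parametrize("url,expected", [
    ("https://youtube.com/watch?v=dQw4w9WgXcQ", "dQw4w9WgXcQ"),
    ("https://youtu.be/dQw4w9WgXcQ", "dQw4w9WgXcQ"),
])
def test_extract_video_id_valid(url, expected):
    # Happy path: both canonical URL forms resolve to the same ID
    assert extract_video_id(url) == expected

def test_extract_video_id_rejects_non_youtube():
    # Error condition coverage, per the requirements above
    with pytest.raises(ValueError):
        extract_video_id("https://example.com/watch?v=abc")
```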
**📖 Complete Testing Guide**: See [TESTING-INSTRUCTIONS.md](TESTING-INSTRUCTIONS.md) for comprehensive testing standards, procedures, examples, and troubleshooting.
## 4. Documentation Standards

### Code Documentation

Every module, class, and function must have docstrings:

```python
"""
Module: YouTube Transcript Extractor

This module provides functionality to extract transcripts from YouTube videos
using multiple fallback methods.

Example:
    >>> extractor = TranscriptExtractor()
    >>> transcript = await extractor.extract("video_id")
"""

def extract_transcript(
    video_id: str,
    language: str = "en",
    include_auto_generated: bool = True
) -> List[Dict[str, Any]]:
    """
    Extract transcript from YouTube video.

    This function attempts to extract transcripts using the following priority:
    1. Manual captions in specified language
    2. Auto-generated captions if allowed
    3. Translated captions as fallback

    Args:
        video_id: YouTube video identifier
        language: ISO 639-1 language code (default: "en")
        include_auto_generated: Whether to use auto-generated captions

    Returns:
        List of transcript segments with text, start time, and duration

    Raises:
        TranscriptNotAvailable: If no transcript can be extracted

    Example:
        >>> transcript = extract_transcript("dQw4w9WgXcQ", "en")
        >>> print(transcript[0])
        {"text": "Never gonna give you up", "start": 0.0, "duration": 3.5}
    """
    pass
```
### API Documentation

Use FastAPI's automatic documentation features:

```python
from typing import Optional

from fastapi import APIRouter, HTTPException, status
from pydantic import BaseModel, Field

router = APIRouter()

class SummarizeRequest(BaseModel):
    """Request model for video summarization."""

    url: str = Field(
        ...,
        description="YouTube video URL",
        example="https://youtube.com/watch?v=dQw4w9WgXcQ"
    )
    model: str = Field(
        "auto",
        description="AI model to use (openai, anthropic, deepseek, auto)",
        example="openai"
    )
    max_length: Optional[int] = Field(
        None,
        description="Maximum summary length in words",
        ge=50,
        le=5000
    )

@router.post(
    "/summarize",
    response_model=SummarizeResponse,
    status_code=status.HTTP_200_OK,
    summary="Summarize YouTube Video",
    description="Submit a YouTube video URL for AI-powered summarization"
)
async def summarize_video(request: SummarizeRequest):
    """
    Summarize a YouTube video using AI.

    This endpoint accepts a YouTube URL and returns a job ID for tracking
    the summarization progress. Use the /summary/{job_id} endpoint to
    retrieve the completed summary.
    """
    pass
```
## 5. Git Workflow

### Branch Naming

```bash
# Feature branches
feature/task-2-youtube-extraction
feature/task-3-ai-summarization

# Bugfix branches
bugfix/transcript-encoding-error
bugfix/rate-limit-handling

# Hotfix branches
hotfix/critical-api-error
```
### Commit Messages

Follow conventional commits:

```bash
# Format: <type>(<scope>): <subject>

# Examples:
feat(youtube): add transcript extraction service
fix(api): handle rate limiting correctly
docs(readme): update installation instructions
test(youtube): add edge case tests
refactor(cache): optimize cache key generation
perf(summarizer): implement parallel processing
chore(deps): update requirements.txt
```
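The format can be enforced with a `commit-msg` hook. A minimal sketch (the hook wiring and the allowed type list are assumptions drawn from the examples above):

```python
#!/usr/bin/env python3
# .git/hooks/commit-msg - hypothetical hook enforcing conventional commits.
import re
import sys

# Allowed types mirror the examples above; scope is optional.
PATTERN = re.compile(
    r"^(feat|fix|docs|test|refactor|perf|chore)(\([a-z0-9-]+\))?: .+"
)

if __name__ == "__main__":
    # Git passes the path to the commit message file as the first argument.
    message = open(sys.argv[1], encoding="utf-8").readline().strip()
    if not PATTERN.match(message):
        print(f"Invalid commit message: {message!r}")
        print("Expected: <type>(<scope>): <subject>")
        sys.exit(1)
```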
### Pull Request Template

```markdown
## Task Reference
- Task ID: #3
- Task Title: Develop AI Summary Generation Service

## Description
Brief description of changes made

## Changes Made
- [ ] Implemented YouTube transcript extraction
- [ ] Added multi-model AI support
- [ ] Created caching layer
- [ ] Added comprehensive tests

## Testing
- [ ] Unit tests pass
- [ ] Integration tests pass
- [ ] Manual testing completed
- [ ] Coverage > 80%

## Documentation
- [ ] Code documented
- [ ] API docs updated
- [ ] README updated if needed

## Screenshots (if applicable)
[Add screenshots here]
```
## 6. API Design Standards

### RESTful Principles

```text
# Good API design
GET    /api/summaries       # List all summaries
GET    /api/summaries/{id}  # Get specific summary
POST   /api/summaries       # Create new summary
PUT    /api/summaries/{id}  # Update summary
DELETE /api/summaries/{id}  # Delete summary

# Status codes
200 OK                 # Successful GET/PUT
201 Created            # Successful POST
202 Accepted           # Processing async request
204 No Content         # Successful DELETE
400 Bad Request        # Invalid input
401 Unauthorized       # Missing/invalid auth
403 Forbidden          # No permission
404 Not Found          # Resource doesn't exist
429 Too Many Requests  # Rate limited
500 Internal Error     # Server error
```
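Mapped onto FastAPI, the resource above might be wired like this sketch (handler names and bodies are illustrative, not the project's actual router):

```python
from fastapi import APIRouter, status

router = APIRouter(prefix="/api/summaries")

@router.get("")                                   # List all summaries
async def list_summaries(): ...

@router.get("/{summary_id}")                      # Get specific summary
async def get_summary(summary_id: str): ...

@router.post("", status_code=status.HTTP_201_CREATED)
async def create_summary(): ...

@router.put("/{summary_id}")                      # Update summary
async def update_summary(summary_id: str): ...

@router.delete("/{summary_id}", status_code=status.HTTP_204_NO_CONTENT)
async def delete_summary(summary_id: str): ...
```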
### Response Format

**Success response**:

```json
{
  "success": true,
  "data": {
    "id": "uuid",
    "video_id": "abc123",
    "summary": "...",
    "metadata": {}
  },
  "timestamp": "2025-01-25T10:00:00Z"
}
```

**Error response**:

```json
{
  "success": false,
  "error": {
    "code": "TRANSCRIPT_NOT_AVAILABLE",
    "message": "Could not extract transcript from video",
    "details": "No captions available in requested language"
  },
  "timestamp": "2025-01-25T10:00:00Z"
}
```
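These envelopes can be expressed as Pydantic models so every endpoint returns the same shape. A minimal sketch, assuming the Pydantic v1 style used elsewhere in this document (the model names are hypothetical):

```python
from datetime import datetime
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field

class ErrorInfo(BaseModel):
    code: str
    message: str
    details: Optional[str] = None

class APIResponse(BaseModel):
    """Shared envelope matching the success/error formats above."""
    success: bool
    data: Optional[Dict[str, Any]] = None
    error: Optional[ErrorInfo] = None
    timestamp: datetime = Field(default_factory=datetime.utcnow)

# Usage:
# return APIResponse(success=True, data=summary.to_dict())
# return APIResponse(success=False, error=ErrorInfo(
#     code="TRANSCRIPT_NOT_AVAILABLE",
#     message="Could not extract transcript from video"))
```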
### Pagination

```python
import math

from fastapi import Query

@router.get("/summaries")
async def list_summaries(
    page: int = Query(1, ge=1),
    limit: int = Query(20, ge=1, le=100),
    sort: str = Query("created_at", regex="^(created_at|updated_at|title)$"),
    order: str = Query("desc", regex="^(asc|desc)$")
):
    """List summaries with pagination."""
    return {
        "data": summaries,
        "pagination": {
            "page": page,
            "limit": limit,
            "total": total_count,
            "pages": math.ceil(total_count / limit)
        }
    }
```
## 7. Database Operations

### SQLAlchemy Models

```python
import uuid
from datetime import datetime

from sqlalchemy import Column, String, Text, DateTime, Float, JSON
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import UUID

Base = declarative_base()

class Summary(Base):
    __tablename__ = "summaries"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    video_id = Column(String(20), nullable=False, index=True)
    video_url = Column(Text, nullable=False)
    video_title = Column(Text)
    transcript = Column(Text)
    summary = Column(Text)
    key_points = Column(JSON)
    chapters = Column(JSON)
    model_used = Column(String(50))
    processing_time = Column(Float)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    def to_dict(self):
        """Convert to dictionary for API responses."""
        return {
            "id": str(self.id),
            "video_id": self.video_id,
            "video_title": self.video_title,
            "summary": self.summary,
            "key_points": self.key_points,
            "chapters": self.chapters,
            "model_used": self.model_used,
            "created_at": self.created_at.isoformat()
        }
```
### Database Migrations

Use Alembic for migrations:

```bash
# Create new migration
alembic revision --autogenerate -m "Add chapters column"

# Apply migrations
alembic upgrade head

# Rollback
alembic downgrade -1
```
### Query Optimization

```python
from sqlalchemy import select
from sqlalchemy.orm import selectinload

# Efficient querying with eager loading
async def get_summaries_with_metadata(session, user_id: str):
    stmt = (
        select(Summary)
        # "metadata" is reserved by SQLAlchemy's declarative base, so the
        # related-data relationship is named video_metadata here
        .options(selectinload(Summary.video_metadata))
        .where(Summary.user_id == user_id)
        .order_by(Summary.created_at.desc())
        .limit(10)
    )

    result = await session.execute(stmt)
    return result.scalars().all()
```
## 8. Performance Guidelines

### Caching Strategy

```python
import hashlib
import json
from typing import Optional

import redis

class CacheService:
    def __init__(self):
        self.redis = redis.Redis(decode_responses=True)
        self.ttl = 3600  # 1 hour default

    def get_key(self, prefix: str, **kwargs) -> str:
        """Generate cache key from parameters."""
        data = json.dumps(kwargs, sort_keys=True)
        hash_digest = hashlib.md5(data.encode()).hexdigest()
        return f"{prefix}:{hash_digest}"

    async def get_or_set(self, key: str, func, ttl: Optional[int] = None):
        """Get from cache or compute and set."""
        # Try cache first
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)

        # Compute result
        result = await func()

        # Cache result
        self.redis.setex(
            key,
            ttl or self.ttl,
            json.dumps(result)
        )

        return result
```
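Putting the two helpers together, a typical call site looks like this sketch (the `fetch_summary` coroutine is a stand-in for the real pipeline call):

```python
cache = CacheService()

async def get_summary_cached(video_id: str) -> dict:
    """Deterministic key from parameters, compute-on-miss via get_or_set."""
    key = cache.get_key("summary", video_id=video_id)

    async def fetch_summary() -> dict:  # stand-in for the real pipeline call
        return {"video_id": video_id, "summary": "..."}

    return await cache.get_or_set(key, fetch_summary, ttl=1800)
```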
### Async Processing

```python
import asyncio
from typing import Dict, Any

from celery import Celery

celery_app = Celery('youtube_summarizer')

@celery_app.task
def process_video_task(video_url: str, options: Dict[str, Any]):
    """Background task for video processing.

    Celery workers call tasks synchronously, so the async helpers are
    driven with asyncio.run() inside the task body.
    """
    async def _pipeline():
        # Extract transcript
        transcript = await extract_transcript(video_url)
        # Generate summary
        summary = await generate_summary(transcript, options)
        # Save to database
        await save_summary(video_url, summary)
        return summary

    try:
        summary = asyncio.run(_pipeline())
        return {"status": "completed", "summary_id": summary.id}
    except Exception as e:
        return {"status": "failed", "error": str(e)}
```
### Performance Monitoring

```python
import time
from functools import wraps
import logging

logger = logging.getLogger(__name__)

def measure_performance(func):
    """Decorator to measure function performance."""
    @wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = await func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            logger.info(f"{func.__name__} took {elapsed:.3f}s")
            return result
        except Exception as e:
            elapsed = time.perf_counter() - start
            logger.error(f"{func.__name__} failed after {elapsed:.3f}s: {e}")
            raise
    return wrapper
```
## 9. Security Protocols

### Input Validation

```python
from pydantic import BaseModel, validator, HttpUrl
import re

class VideoURLValidator(BaseModel):
    url: HttpUrl

    @validator('url')
    def validate_youtube_url(cls, v):
        youtube_regex = re.compile(
            r'(https?://)?(www\.)?(youtube\.com|youtu\.be)/.+'
        )
        if not youtube_regex.match(str(v)):
            raise ValueError('Invalid YouTube URL')
        return v
```
### API Key Management

```python
from typing import List, Optional

from pydantic import BaseSettings

class Settings(BaseSettings):
    """Application settings with validation."""

    # API Keys (never hardcode!)
    openai_api_key: str
    anthropic_api_key: str
    youtube_api_key: Optional[str] = None

    # Security
    secret_key: str
    allowed_origins: List[str] = ["http://localhost:3000"]

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"
        case_sensitive = False

settings = Settings()
```
### Rate Limiting

```python
from fastapi import Depends
from fastapi_limiter import FastAPILimiter
from fastapi_limiter.depends import RateLimiter
import redis.asyncio as redis

# Initialize rate limiter
async def init_rate_limiter():
    redis_client = redis.from_url("redis://localhost:6379", encoding="utf-8", decode_responses=True)
    await FastAPILimiter.init(redis_client)

# Apply rate limiting
@router.post("/summarize", dependencies=[Depends(RateLimiter(times=10, seconds=60))])
async def summarize_video(request: SummarizeRequest):
    """Rate limited to 10 requests per minute."""
    pass
```
## 10. Deployment Process

### Docker Configuration

```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Run application
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8082"]
```
### Environment Management

```bash
# .env.development
DEBUG=true
DATABASE_URL=sqlite:///./dev.db
LOG_LEVEL=DEBUG

# .env.production
DEBUG=false
DATABASE_URL=postgresql://user:pass@db:5432/youtube_summarizer
LOG_LEVEL=INFO
```
### Health Checks

```python
@router.get("/health")
async def health_check():
    """Health check endpoint for monitoring."""
    checks = {
        "api": "healthy",
        "database": await check_database(),
        "cache": await check_cache(),
        "ai_service": await check_ai_service()
    }

    all_healthy = all(v == "healthy" for v in checks.values())

    return {
        "status": "healthy" if all_healthy else "degraded",
        "checks": checks,
        "timestamp": datetime.utcnow().isoformat()
    }
```
### Monitoring

```python
from fastapi import Response
from prometheus_client import Counter, Histogram, generate_latest

# Metrics
request_count = Counter('youtube_requests_total', 'Total requests')
request_duration = Histogram('youtube_request_duration_seconds', 'Request duration')
summary_generation_time = Histogram('summary_generation_seconds', 'Summary generation time')

@router.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint."""
    return Response(generate_latest(), media_type="text/plain")
```
## Agent-Specific Instructions

### For AI Agents

When working on this codebase:

1. **Always check Task Master first**: `task-master next`
2. **Follow TDD**: Write tests before implementation
3. **Use type hints**: All functions must have type annotations
4. **Document changes**: Update docstrings and comments
5. **Test thoroughly**: Run full test suite before marking complete
6. **Update task status**: Keep Task Master updated with progress

### Quality Checklist

Before marking any task as complete:

- [ ] All tests pass (`./run_tests.sh run-all`)
- [ ] Code coverage > 80% (`./run_tests.sh run-all --coverage`)
- [ ] Unit tests pass with fast feedback (`./run_tests.sh run-unit --fail-fast`)
- [ ] Integration tests validated (`./run_tests.sh run-integration`)
- [ ] Frontend tests pass (`cd frontend && npm test`)
- [ ] No linting errors (`ruff check src/`)
- [ ] Type checking passes (`mypy src/`)
- [ ] Documentation updated
- [ ] Task Master updated
- [ ] Changes committed with proper message

**📖 Testing Details**: See [TESTING-INSTRUCTIONS.md](TESTING-INSTRUCTIONS.md) for complete testing procedures and standards.
## Conclusion

This guide ensures consistent, high-quality development across all contributors to the YouTube Summarizer project. Follow these standards to maintain code quality, performance, and security.

---

*Last Updated: 2025-01-25*
*Version: 1.0.0*