# CLAUDE.md - YouTube Summarizer

This file provides guidance to Claude Code (claude.ai/code) when working with the YouTube Summarizer project.

## Project Overview

An AI-powered web application that automatically extracts, transcribes, and summarizes YouTube videos. The application supports multiple AI models (OpenAI, Anthropic, DeepSeek), provides various export formats, and includes intelligent caching for efficiency.

**Status**: Development Ready - All Epic 1 & 2 stories created and ready for implementation

- **Epic 1**: Foundation & Core YouTube Integration (Story 1.1 ✅ Complete, Stories 1.2-1.4 📋 Ready)
- **Epic 2**: AI Summarization Engine (Stories 2.1-2.5 📋 All Created and Ready)
- **Epic 3**: Enhanced User Experience (Future - Ready for story creation)

## Quick Start Commands
```bash
# Development Setup
cd apps/youtube-summarizer
docker-compose up            # Start full development environment

# BMad Method Story Management
/BMad:agents:sm              # Activate Scrum Master agent
*draft                       # Create next story
*story-checklist             # Validate story quality

# Development Agent Implementation
/BMad:agents:dev             # Activate Development agent
# Follow story specifications in docs/stories/

# Direct Development (without BMad agents)
source venv/bin/activate     # Activate virtual environment
python backend/main.py       # Run backend (port 8000)
cd frontend && npm run dev   # Run frontend (port 3000)

# Testing
pytest backend/tests/ -v     # Backend tests
cd frontend && npm test      # Frontend tests

# Git Operations
git add .
git commit -m "feat: implement story 1.2 - URL validation"
git push origin main
```

## Architecture

```
YouTube Summarizer
├── API Layer (FastAPI)
│   ├── /api/summarize     - Submit URL for summarization
│   ├── /api/summary/{id}  - Retrieve summary
│   └── /api/export/{id}   - Export in various formats
├── Service Layer
│   ├── YouTube Service    - Transcript extraction
│   ├── AI Service         - Summary generation
│   └── Cache Service      - Performance optimization
└── Data Layer
    ├── SQLite/PostgreSQL  - Summary storage
    └── Redis (optional)   - Caching layer
```

## Development Workflow - BMad Method

### Story-Driven Development Process

**Current Epic**: Epic 1 - Foundation & Core YouTube Integration

**Current Stories**:
- ✅ Story 1.1: Project Setup and Infrastructure (Completed)
- 📋 Story 1.2: YouTube URL Validation and Parsing (Ready for implementation)
- ⏳ Story 1.3: Transcript Extraction Service (Pending)
- ⏳ Story 1.4: Basic Web Interface (Pending)

### 1. Story Planning (Scrum Master)

```bash
/BMad:agents:sm        # Activate Scrum Master agent
*draft                 # Create next story in sequence
*story-checklist       # Validate story completeness
```

### 2. Story Implementation (Development Agent)

```bash
/BMad:agents:dev       # Activate Development agent
# Review story file: docs/stories/{epic}.{story}.{name}.md
# Follow detailed Dev Notes and architecture references
# Implement all tasks and subtasks as specified
```

### 3. Implementation Locations

Based on architecture and story specifications:

- **Backend API** → `backend/api/`
- **Backend Services** → `backend/services/`
- **Backend Models** → `backend/models/`
- **Frontend Components** → `frontend/src/components/`
- **Frontend Hooks** → `frontend/src/hooks/`
- **Frontend API Client** → `frontend/src/api/`
### 4. Testing Implementation

```bash
# Backend testing (pytest)
pytest backend/tests/unit/test_<module>.py -v
pytest backend/tests/integration/ -v

# Frontend testing (Vitest + RTL)
cd frontend && npm test
cd frontend && npm run test:coverage

# Manual testing
docker-compose up      # Full stack
# Visit http://localhost:3000 (frontend)
# Visit http://localhost:8000/docs (API docs)
```

### 5. Story Completion

- Mark all tasks/subtasks complete in story file
- Update story status from "Draft" to "Done"
- Run story validation checklist
- Update epic progress tracking

## Key Implementation Areas

### YouTube Integration (`src/services/youtube.py`)
```python
# Primary: youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

# Fallback: yt-dlp for metadata
import yt_dlp

# Extract video ID from various URL formats
# Handle multiple subtitle languages
# Implement retry logic for failures
```

### AI Summarization (`src/services/summarizer.py`)

```python
# Multi-model support (sketch; the summarizer classes are defined elsewhere)
class SummarizerService:
    def __init__(self):
        self.models = {
            'openai': OpenAISummarizer(),
            'anthropic': AnthropicSummarizer(),
            'deepseek': DeepSeekSummarizer()
        }

    async def summarize(self, transcript, model='auto'):
        # Implement model selection logic
        # Handle token limits
        # Generate structured summaries
        ...
```

### Caching Strategy (`src/services/cache.py`)

```python
# Cache at multiple levels:
# 1. Transcript cache (by video_id)
# 2. Summary cache (by video_id + model + params)
# 3. Export cache (by summary_id + format)

# Use a hash for cache keys
import hashlib
import json

def get_cache_key(video_id: str, model: str, params: dict) -> str:
    key_data = f"{video_id}:{model}:{json.dumps(params, sort_keys=True)}"
    return hashlib.sha256(key_data.encode()).hexdigest()
```
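
For the in-memory level, a minimal TTL cache can look like the sketch below. The class name and API are assumptions for illustration; Redis (when configured) or `functools.lru_cache` may take its place:

```python
import time
from typing import Any, Optional

class TTLCache:
    """Tiny in-process cache with per-entry expiry (illustrative sketch)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazy eviction on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)
```

Keys from `get_cache_key` slot straight into this: `cache.set(get_cache_key(vid, model, params), summary)`.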

## API Endpoint Patterns

### FastAPI Best Practices

```python
from fastapi import APIRouter, HTTPException, BackgroundTasks
from pydantic import BaseModel, HttpUrl

router = APIRouter(prefix="/api", tags=["summarization"])

class SummarizeRequest(BaseModel):
    url: HttpUrl
    model: str = "auto"
    options: dict = {}

@router.post("/summarize")
async def summarize_video(
    request: SummarizeRequest,
    background_tasks: BackgroundTasks
):
    # Validate URL
    # Extract video ID
    # Check cache
    # Queue for processing if needed
    # Return job ID for status checking
    ...
```

## Database Schema

```sql
-- Main summaries table
CREATE TABLE summaries (
    id UUID PRIMARY KEY,
    video_id VARCHAR(20) NOT NULL,
    video_title TEXT,
    video_url TEXT NOT NULL,
    transcript TEXT,
    summary TEXT,
    key_points JSONB,        -- JSONB is PostgreSQL-specific; use TEXT on SQLite
    chapters JSONB,
    model_used VARCHAR(50),
    processing_time FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Indexes for lookup performance
CREATE INDEX idx_video_id ON summaries(video_id);
CREATE INDEX idx_created_at ON summaries(created_at);
```

## Error Handling

```python
from fastapi.responses import JSONResponse

class YouTubeError(Exception):
    """Base exception for YouTube-related errors"""

class TranscriptNotAvailable(YouTubeError):
    """Raised when transcript cannot be extracted"""

class AIServiceError(Exception):
    """Base exception for AI service errors"""

class TokenLimitExceeded(AIServiceError):
    """Raised when content exceeds model token limit"""

# Global error handler (app is the FastAPI instance)
@app.exception_handler(YouTubeError)
async def youtube_error_handler(request, exc):
    return JSONResponse(
        status_code=400,
        content={"error": str(exc), "type": "youtube_error"}
    )
```

## Environment Variables

```bash
# Required (at least one AI provider key must be set)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
DATABASE_URL=sqlite:///./data/youtube_summarizer.db
SECRET_KEY=your-secret-key

# Optional but recommended
YOUTUBE_API_KEY=AIza...            # For metadata and quota
REDIS_URL=redis://localhost:6379/0
RATE_LIMIT_PER_MINUTE=30
MAX_VIDEO_LENGTH_MINUTES=180
```

## Testing Guidelines

### Unit Test Structure

```python
# tests/unit/test_youtube_service.py
import pytest
from unittest.mock import Mock, patch
from src.services.youtube import YouTubeService

@pytest.fixture
def youtube_service():
    return YouTubeService()

def test_extract_video_id(youtube_service):
    urls = [
        ("https://youtube.com/watch?v=abc123", "abc123"),
        ("https://youtu.be/xyz789", "xyz789"),
        ("https://www.youtube.com/embed/qwe456", "qwe456")
    ]
    for url, expected_id in urls:
        assert youtube_service.extract_video_id(url) == expected_id
```

### Integration Test Pattern

```python
# tests/integration/test_api.py
from fastapi.testclient import TestClient
from src.main import app

client = TestClient(app)

def test_summarize_endpoint():
    response = client.post("/api/summarize", json={
        "url": "https://youtube.com/watch?v=test123",
        "model": "openai"
    })
    assert response.status_code == 200
    assert "job_id" in response.json()
```

## Performance Optimization

1. **Async Everything**: Use async/await for all I/O operations
2. **Background Tasks**: Process summaries in the background
3. **Caching Layers**:
   - Memory cache for hot data
   - Database cache for persistence
   - CDN for static exports
4. **Rate Limiting**: Implement per-IP and per-user limits
5. **Token Optimization**:
   - Chunk long transcripts
   - Use map-reduce for summaries
   - Implement progressive summarization
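
The map-reduce strategy in point 5 can be sketched as below. `summarize_chunk` is a placeholder for whatever model call the service actually makes; the function name and signature are assumptions:

```python
import asyncio
from typing import Awaitable, Callable, List

async def map_reduce_summary(
    chunks: List[str],
    summarize_chunk: Callable[[str], Awaitable[str]],
) -> str:
    """Summarize each chunk concurrently (map), then summarize the
    concatenated partial summaries (reduce). Illustrative sketch."""
    partials = await asyncio.gather(*(summarize_chunk(c) for c in chunks))
    if len(partials) == 1:
        return partials[0]  # no reduce step needed for a single chunk
    return await summarize_chunk("\n".join(partials))
```

For very long videos the reduce step may itself exceed the token limit, which is where the "progressive summarization" item comes in: apply the same function recursively until one summary remains.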
## Security Considerations

1. **Input Validation**: Validate all YouTube URLs
2. **API Key Management**: Use environment variables, never commit keys
3. **Rate Limiting**: Prevent abuse and API exhaustion
4. **CORS Configuration**: Restrict to known domains in production
5. **SQL Injection Prevention**: Use parameterized queries
6. **XSS Protection**: Sanitize all user inputs
7. **Authentication**: Implement JWT for user sessions (Phase 3)
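
Point 5 in practice: always bind values instead of interpolating them into SQL strings. A minimal `sqlite3` sketch (table and column names here are simplified for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summaries (video_id TEXT, summary TEXT)")
conn.execute(
    "INSERT INTO summaries (video_id, summary) VALUES (?, ?)",
    ("dQw4w9WgXcQ", "Example summary"),
)

# Untrusted input goes in as a bound parameter, never formatted into the SQL.
user_supplied = "dQw4w9WgXcQ' OR '1'='1"  # classic injection attempt
row = conn.execute(
    "SELECT summary FROM summaries WHERE video_id = ?",
    (user_supplied,),
).fetchone()
# The injection string matches nothing because it is treated as a literal value.
```

With SQLAlchemy or another ORM the same rule applies: pass values as parameters, never via f-strings.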
## Common Issues and Solutions
|
|
|
|
### Issue: Transcript Not Available
|
|
```python
|
|
# Solution: Implement fallback chain
|
|
try:
|
|
transcript = await get_youtube_transcript(video_id)
|
|
except TranscriptNotAvailable:
|
|
# Try auto-generated captions
|
|
transcript = await get_auto_captions(video_id)
|
|
if not transcript:
|
|
# Use audio transcription as last resort
|
|
transcript = await transcribe_audio(video_id)
|
|
```

### Issue: Token Limit Exceeded

```python
# Solution: implement chunking
def chunk_transcript(transcript, max_tokens=3000):
    chunks = []
    current_chunk = []
    current_tokens = 0

    for segment in transcript:
        segment_tokens = count_tokens(segment)
        # The current_chunk check avoids emitting an empty first chunk
        # when the very first segment already exceeds max_tokens.
        if current_chunk and current_tokens + segment_tokens > max_tokens:
            chunks.append(current_chunk)
            current_chunk = [segment]
            current_tokens = segment_tokens
        else:
            current_chunk.append(segment)
            current_tokens += segment_tokens

    if current_chunk:
        chunks.append(current_chunk)

    return chunks
```
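
`count_tokens` above is assumed to exist as a project helper. Until a real tokenizer (e.g. tiktoken) is wired in, a rough word-count heuristic works as a stand-in:

```python
def count_tokens(text: str) -> int:
    """Rough token estimate: one token per whitespace-separated word.
    Real tokenizers typically report more tokens than this, so budget
    max_tokens conservatively if using this heuristic."""
    return max(1, len(text.split()))
```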

### Issue: Rate Limiting

```python
# Solution: implement exponential backoff
# (RateLimitError is assumed to come from the AI client library in use)
import asyncio
from typing import Any, Awaitable, Callable, Optional

async def retry_with_backoff(
    func: Callable[[], Awaitable[Any]],
    max_retries: int = 3,
    initial_delay: float = 1.0
) -> Optional[Any]:
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return await func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # Exponential backoff
```

## Development Tips

1. **Start with Task 1**: Setup and environment configuration
2. **Test Early**: Write tests as you implement features
3. **Use Type Hints**: Improve code quality and IDE support
4. **Document APIs**: Use FastAPI's automatic documentation
5. **Log Everything**: Implement comprehensive logging for debugging
6. **Cache Aggressively**: Reduce API calls and improve response times
7. **Handle Errors Gracefully**: Provide helpful error messages to users
## Task Master Integration

This project uses Task Master for task management. Key commands:

```bash
# View current progress
task-master list

# Get detailed task info
task-master show 1

# Expand task into subtasks
task-master expand --id=1 --research

# Update task with progress
task-master update-task --id=1 --prompt="Completed API structure"

# Complete task
task-master set-status --id=1 --status=done
```

## BMad Method Documentation Structure

### Core Documentation

- **[Project README](README.md)** - General project information and setup
- **[Architecture](docs/architecture.md)** - Complete technical architecture specification
- **[Front-End Spec](docs/front-end-spec.md)** - UI/UX requirements and component specifications
- **[Original PRD](docs/prd.md)** - Complete product requirements document

### Epic and Story Management

- **[Epic Index](docs/prd/index.md)** - Epic overview and progress tracking
- **[Epic 1](docs/prd/epic-1-foundation-core-youtube-integration.md)** - Foundation epic details
- **[Epic 2](docs/prd/epic-2-ai-summarization-engine.md)** - AI engine epic details
- **[Epic 3](docs/prd/epic-3-enhanced-user-experience.md)** - Advanced features epic
- **[Stories](docs/stories/)** - Individual story implementations

### Current Story Files

**Epic 1 - Foundation (Sprint 1)**:
- **[Story 1.1](docs/stories/1.1.project-setup-infrastructure.md)** - ✅ Project setup (COMPLETED)
- **[Story 1.2](docs/stories/1.2.youtube-url-validation-parsing.md)** - 📋 URL validation (READY)
- **[Story 1.3](docs/stories/1.3.transcript-extraction-service.md)** - 📋 Transcript extraction (READY)
- **[Story 1.4](docs/stories/1.4.basic-web-interface.md)** - 📋 Web interface (READY)

**Epic 2 - AI Engine (Sprints 2-3)**:
- **[Story 2.1](docs/stories/2.1.single-ai-model-integration.md)** - 📋 OpenAI integration (READY)
- **[Story 2.2](docs/stories/2.2.summary-generation-pipeline.md)** - 📋 Pipeline orchestration (READY)
- **[Story 2.3](docs/stories/2.3.caching-system-implementation.md)** - 📋 Caching system (READY)
- **[Story 2.4](docs/stories/2.4.multi-model-support.md)** - 📋 Multi-model AI (READY)
- **[Story 2.5](docs/stories/2.5.export-functionality.md)** - 📋 Export features (READY)
### Development Workflow

1. **Check Epic Progress**: Review [Epic Index](docs/prd/index.md) for current status
2. **Review Next Story**: Read the story file for implementation details
3. **Follow Dev Notes**: Use architecture references and technical specifications
4. **Implement & Test**: Follow story tasks/subtasks systematically
5. **Update Progress**: Mark the story complete and update epic status

### Story-Based Implementation Priority

**Current Focus**: Epic 1 - Foundation & Core YouTube Integration

**Sprint 1 (Weeks 1-2)** - Epic 1 Implementation:
1. **Story 1.2** - YouTube URL Validation and Parsing (8-12 hours) ⬅️ **START HERE**
2. **Story 1.3** - Transcript Extraction Service (16-20 hours)
3. **Story 1.4** - Basic Web Interface (16-24 hours)

**Sprint 2 (Weeks 3-4)** - Epic 2 Core:
4. **Story 2.1** - Single AI Model Integration (12-16 hours)
5. **Story 2.2** - Summary Generation Pipeline (16-20 hours)
6. **Story 2.3** - Caching System Implementation (12-16 hours)

**Sprint 3 (Weeks 5-6)** - Epic 2 Advanced:
7. **Story 2.4** - Multi-Model Support (16-20 hours)
8. **Story 2.5** - Export Functionality (12-16 hours)

**Developer Resources**:
- [Developer Handoff Guide](docs/DEVELOPER_HANDOFF.md) - Start here for implementation
- [Sprint Planning](docs/SPRINT_PLANNING.md) - Detailed sprint breakdown
- [Story Files](docs/stories/) - All stories with complete Dev Notes

---

*This guide is specifically tailored for Claude Code development on the YouTube Summarizer project.*