# CLAUDE.md - YouTube Summarizer Backend

This file provides guidance to Claude Code when working with the YouTube Summarizer backend services.

## Backend Architecture Overview

The backend is built with FastAPI and follows a clean architecture pattern with clear separation of concerns:

```
backend/
├── api/          # API endpoints and request/response models
├── services/     # Business logic and external integrations
├── models/       # Data models and database schemas
├── core/         # Core utilities, exceptions, and configurations
└── tests/        # Unit and integration tests
```

## Key Services and Components

### Authentication System (Story 3.1 - COMPLETE ✅)

**Architecture**: Production-ready JWT-based authentication with Database Registry singleton pattern

**AuthService** (`services/auth_service.py`)
- JWT token generation and validation (access + refresh tokens)
- Password hashing with bcrypt and strength validation
- User registration with email verification workflow
- Password reset with secure token generation
- Session management and token refresh logic

**Database Registry Pattern** (`core/database_registry.py`)
- **CRITICAL FIX**: Resolves SQLAlchemy "Multiple classes found for path" errors
- Singleton pattern ensuring single Base instance across application
- Automatic model registration preventing table redefinition conflicts
- Thread-safe model management with registry cleanup for testing
- Production-ready architecture preventing relationship resolver issues

**Authentication Models** (`models/user.py`)
- User, RefreshToken, APIKey, EmailVerificationToken, PasswordResetToken
- Fully qualified relationship paths preventing SQLAlchemy conflicts
- String UUID fields for SQLite compatibility
- Proper model inheritance using Database Registry Base

**Authentication API** (`api/auth.py`)
- Complete endpoint coverage: register, login, logout, refresh, verify email, reset password
- Comprehensive input validation and error handling
- Protected route dependencies and middleware
- Async/await patterns throughout

### Dual Transcript Services ✅ **NEW**

**DualTranscriptService** (`services/dual_transcript_service.py`)
- Orchestrates between YouTube captions and Whisper AI transcription
- Supports three extraction modes: `youtube`, `whisper`, `both`
- Parallel processing for comparison mode with real-time progress updates
- Advanced quality comparison with punctuation/capitalization analysis
- Processing time estimation and intelligent recommendation engine
- Seamless integration with existing TranscriptService
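
The punctuation/capitalization comparison mentioned above could look roughly like the sketch below. It is an illustration only: the function names, the equal weighting, and the "one mark per ten words" cap are assumptions, not the scoring that `DualTranscriptService` actually uses.

```python
import re


def punctuation_density(text: str) -> float:
    """Sentence-ending punctuation marks per word."""
    words = text.split()
    if not words:
        return 0.0
    return len(re.findall(r"[.!?]", text)) / len(words)


def capitalization_ratio(text: str) -> float:
    """Share of sentences that start with an uppercase letter."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(s[0].isupper() for s in sentences) / len(sentences)


def quality_score(text: str) -> float:
    """Hypothetical 0..1 score: half punctuation, half capitalization."""
    # Assumed cap: one sentence-ending mark per ten words counts as "fully punctuated".
    punct = min(punctuation_density(text) * 10, 1.0)
    return round(0.5 * punct + 0.5 * capitalization_ratio(text), 3)


def recommend(youtube_text: str, whisper_text: str) -> str:
    """Pick the source with the higher heuristic score (ties go to YouTube)."""
    if quality_score(whisper_text) > quality_score(youtube_text):
        return "whisper"
    return "youtube"
```

Auto-generated captions often arrive as unpunctuated lowercase text, which is why even a crude heuristic like this tends to separate the two sources.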

**WhisperTranscriptService** (`services/whisper_transcript_service.py`)
- OpenAI Whisper integration for high-quality YouTube video transcription
- Async audio download via yt-dlp with automatic cleanup
- Intelligent chunking for long videos (30-minute segments with overlap)
- Device detection (CPU/CUDA) for optimal performance
- Quality and confidence scoring algorithms
- Production-ready error handling and resource management
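
As a rough illustration of the 30-minute chunking with overlap, the sketch below computes segment boundaries in seconds. The 30-second overlap and the function name are assumptions; the actual overlap used by `WhisperTranscriptService` is not stated here.

```python
def chunk_boundaries(duration_s: float,
                     chunk_s: float = 1800.0,
                     overlap_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) pairs covering the audio, each chunk overlapping the previous one."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        # Step back by the overlap so words cut at a boundary appear in both chunks.
        start = end - overlap_s
    return bounds
```

The overlap means adjacent transcripts share a few seconds of audio, so the merge step needs to deduplicate text at the seams.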

### Core Pipeline Services

**IntelligentVideoDownloader** (`services/intelligent_video_downloader.py`) ✅ **NEW**
- **9-Tier Transcript Extraction Fallback Chain**:
   1. YouTube Transcript API - Primary method using official API
   2. Auto-generated Captions - YouTube's automatic captions fallback
   3. Whisper AI Transcription - OpenAI Whisper for high-quality audio transcription
   4. PyTubeFix Downloader - Alternative YouTube library
   5. YT-DLP Downloader - Robust video/audio extraction tool
   6. Playwright Browser - Browser automation for JavaScript-rendered content
   7. External Tools - 4K Video Downloader CLI integration
   8. Web Services - Third-party transcript API services
   9. Transcript-Only - Metadata without full transcript as final fallback
- **Audio Retention System** for re-transcription capability
- **Intelligent method selection** based on success rates
- **Comprehensive error handling** with detailed logging
- **Performance telemetry** and health monitoring
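
A fallback chain like the one listed above can be sketched as an ordered list of async extractors that are tried until one returns a transcript. Everything here (`extract_with_fallback`, the three stand-in extractors, the `TIERS` list) is hypothetical and only illustrates the control flow, not the real `IntelligentVideoDownloader`.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Extractor = Callable[[str], Awaitable[Optional[str]]]


async def extract_with_fallback(video_id: str,
                                tiers: list[tuple[str, Extractor]]):
    """Try each tier in order; return (tier_name, transcript, errors_so_far)."""
    errors: dict[str, str] = {}
    for name, extractor in tiers:
        try:
            transcript = await extractor(video_id)
        except Exception as exc:  # record the failure and move to the next tier
            errors[name] = repr(exc)
            continue
        if transcript:
            return name, transcript, errors
        errors[name] = "empty result"
    raise RuntimeError(f"All tiers failed: {errors}")


# Stand-in extractors used only to demonstrate the flow.
async def failing_api(video_id: str) -> Optional[str]:
    raise ConnectionError("blocked")


async def empty_captions(video_id: str) -> Optional[str]:
    return None


async def fake_whisper(video_id: str) -> Optional[str]:
    return f"transcript for {video_id}"


TIERS = [("youtube_api", failing_api),
         ("auto_captions", empty_captions),
         ("whisper", fake_whisper)]
```

Returning the accumulated `errors` alongside the result makes it easy to log which tiers failed before one succeeded, which is what the troubleshooting steps for extraction failures rely on.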

**SummaryPipeline** (`services/summary_pipeline.py`)
- Main orchestration service for end-to-end video processing
- 7-stage async pipeline: URL validation → metadata extraction → transcript → analysis → summarization → quality validation → completion
- Integrates with IntelligentVideoDownloader for robust transcript extraction
- Intelligent content analysis and configuration optimization
- Real-time progress tracking via WebSocket
- Automatic retry logic with exponential backoff
- Quality scoring and validation system
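
The retry-with-exponential-backoff behaviour can be sketched as a small helper. The helper name, the default delays, and the jitter factor are assumptions chosen for illustration; `max_retries=2` mirrors the `PipelineConfig` default shown in this file.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_retries: int = 2,
                             base_delay: float = 1.0, max_delay: float = 30.0,
                             jitter: float = 0.1, sleep=asyncio.sleep):
    """Await operation(); on failure wait base_delay * 2**attempt (capped) and retry."""
    attempt = 0
    while True:
        try:
            return await operation()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error to the caller
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Small random jitter avoids many jobs retrying at the same instant.
            await sleep(delay + random.uniform(0, delay * jitter))
            attempt += 1
```

The `sleep` parameter exists so tests can substitute a recorder instead of actually waiting.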

**AnthropicSummarizer** (`services/anthropic_summarizer.py`)
- AI service integration using Claude 3.5 Haiku for cost efficiency
- Structured JSON output with fallback text parsing
- Token counting and cost estimation
- Intelligent chunking for long transcripts (up to a 200k-token context window)
- Comprehensive error handling and retry logic

**CacheManager** (`services/cache_manager.py`)
- Multi-level caching for pipeline results, transcripts, and metadata
- TTL-based expiration with automatic cleanup
- Redis-ready architecture for production scaling
- Configurable cache keys with collision prevention
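
TTL-based expiration with namespaced keys might be sketched as below. `TTLCache` and `make_key` are hypothetical names; this is an in-memory illustration and says nothing about the real `CacheManager` API beyond the `default_ttl` idea used elsewhere in this file.

```python
import hashlib
import time


class TTLCache:
    """Minimal in-memory cache where every entry carries its own expiry time."""

    def __init__(self, default_ttl: float = 3600.0, clock=time.monotonic):
        self._store = {}
        self._default_ttl = default_ttl
        self._clock = clock

    @staticmethod
    def make_key(namespace: str, *parts: str) -> str:
        # Hash the parts so arbitrary URLs/options cannot collide with other namespaces.
        digest = hashlib.sha256("\x1f".join(parts).encode()).hexdigest()[:16]
        return f"{namespace}:{digest}"

    def set(self, key: str, value, ttl: float = None) -> None:
        expires = self._clock() + (self._default_ttl if ttl is None else ttl)
        self._store[key] = (expires, value)

    def get(self, key: str, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires, value = entry
        if self._clock() >= expires:
            del self._store[key]  # lazily evict on read
            return default
        return value

    def cleanup(self) -> int:
        """Remove all expired entries and return how many were dropped."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._store.items() if now >= exp]
        for k in expired:
            del self._store[k]
        return len(expired)
```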

**WebSocketManager** (`core/websocket_manager.py`)
- Singleton pattern for WebSocket connection management
- Job-specific connection tracking and broadcasting
- Real-time progress updates and completion notifications
- Heartbeat mechanism and stale connection cleanup
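
Job-specific tracking and broadcasting can be illustrated with a small tracker keyed by job ID. `JobConnectionTracker` and `FakeSocket` are hypothetical; the only assumption about real sockets is an async `send_json` method, which FastAPI/Starlette `WebSocket` objects provide.

```python
import asyncio
from collections import defaultdict


class JobConnectionTracker:
    """Keeps a set of sockets per job and drops sockets that fail on send."""

    def __init__(self):
        self._connections = defaultdict(set)

    def connect(self, job_id: str, websocket) -> None:
        self._connections[job_id].add(websocket)

    def disconnect(self, job_id: str, websocket) -> None:
        self._connections[job_id].discard(websocket)
        if not self._connections[job_id]:
            del self._connections[job_id]

    def count(self, job_id: str) -> int:
        return len(self._connections.get(job_id, ()))

    async def broadcast(self, job_id: str, message: dict) -> int:
        """Send message to every socket watching job_id; return the delivery count."""
        delivered = 0
        for ws in list(self._connections.get(job_id, ())):
            try:
                await ws.send_json(message)
                delivered += 1
            except Exception:
                self.disconnect(job_id, ws)  # treat a failed send as a stale connection
        return delivered


class FakeSocket:
    """Test double standing in for a real WebSocket."""

    def __init__(self, broken: bool = False):
        self.broken = broken
        self.received = []

    async def send_json(self, message: dict) -> None:
        if self.broken:
            raise ConnectionError("socket closed")
        self.received.append(message)
```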

**NotificationService** (`services/notification_service.py`)
- Multi-type notifications (completion, error, progress, system)
- Notification history and statistics tracking
- Architecture ready for email/webhook integration
- Configurable filtering and management

### API Layer

**Pipeline API** (`api/pipeline.py`)
- Complete pipeline management endpoints
- Process video with configuration options
- Status monitoring and job history
- Pipeline cancellation and cleanup
- Health checks and system statistics

**Summarization API** (`api/summarization.py`)
- Direct AI summarization endpoints
- Sync and async processing options
- Cost estimation and validation
- Background job management

**Dual Transcript API** (`api/transcripts.py`) ✅ **NEW**
- `POST /api/transcripts/dual/extract` - Start dual transcript extraction
- `GET /api/transcripts/dual/jobs/{job_id}` - Monitor extraction progress
- `POST /api/transcripts/dual/estimate` - Get processing time estimates
- `GET /api/transcripts/dual/compare/{video_id}` - Force comparison analysis
- Background job processing with real-time progress updates
- YouTube captions, Whisper AI, or both sources simultaneously

## Development Patterns

### Service Dependency Injection

```python
from fastapi import Depends

def get_summary_pipeline(
    video_service: VideoService = Depends(get_video_service),
    transcript_service: TranscriptService = Depends(get_transcript_service),
    ai_service: AnthropicSummarizer = Depends(get_ai_service),
    cache_manager: CacheManager = Depends(get_cache_manager),
    notification_service: NotificationService = Depends(get_notification_service)
) -> SummaryPipeline:
    return SummaryPipeline(...)
```

### Database Registry Pattern (CRITICAL ARCHITECTURE)

**Problem Solved**: SQLAlchemy "Multiple classes found for path" relationship resolver errors

```python
# Always use the registry for model creation
from sqlalchemy.orm import relationship

from backend.core.database_registry import registry
from backend.models.base import Model

# Models inherit from Model (which uses registry.Base)
class User(Model):
    __tablename__ = "users"
    # Use fully qualified relationship paths to prevent conflicts
    summaries = relationship("backend.models.summary.Summary", back_populates="user")

# Registry ensures single Base instance and safe model registration
registry.create_all_tables(engine)   # For table creation
registry.register_model(ModelClass)  # Automatic via BaseModel mixin
```

**Key Benefits**:
- Prevents SQLAlchemy table redefinition conflicts
- Thread-safe singleton pattern
- Automatic model registration and deduplication
- Production-ready architecture
- Clean testing with registry reset capabilities
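
The singleton-plus-deduplication idea behind these benefits can be sketched without SQLAlchemy. `ModelRegistry` below is a hypothetical stand-in, not the contents of `core/database_registry.py`; it only shows a lock-guarded singleton that registers each class once under its fully qualified name and can be reset between tests.

```python
import threading


class ModelRegistry:
    """Process-wide registry: one instance, one entry per fully qualified class name."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                instance = super().__new__(cls)
                instance._models = {}
                cls._instance = instance
        return cls._instance

    def register_model(self, model_cls: type) -> type:
        """Register model_cls, or return the class already stored under the same path."""
        key = f"{model_cls.__module__}.{model_cls.__qualname__}"
        with self._lock:
            return self._models.setdefault(key, model_cls)

    def registered(self) -> list[str]:
        with self._lock:
            return sorted(self._models)

    def reset(self) -> None:
        """Clear all registrations (intended for test teardown)."""
        with self._lock:
            self._models.clear()
```

Keying on the fully qualified path is the same idea as using fully qualified relationship strings: two classes that share a short name in different modules never shadow each other.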

### Authentication Pattern

```python
# Protected endpoint with user dependency
@router.post("/api/protected")
async def protected_endpoint(
    current_user: User = Depends(get_current_user),
    db: Session = Depends(get_db)
):
    return {"user_id": current_user.id}

# Authenticate a user and issue a JWT access token
from backend.services.auth_service import AuthService
auth_service = AuthService()
user = await auth_service.authenticate_user(email, password)
tokens = auth_service.create_access_token(user)
```

### Async Pipeline Pattern

```python
import asyncio
import uuid
from typing import Optional

async def process_video(self, video_url: str, config: Optional[PipelineConfig] = None) -> str:
    job_id = str(uuid.uuid4())
    result = PipelineResult(job_id=job_id, video_url=video_url, ...)
    self.active_jobs[job_id] = result

    # Start background processing and return immediately with the job ID
    asyncio.create_task(self._execute_pipeline(job_id, config))
    return job_id
```
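
One caveat with fire-and-forget tasks: the asyncio documentation notes that the event loop keeps only weak references to tasks, so a task with no other reference may be garbage-collected before it finishes. A minimal sketch of holding references until completion follows; `JobRunner` and its methods are hypothetical and not part of `SummaryPipeline`.

```python
import asyncio
import uuid


class JobRunner:
    """Starts background jobs and keeps a strong reference to each running task."""

    def __init__(self):
        self._tasks = set()
        self.results = {}

    async def _execute(self, job_id: str) -> None:
        await asyncio.sleep(0)  # stand-in for the real pipeline stages
        self.results[job_id] = "completed"

    def submit(self) -> str:
        job_id = str(uuid.uuid4())
        task = asyncio.create_task(self._execute(job_id))
        self._tasks.add(task)                        # strong reference while running
        task.add_done_callback(self._tasks.discard)  # drop it once the task finishes
        return job_id

    async def drain(self) -> None:
        """Wait for all currently running jobs (useful on shutdown and in tests)."""
        await asyncio.gather(*self._tasks)
```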

### Error Handling Pattern

```python
try:
    result = await self.ai_service.generate_summary(request)
except AIServiceError as e:
    raise HTTPException(status_code=500, detail={
        "error": "AI service error",
        "message": e.message,
        "code": e.error_code
    })
```

## Configuration and Environment

### Required Environment Variables

```bash
# Core Services
ANTHROPIC_API_KEY=sk-ant-...               # Required for AI summarization
YOUTUBE_API_KEY=AIza...                    # YouTube Data API v3 key
GOOGLE_API_KEY=AIza...                     # Google/Gemini API key

# Feature Flags
USE_MOCK_SERVICES=false                    # Disable mock services
ENABLE_REAL_TRANSCRIPT_EXTRACTION=true     # Enable real transcript extraction

# Video Download & Storage Configuration
VIDEO_DOWNLOAD_STORAGE_PATH=./video_storage  # Base storage directory
VIDEO_DOWNLOAD_KEEP_AUDIO_FILES=true         # Save audio for re-transcription
VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS=30         # Audio retention period
VIDEO_DOWNLOAD_MAX_STORAGE_GB=10             # Storage limit

# Dual Transcript Configuration
# Whisper AI transcription requires additional dependencies
# (the PyPI package for OpenAI Whisper is "openai-whisper", not "whisper"):
# pip install torch openai-whisper pydub yt-dlp pytubefix
# Optional: CUDA for GPU acceleration

# Optional Configuration
DATABASE_URL=sqlite:///./data/app.db       # Database connection
REDIS_URL=redis://localhost:6379/0         # Cache backend (optional)
LOG_LEVEL=INFO                             # Logging level
CORS_ORIGINS=http://localhost:3000         # Frontend origins
```

### Service Configuration

Services are configured through dependency injection with sensible defaults:

```python
# Cost-optimized AI model
ai_service = AnthropicSummarizer(
    api_key=api_key,
    model="claude-3-5-haiku-20241022"  # Cost-effective choice
)

# Cache with TTL
cache_manager = CacheManager(default_ttl=3600)  # 1 hour default

# Pipeline with retry logic
config = PipelineConfig(
    summary_length="standard",
    quality_threshold=0.7,
    max_retries=2,
    enable_notifications=True
)
```

## Testing Strategy

### Unit Tests
- **Location**: `tests/unit/`
- **Coverage**: 17+ tests for pipeline orchestration
- **Mocking**: All external services mocked
- **Patterns**: Async test patterns with proper fixtures

### Integration Tests
- **Location**: `tests/integration/`
- **Coverage**: 20+ API endpoint scenarios
- **Testing**: Full FastAPI integration with TestClient
- **Validation**: Request/response validation and error handling

### Running Tests

```bash
# From backend directory
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/unit/ -v
PYTHONPATH=/path/to/youtube-summarizer python3 -m pytest tests/integration/ -v

# With coverage
python3 -m pytest tests/ --cov=backend --cov-report=html
```

## Common Development Tasks

### Adding New API Endpoints

1. Create endpoint in appropriate `api/` module
2. Add business logic to `services/` layer
3. Update `main.py` to include router
4. Add unit and integration tests
5. Update API documentation

### Adding New Services

1. Create service class in `services/`
2. Implement proper async patterns
3. Add error handling with custom exceptions
4. Create dependency injection function
5. Add comprehensive unit tests

### Debugging Pipeline Issues

```python
# Enable detailed logging
import logging
logging.getLogger("backend").setLevel(logging.DEBUG)

# Check pipeline status
pipeline = get_summary_pipeline()
result = await pipeline.get_pipeline_result(job_id)
print(f"Status: {result.status}, Error: {result.error}")

# Monitor active jobs
active_jobs = pipeline.get_active_jobs()
print(f"Active jobs: {len(active_jobs)}")
```

## Performance Optimization

### Async Patterns
- All I/O operations use async/await
- Background tasks for long-running operations
- Connection pooling for external services
- Proper exception handling to prevent blocking

### Caching Strategy
- Pipeline results cached for 1 hour
- Transcript and metadata cached separately
- Cache invalidation on video updates
- Redis-ready for distributed caching

### Cost Optimization
- Claude 3.5 Haiku for 80% cost savings vs GPT-4
- Intelligent chunking prevents token waste
- Cost estimation and limits
- Quality scoring to avoid unnecessary retries
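
Cost estimation and limits could be approximated before a request is sent, as in the sketch below. The four-characters-per-token ratio is a rough rule of thumb, the function names are hypothetical, and prices are passed in as parameters because no pricing is stated in this file; none of this reflects the real `AnthropicSummarizer` token counter.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost_usd(transcript: str, max_output_tokens: int,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Upper-bound cost: estimated input tokens plus the full output budget."""
    input_tokens = estimate_tokens(transcript)
    total = (input_tokens * input_price_per_mtok
             + max_output_tokens * output_price_per_mtok)
    return total / 1_000_000


def within_budget(transcript: str, max_output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float,
                  limit_usd: float) -> bool:
    """Return True if the estimated cost does not exceed the configured limit."""
    cost = estimate_cost_usd(transcript, max_output_tokens,
                             input_price_per_mtok, output_price_per_mtok)
    return cost <= limit_usd
```

Checking the estimate before calling the API lets the pipeline fall back to a shorter summary length instead of failing after the money is spent.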

## Security Considerations

### API Security
- API keys loaded from environment variables
- Input validation on all endpoints
- Rate limiting (implement with Redis)
- CORS configuration for frontend origins
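
Since rate limiting is listed as still to be implemented, here is a minimal in-memory sliding-window sketch of the idea. It is a single-process illustration with hypothetical names; a Redis-backed version would keep the per-client timestamps in Redis instead of a local dictionary.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True
```

In an endpoint, a rejected call would typically translate into an HTTP 429 response.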

### Error Sanitization
```python
# Never expose internal errors to clients
try:
    ...  # endpoint logic
except Exception as e:
    logger.error(f"Internal error: {e}")
    raise HTTPException(status_code=500, detail="Internal server error")
```

### Content Validation
```python
# Validate transcript length
if len(request.transcript.strip()) < 50:
    raise HTTPException(status_code=400, detail="Transcript too short")
```

## Monitoring and Observability

### Health Checks
- `/api/health` - Service health status
- `/api/stats` - Pipeline processing statistics
- WebSocket connection monitoring
- Background job tracking

### Logging
- Structured logging with JSON format
- Error tracking with context
- Performance metrics logging
- Request/response logging (without sensitive data)
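
Structured JSON logging can be done with the standard library alone. The formatter below is a sketch under assumptions: the class name and the context field names (`job_id`, `duration_ms`) are hypothetical and are not taken from the backend's actual logging setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields that callers may pass via `extra={...}`.
    CONTEXT_FIELDS = ("job_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def configure_json_logging(logger_name: str = "backend") -> logging.Logger:
    """Attach a JSON-formatting stream handler to the named logger."""
    logger = logging.getLogger(logger_name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

Usage would look like `logger.info("stage finished", extra={"job_id": job_id, "duration_ms": 812})`, which keeps context machine-readable without embedding it in the message string.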

### Metrics
```python
# Built-in metrics
stats = {
    "active_jobs": len(pipeline.get_active_jobs()),
    "cache_stats": await cache_manager.get_cache_stats(),
    "notification_stats": notification_service.get_notification_stats(),
    "websocket_connections": websocket_manager.get_stats()
}
```

## Deployment Considerations

### Production Configuration
- Use Redis for caching and session storage
- Configure proper logging (structured JSON)
- Set up health checks and monitoring
- Use environment-specific configuration
- Enable HTTPS and security headers

### Scaling Patterns
- Stateless design enables horizontal scaling
- Background job processing via task queue
- Database connection pooling
- Load balancer health checks

### Database Migrations
```bash
# When adding database models
alembic revision --autogenerate -m "Add pipeline models"
alembic upgrade head
```

## Troubleshooting

### Common Issues

**"Anthropic API key not configured"**
- Solution: Set `ANTHROPIC_API_KEY` environment variable

**"Mock data returned instead of real transcripts"**
- Check: `USE_MOCK_SERVICES=false` in .env
- Solution: Set `ENABLE_REAL_TRANSCRIPT_EXTRACTION=true`

**"404 Not Found for /api/transcripts/extract"**
- Check: Import statements in main.py
- Solution: Use `from backend.api.transcripts import router` (not transcripts_stub)

**"Radio button selection not working"**
- Issue: Circular state updates in React
- Solution: Use ref tracking in useTranscriptSelector hook

**Pipeline jobs stuck in "processing" state**
- Check: `pipeline.get_active_jobs()` for zombie jobs
- Solution: Restart service or call cleanup endpoint

**WebSocket connections not receiving updates**
- Check: WebSocket connection in browser dev tools
- Solution: Verify WebSocket manager singleton initialization

**High AI costs**
- Check: Summary length configuration and transcript sizes
- Solution: Implement cost limits and brief summary defaults

**Transcript extraction failures**
- Check: IntelligentVideoDownloader fallback chain logs
- Solution: Review which tier failed and check API keys/dependencies

### Debug Commands

```python
# Pipeline debugging
from backend.services.summary_pipeline import SummaryPipeline
pipeline = SummaryPipeline(...)
result = await pipeline.get_pipeline_result("job_id")

# Cache debugging
from backend.services.cache_manager import CacheManager
cache = CacheManager()
stats = await cache.get_cache_stats()

# WebSocket debugging
from backend.core.websocket_manager import websocket_manager
connections = websocket_manager.get_stats()
```

This backend is designed for production use with comprehensive error handling, monitoring, and scalability patterns. All services follow async patterns and clean architecture principles.