# YouTube Metadata Extraction Service
The YouTube metadata extraction service extracts metadata from YouTube URLs without using the YouTube API. It uses `yt-dlp` (a fork of youtube-dl) to fetch the metadata and stores the result in a PostgreSQL database.
## Features
- ✅ Extract metadata from YouTube URLs using curl/yt-dlp
- ✅ Store metadata in PostgreSQL database
- ✅ Support for various YouTube URL formats
- ✅ CLI commands for easy management
- ✅ Protocol-based architecture for easy testing
- ✅ Comprehensive error handling
- ✅ Health status monitoring
## Installation
### Prerequisites
1. **yt-dlp**: Install the YouTube downloader tool
```bash
# Using pip
pip install yt-dlp
# Using uv (recommended)
uv pip install yt-dlp
# Using system package manager
# Ubuntu/Debian
sudo apt install yt-dlp
# macOS
brew install yt-dlp
```
2. **Database**: Ensure PostgreSQL is running and the database is set up
```bash
# Run database migrations
alembic upgrade head
```
### Dependencies
Install the required Python packages:
```bash
uv pip install -r requirements-youtube.txt
```
## Usage
### CLI Commands
The YouTube service is integrated into the main CLI with the `youtube` command group:
#### Extract Metadata
```bash
# Extract metadata from a YouTube URL
trax youtube extract https://youtube.com/watch?v=dQw4w9WgXcQ
# Force re-extraction even if video exists
trax youtube extract https://youtube.com/watch?v=dQw4w9WgXcQ --force
```
#### List Videos
```bash
# List recent videos (default: 10)
trax youtube list
# List more videos
trax youtube list --limit 20
# Search by title
trax youtube list --search "python tutorial"
# Filter by channel
trax youtube list --channel "Tech Channel"
```
#### Show Video Details
```bash
# Show detailed information for a video
trax youtube show dQw4w9WgXcQ
```
#### Statistics
```bash
# Show YouTube video statistics
trax youtube stats
```
#### Delete Video
```bash
# Delete a video from database
trax youtube delete dQw4w9WgXcQ
```
### Programmatic Usage
```python
import asyncio

from src.services.youtube_service import YouTubeMetadataService
from src.repositories.youtube_repository import YouTubeRepository


async def example():
    # Initialize service
    service = YouTubeMetadataService()
    await service.initialize()

    # Extract and store metadata
    video = await service.extract_and_store_metadata(
        "https://youtube.com/watch?v=dQw4w9WgXcQ"
    )
    print(f"Title: {video.title}")
    print(f"Channel: {video.channel}")
    print(f"Duration: {video.duration_seconds} seconds")

    # Use repository for database operations
    repo = YouTubeRepository()
    videos = await repo.list_all(limit=10)
    stats = await repo.get_statistics()


# Run the example
asyncio.run(example())
```
## Supported URL Formats
The service supports various YouTube URL formats:
- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://www.youtube.com/v/VIDEO_ID`
- URLs with additional parameters (e.g., `&t=30s`)
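As a rough illustration of how these formats can be reduced to a single video ID, here is a minimal sketch using only the standard library. The function name `extract_video_id` and the host list are assumptions made for this example; the service's actual parsing logic may differ.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hosts accepted by this sketch (an assumption, not the service's real list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if it is not recognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None

    if host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # Extra query parameters such as &t=30s are simply ignored.
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        for prefix in ("/embed/", "/v/"):
            if parsed.path.startswith(prefix):
                video_id = parsed.path[len(prefix):].split("/")[0]
                return video_id or None

    return None
```

Returning `None` for unrecognized URLs keeps validation and ID extraction in one place, so a caller can reject bad input before invoking `yt-dlp`.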
## Extracted Metadata
The service extracts the following metadata:
- **YouTube ID**: Unique identifier for the video
- **Title**: Video title
- **Channel**: Uploader/channel name
- **Description**: Video description
- **Duration**: Video length in seconds
- **URL**: Original YouTube URL
- **Metadata Extracted At**: Timestamp of extraction
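To make the field list concrete, the sketch below maps a `yt-dlp` info dictionary (the JSON printed by `yt-dlp --dump-json`) onto these fields. The `VideoMetadata` dataclass and `from_ytdlp_info` helper are hypothetical names for this example; the keys `id`, `title`, `channel`, `uploader`, `description`, and `duration` are standard `yt-dlp` output fields, but the fallback rules shown are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class VideoMetadata:
    """Hypothetical container mirroring the fields listed above."""
    youtube_id: str
    title: str
    channel: str
    description: str
    duration_seconds: int
    url: str
    metadata_extracted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def from_ytdlp_info(info: Dict[str, Any], original_url: str) -> VideoMetadata:
    """Build a VideoMetadata from a yt-dlp info dict (assumed fallbacks)."""
    return VideoMetadata(
        youtube_id=info["id"],
        title=info.get("title") or "",
        # Prefer the channel name, fall back to the uploader name.
        channel=info.get("channel") or info.get("uploader") or "",
        description=info.get("description") or "",
        # yt-dlp may report duration as a float or omit it entirely.
        duration_seconds=int(info.get("duration") or 0),
        url=original_url,
    )
```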
## Database Schema
The metadata is stored in the `youtube_videos` table:
```sql
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
```
## Architecture
The service follows a protocol-based architecture:
### Components
1. **YouTubeMetadataService**: Main service class
- Manages the extraction workflow
- Handles database operations
- Provides health monitoring
2. **CurlYouTubeExtractor**: Metadata extraction implementation
- Uses `yt-dlp` for metadata extraction
- Handles various URL formats
- Provides error handling
3. **YouTubeRepository**: Database operations
- CRUD operations for YouTube videos
- Search and filtering capabilities
- Statistics generation
### Protocols
- `YouTubeMetadataExtractor`: Protocol for metadata extraction
- `YouTubeRepositoryProtocol`: Protocol for repository operations
Because callers depend on these protocols rather than on concrete classes, implementations can be swapped out or replaced with test doubles.
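The following sketch shows how a protocol makes a test double possible without touching the network. Only the protocol name `YouTubeMetadataExtractor` comes from this document; the method name `extract_metadata`, its signature, and the `FakeExtractor` and `fetch_title` helpers are assumptions for illustration.

```python
import asyncio
from typing import Any, Dict, Protocol


class YouTubeMetadataExtractor(Protocol):
    """Structural interface; the method signature is an assumption."""

    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        ...


class FakeExtractor:
    """Test double that returns canned metadata instead of calling yt-dlp."""

    def __init__(self, canned: Dict[str, Any]) -> None:
        self._canned = canned

    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        return {**self._canned, "url": url}


async def fetch_title(extractor: YouTubeMetadataExtractor, url: str) -> str:
    """Hypothetical caller that depends only on the protocol."""
    info = await extractor.extract_metadata(url)
    return info["title"]


if __name__ == "__main__":
    fake = FakeExtractor({"title": "Demo video"})
    print(asyncio.run(fetch_title(fake, "https://youtu.be/VIDEO_ID")))
```

`FakeExtractor` never inherits from the protocol; it satisfies it structurally, which is what lets tests substitute it for the real extractor.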
## Error Handling
The service includes comprehensive error handling:
- **Invalid URLs**: Validates YouTube URL format
- **Network Issues**: Handles connection timeouts
- **yt-dlp Errors**: Captures and logs extraction failures
- **Database Errors**: Handles database connection issues
- **Missing Dependencies**: Checks for required tools
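A minimal sketch of how several of these failure modes might be mapped to exceptions when `yt-dlp` is invoked as a subprocess. The exception classes, function names, and 30-second default timeout are assumptions for this example, not the service's actual code; `--dump-json` and `--no-playlist` are real `yt-dlp` flags.

```python
import json
import shutil
import subprocess
from typing import Any, Dict, List


class ExtractionError(Exception):
    """Hypothetical base class for extraction failures."""


class MissingDependencyError(ExtractionError):
    """Raised when the yt-dlp executable cannot be found."""


def build_command(url: str) -> List[str]:
    # --dump-json prints metadata as JSON without downloading the video;
    # --no-playlist restricts extraction to the single video.
    return ["yt-dlp", "--dump-json", "--no-playlist", url]


def run_ytdlp(url: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Run yt-dlp and return its parsed JSON output."""
    if shutil.which("yt-dlp") is None:
        raise MissingDependencyError("yt-dlp not available")
    try:
        result = subprocess.run(
            build_command(url),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except subprocess.TimeoutExpired as exc:
        raise ExtractionError(f"yt-dlp timed out after {timeout} seconds") from exc
    except subprocess.CalledProcessError as exc:
        detail = (exc.stderr or "").strip()
        raise ExtractionError(f"Failed to extract metadata: {detail}") from exc
    except OSError as exc:
        # Covers cases such as a missing execute permission.
        raise ExtractionError(f"Could not run yt-dlp: {exc}") from exc
    return json.loads(result.stdout)
```

Wrapping every failure in one exception hierarchy lets callers catch `ExtractionError` once while still distinguishing a missing tool from a failed extraction.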
## Testing
Run the tests with:
```bash
# Run all YouTube service tests
uv run pytest tests/test_youtube_service.py -v
# Run specific test class
uv run pytest tests/test_youtube_service.py::TestCurlYouTubeExtractor -v
# Run with coverage
uv run pytest tests/test_youtube_service.py --cov=src.services.youtube_service --cov=src.repositories.youtube_repository
```
## Example Script
Run the example script to see the service in action:
```bash
uv run python examples/youtube_metadata_example.py
```
## Troubleshooting
### Common Issues
1. **yt-dlp not found**
```
Error: yt-dlp not available
```
**Solution**: Install yt-dlp using pip or your system package manager
2. **Database connection error**
```
Error: Could not connect to database
```
**Solution**: Ensure PostgreSQL is running and DATABASE_URL is correct
3. **Video not found**
```
Error: Failed to extract metadata: Video not found
```
**Solution**: Check if the YouTube URL is valid and accessible
4. **Permission denied**
```
Error: Permission denied when running yt-dlp
```
**Solution**: Ensure yt-dlp has execute permissions
### Health Check
Check service health:
```python
from src.services.youtube_service import YouTubeMetadataService

service = YouTubeMetadataService()
health = service.get_health_status()
print(health)
```
This will show:
- Service status
- yt-dlp availability
- Cache directory location
## Performance
- **Extraction Time**: ~2-5 seconds per video (depends on network)
- **Database Operations**: <100ms for most operations
- **Memory Usage**: <50MB for typical usage
- **Concurrent Requests**: Limited by yt-dlp and database connections
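Since concurrency is bounded by `yt-dlp` and the database connection pool, a caller processing several URLs may want to cap the number of extractions in flight. The sketch below does this with `asyncio.Semaphore`. The `extract_many` helper and its `max_concurrent` default are assumptions and not part of the service; `extract_one` stands in for any coroutine function such as `service.extract_and_store_metadata`.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def extract_many(
    urls: Sequence[str],
    extract_one: Callable[[str], Awaitable[Any]],
    max_concurrent: int = 3,
) -> List[Any]:
    """Run extract_one over urls with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> Any:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await extract_one(url)

    # gather preserves the order of the input URLs in its result list.
    return await asyncio.gather(*(guarded(url) for url in urls))
```

If any single extraction raises, `asyncio.gather` propagates the first exception; passing `return_exceptions=True` would collect failures alongside successes instead.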
## Security Considerations
- No API keys required (uses public YouTube data)
- Local caching for performance
- Input validation for URLs
- SQL injection protection via parameterized queries
- No sensitive data stored
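To illustrate why parameterized queries protect against SQL injection, the sketch below uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server. The helper names and demo rows are invented for this example. The placeholder syntax depends on the driver (`?` in `sqlite3`, `%s` in psycopg, `$1` in asyncpg), but the principle is the same.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a subset of the youtube_videos columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE youtube_videos ("
        "youtube_id TEXT PRIMARY KEY, title TEXT NOT NULL, channel TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO youtube_videos (youtube_id, title, channel) VALUES (?, ?, ?)",
        [
            ("abc123", "Python Tutorial", "Tech Channel"),
            ("def456", "Cooking Basics", "Food Channel"),
        ],
    )
    return conn


def find_by_title(conn: sqlite3.Connection, search: str) -> List[Tuple[str, str]]:
    """Search titles; the user input is bound as a parameter, never interpolated."""
    pattern = f"%{search}%"
    cursor = conn.execute(
        "SELECT youtube_id, title FROM youtube_videos "
        "WHERE title LIKE ? ORDER BY title",
        (pattern,),
    )
    return cursor.fetchall()
```

Because the search string is passed separately from the SQL text, input such as `'; DROP TABLE youtube_videos; --` is treated as a literal pattern and matches nothing.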
## Future Enhancements
- [ ] Batch processing for multiple URLs
- [ ] Caching extracted metadata
- [ ] Support for playlists
- [ ] Video thumbnail extraction
- [ ] Automatic metadata refresh
- [ ] Integration with transcription pipeline