# YouTube Metadata Extraction Service

The YouTube metadata extraction service extracts metadata from YouTube URLs without using the YouTube API. It uses `yt-dlp` (a fork of youtube-dl) to fetch the metadata and stores it in a PostgreSQL database.

## Features

- ✅ Extract metadata from YouTube URLs using curl/yt-dlp
- ✅ Store metadata in PostgreSQL database
- ✅ Support for various YouTube URL formats
- ✅ CLI commands for easy management
- ✅ Protocol-based architecture for easy testing
- ✅ Comprehensive error handling
- ✅ Health status monitoring

## Installation

### Prerequisites

1. **yt-dlp**: Install the YouTube downloader tool
   ```bash
   # Using pip
   pip install yt-dlp

   # Using uv (recommended)
   uv pip install yt-dlp

   # Using system package manager
   # Ubuntu/Debian
   sudo apt install yt-dlp

   # macOS
   brew install yt-dlp
   ```

2. **Database**: Ensure PostgreSQL is running and the database is set up
   ```bash
   # Run database migrations
   alembic upgrade head
   ```

### Dependencies

Install the required Python packages:
```bash
uv pip install -r requirements-youtube.txt
```

## Usage

### CLI Commands

The YouTube service is integrated into the main CLI under the `youtube` command group:

#### Extract Metadata
```bash
# Extract metadata from a YouTube URL (quote the URL so the shell does not interpret ? or &)
trax youtube extract "https://youtube.com/watch?v=dQw4w9WgXcQ"

# Force re-extraction even if the video already exists
trax youtube extract "https://youtube.com/watch?v=dQw4w9WgXcQ" --force
```

#### List Videos
```bash
# List recent videos (default: 10)
trax youtube list

# List more videos
trax youtube list --limit 20

# Search by title
trax youtube list --search "python tutorial"

# Filter by channel
trax youtube list --channel "Tech Channel"
```

#### Show Video Details
```bash
# Show detailed information for a video
trax youtube show dQw4w9WgXcQ
```

#### Statistics
```bash
# Show YouTube video statistics
trax youtube stats
```

#### Delete Video
```bash
# Delete a video from database
trax youtube delete dQw4w9WgXcQ
```

### Programmatic Usage

```python
import asyncio
from src.services.youtube_service import YouTubeMetadataService
from src.repositories.youtube_repository import YouTubeRepository

async def example():
    # Initialize service
    service = YouTubeMetadataService()
    await service.initialize()

    # Extract and store metadata
    video = await service.extract_and_store_metadata(
        "https://youtube.com/watch?v=dQw4w9WgXcQ"
    )

    print(f"Title: {video.title}")
    print(f"Channel: {video.channel}")
    print(f"Duration: {video.duration_seconds} seconds")

    # Use repository for database operations
    repo = YouTubeRepository()
    videos = await repo.list_all(limit=10)
    stats = await repo.get_statistics()

# Run the example
asyncio.run(example())
```

## Supported URL Formats

The service supports various YouTube URL formats:

- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://www.youtube.com/v/VIDEO_ID`
- URLs with additional parameters (e.g., `&t=30s`)

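
The URL shapes listed above can be reduced to a video ID with the standard library alone. The following is a minimal sketch, not the service's actual parser: the function name `extract_video_id` and the accepted host set are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hosts accepted by this sketch (assumption); the real service may accept others.
_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID for the URL shapes listed above, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # https://youtu.be/VIDEO_ID
        candidate = parsed.path.lstrip("/").split("/")[0]
        return candidate or None

    if host in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # Extra parameters such as &t=30s are simply ignored.
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        for prefix in ("/embed/", "/v/"):
            if parsed.path.startswith(prefix):
                candidate = parsed.path[len(prefix):].split("/")[0]
                return candidate or None

    # Not a recognized YouTube URL.
    return None
```

Returning `None` instead of raising keeps the helper easy to use in validation code; a caller can turn `None` into whatever "invalid URL" error the service reports.
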
## Extracted Metadata

The service extracts the following metadata:

- **YouTube ID**: Unique identifier for the video
- **Title**: Video title
- **Channel**: Uploader/channel name
- **Description**: Video description
- **Duration**: Video length in seconds
- **URL**: Original YouTube URL
- **Metadata Extracted At**: Timestamp of extraction

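
To make the field list concrete, here is a hedged sketch of how a yt-dlp info dictionary (the JSON that `yt-dlp --dump-json` prints) might be mapped onto these fields. The `VideoMetadata` dataclass and `from_ytdlp_info` function are hypothetical names, not the project's model; the keys `id`, `title`, `channel`, `uploader`, `description`, and `duration` are standard yt-dlp output fields.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


@dataclass
class VideoMetadata:  # hypothetical container, not the project's model class
    youtube_id: str
    title: str
    channel: str
    description: Optional[str]
    duration_seconds: int
    url: str
    metadata_extracted_at: datetime


def from_ytdlp_info(info: Mapping[str, Any], original_url: str) -> VideoMetadata:
    """Build a VideoMetadata from a yt-dlp info dictionary."""
    return VideoMetadata(
        youtube_id=info["id"],
        title=info["title"],
        # Fall back to the uploader name when no channel name is present.
        channel=info.get("channel") or info.get("uploader") or "",
        description=info.get("description"),
        # yt-dlp may report the duration as a float; store whole seconds.
        duration_seconds=int(info.get("duration") or 0),
        url=original_url,
        metadata_extracted_at=datetime.now(timezone.utc),
    )
```
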
## Database Schema

The metadata is stored in the `youtube_videos` table:

```sql
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
```

## Architecture

The service follows a protocol-based architecture:

### Components

1. **YouTubeMetadataService**: Main service class
   - Manages the extraction workflow
   - Handles database operations
   - Provides health monitoring

2. **CurlYouTubeExtractor**: Metadata extraction implementation
   - Uses `yt-dlp` for metadata extraction
   - Handles various URL formats
   - Provides error handling

3. **YouTubeRepository**: Database operations
   - CRUD operations for YouTube videos
   - Search and filtering capabilities
   - Statistics generation

### Protocols

- `YouTubeMetadataExtractor`: Protocol for metadata extraction
- `YouTubeRepositoryProtocol`: Protocol for repository operations

Because components depend on these protocols rather than on concrete classes, implementations can be swapped out or replaced with test doubles.

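
As an illustration of why protocols help with testing, the sketch below defines a structural interface with `typing.Protocol` and satisfies it with a fake that never touches the network. The names `MetadataExtractor`, `FakeExtractor`, `fetch_title`, and the method `extract_metadata` are assumptions for this example; the project's `YouTubeMetadataExtractor` protocol may declare different methods.

```python
import asyncio
from typing import Any, Dict, List, Protocol


class MetadataExtractor(Protocol):  # hypothetical stand-in for YouTubeMetadataExtractor
    async def extract_metadata(self, url: str) -> Dict[str, Any]: ...


class FakeExtractor:
    """Test double: returns canned metadata and records the URLs it was given."""

    def __init__(self, canned: Dict[str, Any]) -> None:
        self.canned = canned
        self.calls: List[str] = []

    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        self.calls.append(url)
        return dict(self.canned)


async def fetch_title(extractor: MetadataExtractor, url: str) -> str:
    """Depends only on the protocol, so any conforming object can be passed in."""
    info = await extractor.extract_metadata(url)
    return info["title"]


fake = FakeExtractor({"title": "Example title"})
title = asyncio.run(fetch_title(fake, "https://youtu.be/dQw4w9WgXcQ"))
```

`FakeExtractor` does not inherit from `MetadataExtractor`; it conforms structurally, which is what lets a yt-dlp-backed class and a test fake be used interchangeably.
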
## Error Handling

The service includes comprehensive error handling:

- **Invalid URLs**: Validates YouTube URL format
- **Network Issues**: Handles connection timeouts
- **yt-dlp Errors**: Captures and logs extraction failures
- **Database Errors**: Handles database connection issues
- **Missing Dependencies**: Checks for required tools

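
Several of these failure modes can be handled around a single `yt-dlp` subprocess call. The following is a rough sketch under stated assumptions rather than the service's implementation: `ExtractionError` and `fetch_info` are hypothetical names, and the 30-second timeout is an arbitrary default. The flags `--dump-json`, `--skip-download`, and `--no-playlist` are existing yt-dlp options.

```python
import json
import shutil
import subprocess
from typing import Any, Dict


class ExtractionError(RuntimeError):
    """Hypothetical error type raised for any extraction failure."""


def fetch_info(url: str, binary: str = "yt-dlp", timeout: float = 30.0) -> Dict[str, Any]:
    # Missing dependency: fail early with a clear message.
    if shutil.which(binary) is None:
        raise ExtractionError(f"{binary} not available")

    # Print the metadata as JSON without downloading any media.
    cmd = [binary, "--dump-json", "--skip-download", "--no-playlist", url]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # Network issues usually surface as a hung or slow process.
        raise ExtractionError(f"timed out after {timeout} seconds") from exc

    # yt-dlp errors: a non-zero exit code with details on stderr.
    if proc.returncode != 0:
        raise ExtractionError(f"Failed to extract metadata: {proc.stderr.strip()}")

    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise ExtractionError("yt-dlp returned output that is not valid JSON") from exc
```

Passing the command as a list (no `shell=True`) means the URL is never interpreted by a shell, which also sidesteps quoting problems with `?` and `&`.
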
## Testing

Run the tests with:

```bash
# Run all YouTube service tests
uv run pytest tests/test_youtube_service.py -v

# Run specific test class
uv run pytest tests/test_youtube_service.py::TestCurlYouTubeExtractor -v

# Run with coverage
uv run pytest tests/test_youtube_service.py --cov=src.services.youtube_service --cov=src.repositories.youtube_repository
```

## Example Script

Run the example script to see the service in action:

```bash
uv run python examples/youtube_metadata_example.py
```

## Troubleshooting

### Common Issues

1. **yt-dlp not found**
   ```
   Error: yt-dlp not available
   ```
   **Solution**: Install yt-dlp using pip or your system package manager

2. **Database connection error**
   ```
   Error: Could not connect to database
   ```
   **Solution**: Ensure PostgreSQL is running and `DATABASE_URL` is correct

3. **Video not found**
   ```
   Error: Failed to extract metadata: Video not found
   ```
   **Solution**: Check that the YouTube URL is valid and the video is accessible

4. **Permission denied**
   ```
   Error: Permission denied when running yt-dlp
   ```
   **Solution**: Ensure the yt-dlp executable has execute permissions

### Health Check

Check service health:

```python
from src.services.youtube_service import YouTubeMetadataService

service = YouTubeMetadataService()
health = service.get_health_status()
print(health)
```

This will show:
- Service status
- yt-dlp availability
- Cache directory location

## Performance

- **Extraction Time**: ~2-5 seconds per video (depends on network)
- **Database Operations**: <100 ms for most operations
- **Memory Usage**: <50 MB for typical usage
- **Concurrent Requests**: Limited by yt-dlp and database connections

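
Since concurrency is bounded by yt-dlp processes and database connections, callers that extract many URLs may want an explicit cap. The sketch below shows one common way to do that with `asyncio.Semaphore`; `extract_many`, its `extract_one` callback, and the default limit of 3 are illustrative assumptions and are not part of the service.

```python
import asyncio
from typing import Awaitable, Callable, List


async def extract_many(
    urls: List[str],
    extract_one: Callable[[str], Awaitable[str]],
    max_concurrent: int = 3,
) -> List[str]:
    """Run extract_one for every URL, never more than max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await extract_one(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

In real use, `extract_one` would wrap whatever coroutine performs a single extraction; the limit should stay at or below the size of the database connection pool.
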
## Security Considerations

- No API keys required (uses public YouTube data)
- Local caching for performance
- Input validation for URLs
- SQL injection protection via parameterized queries
- No sensitive data stored

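
To illustrate the parameterized-query point, the sketch below uses the standard library's `sqlite3` module as a stand-in so that it runs without a PostgreSQL server. This is an assumption for demonstration only: the service targets PostgreSQL, where the placeholder syntax depends on the driver (for example `%s` with psycopg or `$1` with asyncpg), and the table here is a reduced version of `youtube_videos`. The helper names are hypothetical.

```python
import sqlite3
from typing import List

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE youtube_videos (youtube_id TEXT PRIMARY KEY, title TEXT NOT NULL)"
)


def insert_video(db: sqlite3.Connection, youtube_id: str, title: str) -> None:
    # Values are bound through placeholders, never formatted into the SQL text.
    db.execute(
        "INSERT INTO youtube_videos (youtube_id, title) VALUES (?, ?)",
        (youtube_id, title),
    )


def find_by_title(db: sqlite3.Connection, term: str) -> List[str]:
    rows = db.execute(
        "SELECT youtube_id FROM youtube_videos WHERE title LIKE ?",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


# A hostile title is stored as plain data instead of being executed as SQL.
hostile_title = "x'; DROP TABLE youtube_videos; --"
insert_video(conn, "dQw4w9WgXcQ", hostile_title)
```
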
## Future Enhancements

- [ ] Batch processing for multiple URLs
- [ ] Caching extracted metadata
- [ ] Support for playlists
- [ ] Video thumbnail extraction
- [ ] Automatic metadata refresh
- [ ] Integration with transcription pipeline