YouTube Metadata Extraction Service
The YouTube metadata extraction service allows you to extract metadata from YouTube URLs without using the YouTube API. It uses yt-dlp (a fork of youtube-dl) to extract metadata and stores it in the PostgreSQL database.
Features
- ✅ Extract metadata from YouTube URLs using curl/yt-dlp
- ✅ Store metadata in PostgreSQL database
- ✅ Support for various YouTube URL formats
- ✅ CLI commands for easy management
- ✅ Protocol-based architecture for easy testing
- ✅ Comprehensive error handling
- ✅ Health status monitoring
Installation
Prerequisites
- yt-dlp: Install the YouTube downloader tool
  # Using pip
  pip install yt-dlp
  # Using uv (recommended)
  uv pip install yt-dlp
  # Using system package manager
  # Ubuntu/Debian
  sudo apt install yt-dlp
  # macOS
  brew install yt-dlp
- Database: Ensure PostgreSQL is running and the database is set up
  # Run database migrations
  alembic upgrade head
Dependencies
Install the required Python packages:
uv pip install -r requirements-youtube.txt
Usage
CLI Commands
The YouTube service is integrated into the main CLI with the youtube command group:
Extract Metadata
# Extract metadata from a YouTube URL
trax youtube extract https://youtube.com/watch?v=dQw4w9WgXcQ
# Force re-extraction even if video exists
trax youtube extract https://youtube.com/watch?v=dQw4w9WgXcQ --force
List Videos
# List recent videos (default: 10)
trax youtube list
# List more videos
trax youtube list --limit 20
# Search by title
trax youtube list --search "python tutorial"
# Filter by channel
trax youtube list --channel "Tech Channel"
Show Video Details
# Show detailed information for a video
trax youtube show dQw4w9WgXcQ
Statistics
# Show YouTube video statistics
trax youtube stats
Delete Video
# Delete a video from database
trax youtube delete dQw4w9WgXcQ
Programmatic Usage
import asyncio

from src.services.youtube_service import YouTubeMetadataService
from src.repositories.youtube_repository import YouTubeRepository


async def example():
    # Initialize service
    service = YouTubeMetadataService()
    await service.initialize()

    # Extract and store metadata
    video = await service.extract_and_store_metadata(
        "https://youtube.com/watch?v=dQw4w9WgXcQ"
    )
    print(f"Title: {video.title}")
    print(f"Channel: {video.channel}")
    print(f"Duration: {video.duration_seconds} seconds")

    # Use repository for database operations
    repo = YouTubeRepository()
    videos = await repo.list_all(limit=10)
    stats = await repo.get_statistics()


# Run the example
asyncio.run(example())
Supported URL Formats
The service supports various YouTube URL formats:
- https://www.youtube.com/watch?v=VIDEO_ID
- https://youtu.be/VIDEO_ID
- https://www.youtube.com/embed/VIDEO_ID
- https://www.youtube.com/v/VIDEO_ID
- URLs with additional parameters (e.g., &t=30s)
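As a rough illustration of how these URL shapes can be reduced to a single video ID, here is a minimal sketch using only the standard library. The function name `extract_video_id` and the `YOUTUBE_HOSTS` set are hypothetical and are not taken from the service's code, which may validate URLs differently.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hosts treated as YouTube in this sketch (an assumption, not the service's list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID for the URL shapes listed above, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path_parts = [part for part in parsed.path.split("/") if part]

    # Short links: https://youtu.be/VIDEO_ID
    if host == "youtu.be":
        return path_parts[0] if path_parts else None

    if host not in YOUTUBE_HOSTS:
        return None

    # Watch pages: /watch?v=VIDEO_ID (extra parameters such as &t=30s are ignored)
    if parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    # Embed and legacy paths: /embed/VIDEO_ID and /v/VIDEO_ID
    if len(path_parts) >= 2 and path_parts[0] in ("embed", "v"):
        return path_parts[1]

    return None
```

Returning `None` for unrecognized URLs lets the caller decide whether to raise an error or skip the input.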
Extracted Metadata
The service extracts the following metadata:
- YouTube ID: Unique identifier for the video
- Title: Video title
- Channel: Uploader/channel name
- Description: Video description
- Duration: Video length in seconds
- URL: Original YouTube URL
- Metadata Extracted At: Timestamp of extraction
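The fields above correspond closely to keys in the JSON that `yt-dlp --dump-json` prints (`id`, `title`, `channel`, `uploader`, `description`, `duration`). The sketch below shows one plausible mapping. The `VideoMetadata` dataclass and `from_ytdlp_info` helper are hypothetical names, and the fallback rules (empty strings, a zero duration, a UTC timestamp) are assumptions rather than the service's documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class VideoMetadata:
    youtube_id: str
    title: str
    channel: str
    description: str
    duration_seconds: int
    url: str
    metadata_extracted_at: datetime


def from_ytdlp_info(info: Dict[str, Any], original_url: str) -> VideoMetadata:
    """Map a yt-dlp info dict to the metadata fields listed above."""
    return VideoMetadata(
        youtube_id=info["id"],
        title=info.get("title") or "",
        # Prefer "channel" and fall back to "uploader" when it is missing.
        channel=info.get("channel") or info.get("uploader") or "",
        description=info.get("description") or "",
        # yt-dlp may report the duration as a float; the table stores whole seconds.
        duration_seconds=int(info.get("duration") or 0),
        url=original_url,
        metadata_extracted_at=datetime.now(timezone.utc),
    )
```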
Database Schema
The metadata is stored in the youtube_videos table:
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
Architecture
The service follows the protocol-based architecture pattern:
Components
- YouTubeMetadataService: Main service class
  - Manages the extraction workflow
  - Handles database operations
  - Provides health monitoring
- CurlYouTubeExtractor: Metadata extraction implementation
  - Uses yt-dlp for metadata extraction
  - Handles various URL formats
  - Provides error handling
- YouTubeRepository: Database operations
  - CRUD operations for YouTube videos
  - Search and filtering capabilities
  - Statistics generation
Protocols
- YouTubeMetadataExtractor: Protocol for metadata extraction
- YouTubeRepositoryProtocol: Protocol for repository operations
This allows for easy testing and swapping implementations.
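To make the testing benefit concrete, here is a minimal sketch of a `typing.Protocol` with a stand-in implementation. The protocol name comes from the list above, but the method name `extract_metadata`, its signature, and the `FakeExtractor` and `fetch_title` helpers are assumptions for illustration only.

```python
import asyncio
from typing import Any, Dict, Protocol


class YouTubeMetadataExtractor(Protocol):
    # The method name and signature are assumptions, not the project's actual API.
    async def extract_metadata(self, url: str) -> Dict[str, Any]: ...


class FakeExtractor:
    """Test double: matches the protocol structurally, with no yt-dlp or network."""

    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        return {"youtube_id": "dQw4w9WgXcQ", "title": "Stub title", "url": url}


async def fetch_title(extractor: YouTubeMetadataExtractor, url: str) -> str:
    # Callers depend only on the protocol, so any conforming extractor works.
    metadata = await extractor.extract_metadata(url)
    return metadata["title"]


title = asyncio.run(fetch_title(FakeExtractor(), "https://youtu.be/dQw4w9WgXcQ"))
```

Because protocols are checked structurally, `FakeExtractor` does not need to inherit from anything. A test can pass it wherever a real extractor is expected.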
Error Handling
The service includes comprehensive error handling:
- Invalid URLs: Validates YouTube URL format
- Network Issues: Handles connection timeouts
- yt-dlp Errors: Captures and logs extraction failures
- Database Errors: Handles database connection issues
- Missing Dependencies: Checks for required tools
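One way these categories could be surfaced to callers is a small exception hierarchy wrapped around the external tool call. The following is a sketch under stated assumptions: the exception classes and the `run_tool` helper are hypothetical names, and the service's real error types may differ.

```python
import shutil
import subprocess
from typing import List


class YouTubeServiceError(Exception):
    """Base class for the hypothetical error hierarchy in this sketch."""


class MissingDependencyError(YouTubeServiceError):
    """Raised when a required external tool is not on PATH."""


class ExtractionError(YouTubeServiceError):
    """Raised when the tool times out or exits with a non-zero status."""


def run_tool(command: List[str], timeout_seconds: float = 30.0) -> str:
    """Run an external command and translate failures into service errors."""
    if shutil.which(command[0]) is None:
        raise MissingDependencyError(f"{command[0]} not available")
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired as exc:
        raise ExtractionError(
            f"{command[0]} timed out after {timeout_seconds}s"
        ) from exc
    if result.returncode != 0:
        # Pass the tool's stderr along so the failure can be logged.
        raise ExtractionError(result.stderr.strip() or f"{command[0]} failed")
    return result.stdout
```

A call such as `run_tool(["yt-dlp", "--dump-json", url])` would then raise `MissingDependencyError` when yt-dlp is not installed and `ExtractionError` when extraction fails or times out.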
Testing
Run the tests with:
# Run all YouTube service tests
uv run pytest tests/test_youtube_service.py -v
# Run specific test class
uv run pytest tests/test_youtube_service.py::TestCurlYouTubeExtractor -v
# Run with coverage
uv run pytest tests/test_youtube_service.py --cov=src.services.youtube_service --cov=src.repositories.youtube_repository
Example Script
Run the example script to see the service in action:
uv run python examples/youtube_metadata_example.py
Troubleshooting
Common Issues
- yt-dlp not found
  Error: yt-dlp not available
  Solution: Install yt-dlp using pip or your system package manager
- Database connection error
  Error: Could not connect to database
  Solution: Ensure PostgreSQL is running and DATABASE_URL is correct
- Video not found
  Error: Failed to extract metadata: Video not found
  Solution: Check if the YouTube URL is valid and accessible
- Permission denied
  Error: Permission denied when running yt-dlp
  Solution: Ensure yt-dlp has execute permissions
Health Check
Check service health:
service = YouTubeMetadataService()
health = service.get_health_status()
print(health)
This will show:
- Service status
- yt-dlp availability
- Cache directory location
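The exact shape of the returned value is not documented here. As a sketch of how such a report could be assembled, the helper below builds a dictionary with the three items listed. The function name `build_health_status`, the key names, and the "healthy"/"degraded" values are all assumptions.

```python
import shutil
from pathlib import Path
from typing import Any, Dict


def build_health_status(cache_dir: Path, tool: str = "yt-dlp") -> Dict[str, Any]:
    """Assemble a health report; the key names are illustrative assumptions."""
    # shutil.which returns None when the executable is not found on PATH.
    tool_available = shutil.which(tool) is not None
    return {
        "status": "healthy" if tool_available else "degraded",
        "yt_dlp_available": tool_available,
        "cache_dir": str(cache_dir),
    }
```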
Performance
- Extraction Time: ~2-5 seconds per video (depends on network)
- Database Operations: <100ms for most operations
- Memory Usage: <50MB for typical usage
- Concurrent Requests: Limited by yt-dlp and database connections
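Since concurrency is bounded by yt-dlp and by database connections, callers that process several URLs may want to cap the number of extractions in flight. The sketch below does this with `asyncio.Semaphore`. The `extract_many` helper and its `max_concurrent` default are hypothetical, and `extract_one` stands in for whatever coroutine performs a single extraction.

```python
import asyncio
from typing import Awaitable, Callable, List


async def extract_many(
    urls: List[str],
    extract_one: Callable[[str], Awaitable[str]],
    max_concurrent: int = 3,
) -> List[str]:
    """Run extract_one over urls with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await extract_one(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

A reasonable starting value for `max_concurrent` is something at or below the size of the database connection pool, tuned from there.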
Security Considerations
- No API keys required (uses public YouTube data)
- Local caching for performance
- Input validation for URLs
- SQL injection protection via parameterized queries
- No sensitive data stored
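To illustrate why parameterized queries protect against SQL injection, the sketch below stores a hostile-looking title safely. It uses Python's built-in `sqlite3` with an in-memory database purely so that it runs without a server. This is a stand-in, not the service's PostgreSQL code, and PostgreSQL drivers use a different placeholder syntax. The table is a simplified two-column version of `youtube_videos`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE youtube_videos (youtube_id TEXT PRIMARY KEY, title TEXT NOT NULL)"
)

# A title that would break a query built by string concatenation.
hostile_title = "x'); DROP TABLE youtube_videos; --"

# Values are passed separately from the SQL text, so they are never parsed as SQL.
conn.execute(
    "INSERT INTO youtube_videos (youtube_id, title) VALUES (?, ?)",
    ("dQw4w9WgXcQ", hostile_title),
)

row = conn.execute(
    "SELECT title FROM youtube_videos WHERE youtube_id = ?",
    ("dQw4w9WgXcQ",),
).fetchone()
```

The title comes back from the database unchanged and the table remains intact, because the driver treats the value as data and not as part of the statement.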
Future Enhancements
- Batch processing for multiple URLs
- Caching extracted metadata
- Support for playlists
- Video thumbnail extraction
- Automatic metadata refresh
- Integration with transcription pipeline