# YouTube Metadata Extraction Service

The YouTube metadata extraction service allows you to extract metadata from YouTube URLs without using the YouTube API. It uses `yt-dlp` (a fork of youtube-dl) to extract metadata and stores it in the PostgreSQL database.

## Features

- ✅ Extract metadata from YouTube URLs using curl/yt-dlp
- ✅ Store metadata in PostgreSQL database
- ✅ Support for various YouTube URL formats
- ✅ CLI commands for easy management
- ✅ Protocol-based architecture for easy testing
- ✅ Comprehensive error handling
- ✅ Health status monitoring

## Installation

### Prerequisites

1. **yt-dlp**: Install the YouTube downloader tool

   ```bash
   # Using pip
   pip install yt-dlp

   # Using uv (recommended)
   uv pip install yt-dlp

   # Using system package manager
   # Ubuntu/Debian
   sudo apt install yt-dlp

   # macOS
   brew install yt-dlp
   ```

2. **Database**: Ensure PostgreSQL is running and the database is set up

   ```bash
   # Run database migrations
   alembic upgrade head
   ```

### Dependencies

Install the required Python packages:

```bash
uv pip install -r requirements-youtube.txt
```

## Usage

### CLI Commands

The YouTube service is integrated into the main CLI with the `youtube` command group:

#### Extract Metadata

```bash
# Extract metadata from a YouTube URL
trax youtube extract https://youtube.com/watch?v=dQw4w9WgXcQ

# Force re-extraction even if video exists
trax youtube extract https://youtube.com/watch?v=dQw4w9WgXcQ --force
```

#### List Videos

```bash
# List recent videos (default: 10)
trax youtube list

# List more videos
trax youtube list --limit 20

# Search by title
trax youtube list --search "python tutorial"

# Filter by channel
trax youtube list --channel "Tech Channel"
```

#### Show Video Details

```bash
# Show detailed information for a video
trax youtube show dQw4w9WgXcQ
```

#### Statistics

```bash
# Show YouTube video statistics
trax youtube stats
```

#### Delete Video

```bash
# Delete a video from database
trax youtube delete dQw4w9WgXcQ
```

### Programmatic Usage

```python
import asyncio

from src.services.youtube_service import YouTubeMetadataService
from src.repositories.youtube_repository import YouTubeRepository


async def example():
    # Initialize service
    service = YouTubeMetadataService()
    await service.initialize()

    # Extract and store metadata
    video = await service.extract_and_store_metadata(
        "https://youtube.com/watch?v=dQw4w9WgXcQ"
    )
    print(f"Title: {video.title}")
    print(f"Channel: {video.channel}")
    print(f"Duration: {video.duration_seconds} seconds")

    # Use repository for database operations
    repo = YouTubeRepository()
    videos = await repo.list_all(limit=10)
    stats = await repo.get_statistics()


# Run the example
asyncio.run(example())
```

## Supported URL Formats

The service supports various YouTube URL formats:

- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://www.youtube.com/v/VIDEO_ID`
- URLs with additional parameters (e.g., `&t=30s`)

## Extracted Metadata

The service extracts the following metadata:

- **YouTube ID**: Unique identifier for the video
- **Title**: Video title
- **Channel**: Uploader/channel name
- **Description**: Video description
- **Duration**: Video length in seconds
- **URL**: Original YouTube URL
- **Metadata Extracted At**: Timestamp of extraction

## Database Schema

The metadata is stored in the `youtube_videos` table:

```sql
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
```

## Architecture

The service follows the protocol-based architecture pattern:

### Components

1. **YouTubeMetadataService**: Main service class
   - Manages the extraction workflow
   - Handles database operations
   - Provides health monitoring

2. **CurlYouTubeExtractor**: Metadata extraction implementation
   - Uses `yt-dlp` for metadata extraction
   - Handles various URL formats
   - Provides error handling

3. **YouTubeRepository**: Database operations
   - CRUD operations for YouTube videos
   - Search and filtering capabilities
   - Statistics generation

### Protocols

- `YouTubeMetadataExtractor`: Protocol for metadata extraction
- `YouTubeRepositoryProtocol`: Protocol for repository operations

This allows for easy testing and swapping implementations.

## Error Handling

The service includes comprehensive error handling:

- **Invalid URLs**: Validates YouTube URL format
- **Network Issues**: Handles connection timeouts
- **yt-dlp Errors**: Captures and logs extraction failures
- **Database Errors**: Handles database connection issues
- **Missing Dependencies**: Checks for required tools

## Testing

Run the tests with:

```bash
# Run all YouTube service tests
uv run pytest tests/test_youtube_service.py -v

# Run specific test class
uv run pytest tests/test_youtube_service.py::TestCurlYouTubeExtractor -v

# Run with coverage
uv run pytest tests/test_youtube_service.py --cov=src.services.youtube_service --cov=src.repositories.youtube_repository
```

## Example Script

Run the example script to see the service in action:

```bash
uv run python examples/youtube_metadata_example.py
```

## Troubleshooting

### Common Issues

1. **yt-dlp not found**

   ```
   Error: yt-dlp not available
   ```

   **Solution**: Install yt-dlp using pip or your system package manager

2. **Database connection error**

   ```
   Error: Could not connect to database
   ```

   **Solution**: Ensure PostgreSQL is running and DATABASE_URL is correct

3. **Video not found**

   ```
   Error: Failed to extract metadata: Video not found
   ```

   **Solution**: Check if the YouTube URL is valid and accessible

4. **Permission denied**

   ```
   Error: Permission denied when running yt-dlp
   ```

   **Solution**: Ensure yt-dlp has execute permissions

### Health Check

Check service health:

```python
service = YouTubeMetadataService()
health = service.get_health_status()
print(health)
```

This will show:

- Service status
- yt-dlp availability
- Cache directory location

## Performance

- **Extraction Time**: ~2-5 seconds per video (depends on network)
- **Database Operations**: <100ms for most operations
- **Memory Usage**: <50MB for typical usage
- **Concurrent Requests**: Limited by yt-dlp and database connections

## Security Considerations

- No API keys required (uses public YouTube data)
- Local caching for performance
- Input validation for URLs
- SQL injection protection via parameterized queries
- No sensitive data stored

## Future Enhancements

- [ ] Batch processing for multiple URLs
- [ ] Caching extracted metadata
- [ ] Support for playlists
- [ ] Video thumbnail extraction
- [ ] Automatic metadata refresh
- [ ] Integration with transcription pipeline
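
## Sketch: Parsing Video IDs from the Supported URL Formats

The formats listed under Supported URL Formats all carry the same video ID in different places. The following is a minimal sketch of how those formats could be normalized using only the standard library. The function name `extract_video_id` and the host set `YOUTUBE_HOSTS` are hypothetical and are not taken from the service's code; how the service itself validates URLs is not shown in this document.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical host allow-list for this sketch; the real service may accept more.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a supported YouTube URL, or None if unrecognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links: https://youtu.be/VIDEO_ID
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # Watch links: the ID is the `v` query parameter; extras like &t=30s are ignored.
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        else:
            # Embed and legacy links: /embed/VIDEO_ID or /v/VIDEO_ID
            parts = parsed.path.strip("/").split("/")
            is_id_path = len(parts) >= 2 and parts[0] in {"embed", "v"}
            candidate = parts[1] if is_id_path else ""
    else:
        candidate = ""

    return candidate or None
```

Parsing with `urlparse` and `parse_qs` rather than a single regular expression keeps extra query parameters such as `&t=30s` from leaking into the ID, and makes it straightforward to reject non-YouTube hosts before anything is passed to `yt-dlp`.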
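
## Sketch: Swapping Extractors via a Protocol

The Architecture section notes that protocols allow easy testing and swapping of implementations. The sketch below illustrates the general idea with `typing.Protocol`. Every name in it (`VideoMetadata`, `ExtractorProtocol`, `FakeExtractor`, `describe`, and the method `extract_metadata`) is hypothetical: this document does not show the method signatures of the real `YouTubeMetadataExtractor` protocol, so treat this as an illustration of the pattern rather than the service's actual interface.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class VideoMetadata:
    """Hypothetical subset of the fields listed under Extracted Metadata."""
    youtube_id: str
    title: str
    channel: str
    duration_seconds: int


class ExtractorProtocol(Protocol):
    """Structural interface: any object with a matching method satisfies it."""

    async def extract_metadata(self, url: str) -> VideoMetadata: ...


class FakeExtractor:
    """Test double that returns canned metadata without network access or yt-dlp."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    async def extract_metadata(self, url: str) -> VideoMetadata:
        self.calls.append(url)  # record the call so a test can assert on it
        return VideoMetadata("abc123XYZ_-", "Canned title", "Canned channel", 42)


async def describe(extractor: ExtractorProtocol, url: str) -> str:
    """Code under test depends only on the protocol, not on a concrete class."""
    video = await extractor.extract_metadata(url)
    return f"{video.title} ({video.duration_seconds}s) by {video.channel}"


if __name__ == "__main__":
    print(asyncio.run(describe(FakeExtractor(), "https://youtu.be/abc123XYZ_-")))
```

Because `FakeExtractor` never inherits from `ExtractorProtocol`, a production extractor and a test double can live in separate modules and still be interchangeable wherever the protocol is the declared parameter type.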
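
## Sketch: Mapping yt-dlp Output to the Stored Fields

The Extracted Metadata and Database Schema sections describe which fields end up in `youtube_videos`. As a rough illustration of how `yt-dlp` output could be mapped onto those columns, the sketch below calls the `yt-dlp` CLI with `--dump-json` and reshapes the result. The helper names `fetch_raw_info` and `to_row` are hypothetical, and whether `CurlYouTubeExtractor` invokes `yt-dlp` this way is an assumption; this document does not show its implementation.

```python
import json
import subprocess
from typing import Any, Dict


def fetch_raw_info(url: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Ask the yt-dlp CLI for a video's metadata as JSON (no media is downloaded)."""
    result = subprocess.run(
        ["yt-dlp", "--dump-json", "--no-playlist", url],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


def to_row(info: Dict[str, Any], source_url: str) -> Dict[str, Any]:
    """Reshape yt-dlp's info dict into the columns of the youtube_videos table."""
    return {
        "youtube_id": info["id"],
        "title": info.get("title") or "",
        # Fall back to `uploader` when `channel` is absent from the info dict.
        "channel": info.get("channel") or info.get("uploader") or "",
        "description": info.get("description"),
        # yt-dlp may report duration as a float; the schema stores whole seconds.
        "duration_seconds": int(info.get("duration") or 0),
        "url": source_url,
    }
```

Keeping the subprocess call separate from the reshaping step means `to_row` can be unit-tested with a plain dictionary, without network access or a `yt-dlp` installation.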