YouTube Metadata Extraction Service

The YouTube metadata extraction service extracts metadata from YouTube URLs without using the YouTube Data API. It runs yt-dlp (a fork of youtube-dl) to fetch the metadata and stores the result in a PostgreSQL database.

Features

  • Extract metadata from YouTube URLs using curl/yt-dlp
  • Store metadata in a PostgreSQL database
  • Support for various YouTube URL formats
  • CLI commands for easy management
  • Protocol-based architecture for easy testing
  • Comprehensive error handling
  • Health status monitoring

Installation

Prerequisites

  1. yt-dlp: Install the YouTube downloader tool

    # Using pip
    pip install yt-dlp
    
    # Using uv (recommended)
    uv pip install yt-dlp
    
    # Using system package manager
    # Ubuntu/Debian
    sudo apt install yt-dlp
    
    # macOS
    brew install yt-dlp
    
  2. Database: Ensure PostgreSQL is running and the database is set up

    # Run database migrations
    alembic upgrade head
    

Dependencies

Install the required Python packages:

uv pip install -r requirements-youtube.txt

Usage

CLI Commands

The YouTube service is integrated into the main CLI with the youtube command group:

Extract Metadata

# Extract metadata from a YouTube URL (quote it so the shell does not interpret ? or &)
trax youtube extract "https://youtube.com/watch?v=dQw4w9WgXcQ"

# Force re-extraction even if the video already exists
trax youtube extract "https://youtube.com/watch?v=dQw4w9WgXcQ" --force

List Videos

# List recent videos (default: 10)
trax youtube list

# List more videos
trax youtube list --limit 20

# Search by title
trax youtube list --search "python tutorial"

# Filter by channel
trax youtube list --channel "Tech Channel"

Show Video Details

# Show detailed information for a video
trax youtube show dQw4w9WgXcQ

Statistics

# Show YouTube video statistics
trax youtube stats

Delete Video

# Delete a video from the database
trax youtube delete dQw4w9WgXcQ

Programmatic Usage

import asyncio
from src.services.youtube_service import YouTubeMetadataService
from src.repositories.youtube_repository import YouTubeRepository

async def example():
    # Initialize service
    service = YouTubeMetadataService()
    await service.initialize()
    
    # Extract and store metadata
    video = await service.extract_and_store_metadata(
        "https://youtube.com/watch?v=dQw4w9WgXcQ"
    )
    
    print(f"Title: {video.title}")
    print(f"Channel: {video.channel}")
    print(f"Duration: {video.duration_seconds} seconds")
    
    # Use repository for database operations
    repo = YouTubeRepository()
    videos = await repo.list_all(limit=10)
    stats = await repo.get_statistics()

# Run the example
asyncio.run(example())

Supported URL Formats

The service supports various YouTube URL formats:

  • https://www.youtube.com/watch?v=VIDEO_ID
  • https://youtu.be/VIDEO_ID
  • https://www.youtube.com/embed/VIDEO_ID
  • https://www.youtube.com/v/VIDEO_ID
  • URLs with additional parameters (e.g., &t=30s)
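
The URL shapes listed above can all be reduced to a single video ID. The following is a minimal sketch of one way to do that with the standard library; the function name `extract_video_id` and the `YOUTUBE_HOSTS` set are hypothetical and are not taken from the service's source.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical allow-list of hosts; the real service may accept others.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID for the URL shapes listed above, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None

    if host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # Extra parameters such as &t=30s are simply ignored.
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        for prefix in ("/embed/", "/v/"):
            if parsed.path.startswith(prefix):
                video_id = parsed.path[len(prefix):].split("/")[0]
                return video_id or None

    return None
```

Returning `None` for anything unrecognized lets the caller decide how to report an invalid URL.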

Extracted Metadata

The service extracts the following metadata:

  • YouTube ID: Unique identifier for the video
  • Title: Video title
  • Channel: Uploader/channel name
  • Description: Video description
  • Duration: Video length in seconds
  • URL: Original YouTube URL
  • Metadata Extracted At: Timestamp of extraction
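
As a rough illustration of how these fields could be populated, the sketch below maps a yt-dlp info dictionary onto a record with the fields listed above. The keys `id`, `title`, `channel`, `uploader`, `description`, and `duration` are standard yt-dlp output fields; the `VideoMetadata` class, the `metadata_from_info` function, the uploader fallback, and the use of a timezone-aware UTC timestamp are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class VideoMetadata:  # hypothetical name, not the service's model class
    youtube_id: str
    title: str
    channel: str
    description: Optional[str]
    duration_seconds: int
    url: str
    metadata_extracted_at: datetime


def metadata_from_info(info: Mapping[str, Any], url: str) -> VideoMetadata:
    """Build a record from a yt-dlp info dict and the original URL."""
    return VideoMetadata(
        youtube_id=info["id"],
        title=info["title"],
        # Assumption: fall back to the uploader when no channel name is present.
        channel=info.get("channel") or info.get("uploader") or "",
        description=info.get("description"),
        # yt-dlp may report duration as a float; store whole seconds.
        duration_seconds=int(info.get("duration") or 0),
        url=url,
        metadata_extracted_at=datetime.now(timezone.utc),
    )
```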

Database Schema

The metadata is stored in the youtube_videos table:

CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

Architecture

The service follows a protocol-based architecture: components depend on protocol interfaces rather than on concrete classes.

Components

  1. YouTubeMetadataService: Main service class

    • Manages the extraction workflow
    • Handles database operations
    • Provides health monitoring
  2. CurlYouTubeExtractor: Metadata extraction implementation

    • Uses yt-dlp for metadata extraction
    • Handles various URL formats
    • Provides error handling
  3. YouTubeRepository: Database operations

    • CRUD operations for YouTube videos
    • Search and filtering capabilities
    • Statistics generation
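
To make the extraction step more concrete, here is a minimal sketch of calling yt-dlp as a subprocess and parsing its JSON output. It is not the actual `CurlYouTubeExtractor` code; the function names, the injectable `run` parameter, and the 30-second timeout are assumptions. The `--dump-json` and `--no-playlist` flags are real yt-dlp options.

```python
import json
import subprocess
from typing import Any, Callable, Dict, List


def build_ytdlp_command(url: str) -> List[str]:
    # --dump-json prints the info dict as JSON without downloading the video;
    # --no-playlist limits extraction to the single video in the URL.
    return ["yt-dlp", "--dump-json", "--no-playlist", url]


def extract_info(
    url: str,
    run: Callable[..., Any] = subprocess.run,  # injectable for tests
    timeout: float = 30.0,  # assumed value, tune as needed
) -> Dict[str, Any]:
    """Run yt-dlp and return the parsed info dict, raising on failure."""
    result = run(
        build_ytdlp_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(f"Failed to extract metadata: {result.stderr.strip()}")
    return json.loads(result.stdout)
```

Passing the command as a list rather than a shell string keeps the URL from being interpreted by a shell.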

Protocols

  • YouTubeMetadataExtractor: Protocol for metadata extraction
  • YouTubeRepositoryProtocol: Protocol for repository operations

Because callers depend on these protocols rather than on concrete classes, implementations can be swapped out or replaced with test doubles.
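
The sketch below shows how a protocol of this kind could be declared and satisfied by a test double. The protocol name comes from the list above, but the method name `extract_metadata`, its signature, and the `FakeExtractor` and `fetch_title` helpers are assumptions; the real protocol may look different.

```python
import asyncio
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class YouTubeMetadataExtractor(Protocol):
    # Assumed method name and signature, for illustration only.
    async def extract_metadata(self, url: str) -> Dict[str, Any]: ...


class FakeExtractor:
    """Test double: matches the protocol structurally, without inheriting from it."""

    def __init__(self, canned: Dict[str, Any]) -> None:
        self.canned = canned
        self.calls: List[str] = []

    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        self.calls.append(url)  # record the call so tests can assert on it
        return self.canned


async def fetch_title(extractor: YouTubeMetadataExtractor, url: str) -> str:
    """Example caller that only knows about the protocol."""
    info = await extractor.extract_metadata(url)
    return info["title"]
```

A test can then pass `FakeExtractor({"title": "Example"})` to `fetch_title` and never touch the network or yt-dlp.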

Error Handling

The service handles the following failure cases:

  • Invalid URLs: Validates YouTube URL format
  • Network Issues: Handles connection timeouts
  • yt-dlp Errors: Captures and logs extraction failures
  • Database Errors: Handles database connection issues
  • Missing Dependencies: Checks for required tools
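
One way to organize these cases is a small exception hierarchy with a shared base class, so callers can catch either a specific failure or all service errors at once. The class names and the `require_tool` helper below are hypothetical and are not taken from the service's source; `shutil.which` is the standard-library call for locating an executable on `PATH`.

```python
import shutil


class YouTubeServiceError(Exception):
    """Hypothetical base class for all service errors."""


class InvalidURLError(YouTubeServiceError):
    """The input is not a recognized YouTube URL."""


class ExtractionError(YouTubeServiceError):
    """yt-dlp ran but could not extract metadata."""


class MissingDependencyError(YouTubeServiceError):
    """A required external tool is not installed."""


def require_tool(name: str = "yt-dlp") -> str:
    """Return the tool's path, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise MissingDependencyError(f"{name} not available")
    return path
```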

Testing

Run the tests with:

# Run all YouTube service tests
uv run pytest tests/test_youtube_service.py -v

# Run specific test class
uv run pytest tests/test_youtube_service.py::TestCurlYouTubeExtractor -v

# Run with coverage
uv run pytest tests/test_youtube_service.py --cov=src.services.youtube_service --cov=src.repositories.youtube_repository

Example Script

Run the example script to see the service in action:

uv run python examples/youtube_metadata_example.py

Troubleshooting

Common Issues

  1. yt-dlp not found

    Error: yt-dlp not available
    

    Solution: Install yt-dlp using pip or your system package manager

  2. Database connection error

    Error: Could not connect to database
    

    Solution: Ensure PostgreSQL is running and DATABASE_URL is correct

  3. Video not found

    Error: Failed to extract metadata: Video not found
    

    Solution: Check if the YouTube URL is valid and accessible

  4. Permission denied

    Error: Permission denied when running yt-dlp
    

    Solution: Ensure yt-dlp has execute permissions

Health Check

Check service health:

from src.services.youtube_service import YouTubeMetadataService

service = YouTubeMetadataService()
health = service.get_health_status()
print(health)

The returned status includes:

  • Service status
  • yt-dlp availability
  • Cache directory location

Performance

  • Extraction Time: ~2-5 seconds per video (depends on network)
  • Database Operations: <100ms for most operations
  • Memory Usage: <50MB for typical usage
  • Concurrent Requests: Limited by yt-dlp and database connections
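
Since concurrency is bounded by yt-dlp and the database connection pool, a caller that extracts several URLs at once may want to cap the number of extractions in flight. The following sketch does this with `asyncio.Semaphore`; the `extract_many` helper and its default limit of 3 are assumptions and are not part of the service. `extract_one` stands for any coroutine function, such as the service's `extract_and_store_metadata`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def extract_many(
    urls: Iterable[str],
    extract_one: Callable[[str], Awaitable[Any]],
    max_concurrent: int = 3,  # assumed default, tune to your connection pool
) -> List[Any]:
    """Run extract_one for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> Any:
        async with semaphore:  # waits here when the limit is reached
            return await extract_one(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))
```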

Security Considerations

  • No API keys required (uses public YouTube data)
  • Local caching for performance
  • Input validation for URLs
  • SQL injection protection via parameterized queries
  • No sensitive data stored

Future Enhancements

  • Batch processing for multiple URLs
  • Caching extracted metadata
  • Support for playlists
  • Video thumbnail extraction
  • Automatic metadata refresh
  • Integration with transcription pipeline