YouTube Metadata Extraction Service

The YouTube metadata extraction service extracts metadata from YouTube URLs without using the YouTube API. It uses yt-dlp (a fork of youtube-dl) to fetch the metadata and stores it in a PostgreSQL database.

Features

✅ Extract metadata from YouTube URLs using curl/yt-dlp
✅ Store metadata in PostgreSQL database
✅ Support for various YouTube URL formats
✅ CLI commands for easy management
✅ Protocol-based architecture for easy testing
✅ Comprehensive error handling
✅ Health status monitoring

Installation

Prerequisites

yt-dlp: Install the YouTube downloader tool

# Using pip
pip install yt-dlp

# Using uv (recommended)
uv pip install yt-dlp

# Using system package manager
# Ubuntu/Debian
sudo apt install yt-dlp

# macOS
brew install yt-dlp

Database: Ensure PostgreSQL is running and the database is set up
```
# Run database migrations
alembic upgrade head
```

Dependencies

Install the required Python packages:

uv pip install -r requirements-youtube.txt

Usage

CLI Commands

The YouTube service is integrated into the main CLI with the youtube command group:

Extract Metadata

# Extract metadata from a YouTube URL
# (quote the URL so the shell does not treat ? or & as special characters)
trax youtube extract "https://youtube.com/watch?v=dQw4w9WgXcQ"

# Force re-extraction even if video exists
trax youtube extract "https://youtube.com/watch?v=dQw4w9WgXcQ" --force

List Videos

# List recent videos (default: 10)
trax youtube list

# List more videos
trax youtube list --limit 20

# Search by title
trax youtube list --search "python tutorial"

# Filter by channel
trax youtube list --channel "Tech Channel"

Show Video Details

# Show detailed information for a video
trax youtube show dQw4w9WgXcQ

Statistics

# Show YouTube video statistics
trax youtube stats

Delete Video

# Delete a video from database
trax youtube delete dQw4w9WgXcQ

Programmatic Usage

import asyncio
from src.services.youtube_service import YouTubeMetadataService
from src.repositories.youtube_repository import YouTubeRepository

async def example():
    # Initialize service
    service = YouTubeMetadataService()
    await service.initialize()
    
    # Extract and store metadata
    video = await service.extract_and_store_metadata(
        "https://youtube.com/watch?v=dQw4w9WgXcQ"
    )
    
    print(f"Title: {video.title}")
    print(f"Channel: {video.channel}")
    print(f"Duration: {video.duration_seconds} seconds")
    
    # Use repository for database operations
    repo = YouTubeRepository()
    videos = await repo.list_all(limit=10)
    stats = await repo.get_statistics()

# Run the example
asyncio.run(example())

Supported URL Formats

The service supports various YouTube URL formats:

https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://www.youtube.com/embed/VIDEO_ID
https://www.youtube.com/v/VIDEO_ID
URLs with additional parameters (e.g., &t=30s)
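
As an illustration of how the formats above can be normalized to a single video ID, here is a minimal sketch using only the standard library. The function name `extract_video_id`, the accepted host list, and the 11-character ID pattern are assumptions made for this example; the service's actual validation may differ.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters drawn from [A-Za-z0-9_-].
_VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")
# Assumption: these hosts are treated as YouTube; adjust as needed.
_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID for a supported YouTube URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    candidate = None

    if host == "youtu.be":
        # https://youtu.be/VIDEO_ID
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # https://www.youtube.com/watch?v=VIDEO_ID&t=30s
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            # https://www.youtube.com/embed/VIDEO_ID and /v/VIDEO_ID
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in ("embed", "v"):
                candidate = parts[1]

    if candidate and _VIDEO_ID.match(candidate):
        return candidate
    return None
```

Extra query parameters such as `&t=30s` are ignored because only the `v` parameter (or the relevant path segment) is read.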

Extracted Metadata

The service extracts the following metadata:

YouTube ID: Unique identifier for the video
Title: Video title
Channel: Uploader/channel name
Description: Video description
Duration: Video length in seconds
URL: Original YouTube URL
Metadata Extracted At: Timestamp of extraction
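
To show how these fields might be derived from yt-dlp output, here is a rough sketch that maps one JSON document (as printed by `yt-dlp --dump-json`) to a record. The `VideoMetadata` dataclass and `parse_ytdlp_json` function are hypothetical names for this example, and the fallback from `channel` to `uploader` is an assumption rather than the service's documented behavior.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class VideoMetadata:
    """Hypothetical container mirroring the fields listed above."""

    youtube_id: str
    title: str
    channel: str
    description: Optional[str]
    duration_seconds: int
    url: str
    metadata_extracted_at: datetime


def parse_ytdlp_json(raw: str, original_url: str) -> VideoMetadata:
    """Map one yt-dlp JSON document to a VideoMetadata record."""
    info = json.loads(raw)
    return VideoMetadata(
        youtube_id=info["id"],
        title=info["title"],
        # Assumption: fall back to the uploader name if no channel is reported.
        channel=info.get("channel") or info.get("uploader") or "",
        description=info.get("description"),
        # Duration may be missing or fractional; store whole seconds.
        duration_seconds=int(info.get("duration") or 0),
        url=original_url,
        metadata_extracted_at=datetime.now(timezone.utc),
    )
```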

Database Schema

The metadata is stored in the youtube_videos table:

CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

Architecture

The service follows the protocol-based architecture pattern:

Components

YouTubeMetadataService: Main service class
- Manages the extraction workflow
- Handles database operations
- Provides health monitoring
CurlYouTubeExtractor: Metadata extraction implementation
- Uses yt-dlp for metadata extraction
- Handles various URL formats
- Provides error handling
YouTubeRepository: Database operations
- CRUD operations for YouTube videos
- Search and filtering capabilities
- Statistics generation
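
For orientation, the extraction step can be pictured as a subprocess call to yt-dlp. The sketch below is an assumption about how such a call could look, not the actual `CurlYouTubeExtractor` code; `build_ytdlp_command`, `run_ytdlp`, and the 30-second timeout are hypothetical.

```python
import subprocess
from typing import List


def build_ytdlp_command(url: str) -> List[str]:
    """Build an argument list that asks yt-dlp for metadata only."""
    return [
        "yt-dlp",
        "--dump-json",    # print video information as JSON instead of downloading
        "--no-playlist",  # if the URL names a video and a playlist, use the video
        url,
    ]


def run_ytdlp(url: str, timeout_seconds: float = 30.0) -> str:
    """Run yt-dlp and return its standard output.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the call exceeds timeout_seconds.
    """
    result = subprocess.run(
        build_ytdlp_command(url),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_seconds,
    )
    return result.stdout
```

Passing the arguments as a list (rather than a shell string) means the URL is never interpreted by a shell.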

Protocols

YouTubeMetadataExtractor: Protocol for metadata extraction
YouTubeRepositoryProtocol: Protocol for repository operations

This allows for easy testing and swapping implementations.
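
A small sketch of what swapping implementations can look like with `typing.Protocol`. The protocol name comes from this document, but the method name `extract_metadata`, its signature, and the `FakeExtractor` and `fetch_title` helpers are assumptions made for illustration.

```python
import asyncio
from typing import Any, Dict, Protocol


class YouTubeMetadataExtractor(Protocol):
    """Protocol name from this document; the method signature is assumed."""

    async def extract_metadata(self, url: str) -> Dict[str, Any]: ...


class FakeExtractor:
    """Test double: satisfies the protocol structurally, no network access."""

    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        return {"youtube_id": "dQw4w9WgXcQ", "title": "Fake title", "url": url}


async def fetch_title(extractor: YouTubeMetadataExtractor, url: str) -> str:
    """Depends only on the protocol, so any conforming extractor works."""
    metadata = await extractor.extract_metadata(url)
    return metadata["title"]


if __name__ == "__main__":
    # Prints "Fake title" without calling yt-dlp.
    print(asyncio.run(fetch_title(FakeExtractor(), "https://youtu.be/dQw4w9WgXcQ")))
```

Because protocols are structural, `FakeExtractor` does not need to inherit from anything to be accepted where a `YouTubeMetadataExtractor` is expected.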

Error Handling

The service includes comprehensive error handling:

Invalid URLs: Validates YouTube URL format
Network Issues: Handles connection timeouts
yt-dlp Errors: Captures and logs extraction failures
Database Errors: Handles database connection issues
Missing Dependencies: Checks for required tools
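
One way to organize these categories is a small exception hierarchy plus an up-front dependency check. The class names and `ensure_tool_available` below are hypothetical and only sketch the idea; the service's real exception types are not shown in this document.

```python
import shutil


class YouTubeServiceError(Exception):
    """Hypothetical base class for errors raised by the service."""


class InvalidURLError(YouTubeServiceError):
    """The input is not a recognized YouTube URL."""


class ExtractionError(YouTubeServiceError):
    """yt-dlp failed or timed out while extracting metadata."""


class MissingDependencyError(YouTubeServiceError):
    """A required external tool is not installed."""


def ensure_tool_available(tool: str = "yt-dlp") -> str:
    """Return the full path of `tool`, or raise if it is not on PATH."""
    path = shutil.which(tool)
    if path is None:
        raise MissingDependencyError(f"{tool} not available")
    return path
```

A shared base class lets callers catch `YouTubeServiceError` once while still distinguishing specific failures when they need to.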

Testing

Run the tests with:

# Run all YouTube service tests
uv run pytest tests/test_youtube_service.py -v

# Run specific test class
uv run pytest tests/test_youtube_service.py::TestCurlYouTubeExtractor -v

# Run with coverage
uv run pytest tests/test_youtube_service.py --cov=src.services.youtube_service --cov=src.repositories.youtube_repository

Example Script

Run the example script to see the service in action:

uv run python examples/youtube_metadata_example.py

Troubleshooting

Common Issues

yt-dlp not found
```
Error: yt-dlp not available
```
Solution: Install yt-dlp using pip or your system package manager.

Database connection error
```
Error: Could not connect to database
```
Solution: Ensure PostgreSQL is running and DATABASE_URL is correct.

Video not found
```
Error: Failed to extract metadata: Video not found
```
Solution: Check that the YouTube URL is valid and accessible.

Permission denied
```
Error: Permission denied when running yt-dlp
```
Solution: Ensure yt-dlp has execute permissions.

Health Check

Check service health:

from src.services.youtube_service import YouTubeMetadataService

service = YouTubeMetadataService()
health = service.get_health_status()
print(health)

This will show:

Service status
yt-dlp availability
Cache directory location

Performance

Extraction Time: ~2-5 seconds per video (depends on network)
Database Operations: <100ms for most operations
Memory Usage: <50MB for typical usage
Concurrent Requests: Limited by yt-dlp and database connections
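
If several extractions are started at once, a caller may want to cap how many run in parallel so that yt-dlp processes and database connections are not exhausted. The following is a minimal sketch of that idea with `asyncio.Semaphore`; `extract_many`, its parameters, and the default limit of 3 are assumptions, not part of the service.

```python
import asyncio
from typing import Awaitable, Callable, List


async def extract_many(
    urls: List[str],
    extract_one: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 3,
) -> List[dict]:
    """Call extract_one for each URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> dict:
        # Wait for a free slot before starting the extraction.
        async with semaphore:
            return await extract_one(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(url) for url in urls))
```

Here `extract_one` stands in for whatever coroutine performs a single extraction, which keeps the limiter independent of the extractor implementation.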

Security Considerations

No API keys required (uses public YouTube data)
Local caching for performance
Input validation for URLs
SQL injection protection via parameterized queries
No sensitive data stored
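
To illustrate why parameterized queries protect against SQL injection, here is a self-contained sketch. It uses the standard library's `sqlite3` with an in-memory database purely as a stand-in so the example runs without a server; this is an assumption for illustration, and PostgreSQL drivers use different placeholder syntax (for example `%s` or `$1`). The two-column table is a simplified version of `youtube_videos`.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE youtube_videos (youtube_id TEXT PRIMARY KEY, title TEXT NOT NULL)"
)

# A title crafted to look like SQL.
hostile_title = "x'); DROP TABLE youtube_videos; --"

# Values are passed separately from the SQL text, so the driver treats
# the title as data and never parses it as SQL.
conn.execute(
    "INSERT INTO youtube_videos (youtube_id, title) VALUES (?, ?)",
    ("dQw4w9WgXcQ", hostile_title),
)

row = conn.execute(
    "SELECT title FROM youtube_videos WHERE youtube_id = ?",
    ("dQw4w9WgXcQ",),
).fetchone()
stored_title = row[0]

# The table still exists and holds the title exactly as supplied.
row_count = conn.execute("SELECT COUNT(*) FROM youtube_videos").fetchone()[0]
```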

Future Enhancements

Batch processing for multiple URLs
Caching extracted metadata
Support for playlists
Video thumbnail extraction
Automatic metadata refresh
Integration with transcription pipeline
