
# YouTube Metadata Extraction Service

The YouTube metadata extraction service extracts metadata for YouTube videos from their URLs without using the YouTube API. It uses `yt-dlp` (a fork of youtube-dl) to fetch the metadata and stores it in a PostgreSQL database.

## Features

- ✅ Extract metadata from YouTube URLs using `yt-dlp`
- ✅ Store metadata in PostgreSQL database
- ✅ Support for various YouTube URL formats
- ✅ CLI commands for easy management
- ✅ Protocol-based architecture for easy testing
- ✅ Comprehensive error handling
- ✅ Health status monitoring

## Installation

### Prerequisites

1. **yt-dlp**: Install the YouTube downloader tool
   ```bash
   # Using pip
   pip install yt-dlp

   # Using uv (recommended)
   uv pip install yt-dlp

   # Using system package manager
   # Ubuntu/Debian
   sudo apt install yt-dlp

   # macOS
   brew install yt-dlp
   ```

2. **Database**: Ensure PostgreSQL is running and the database is set up
   ```bash
   # Run database migrations
   alembic upgrade head
   ```

### Dependencies

Install the required Python packages:
```bash
uv pip install -r requirements-youtube.txt
```

## Usage

### CLI Commands

The YouTube service is integrated into the main CLI with the `youtube` command group:

#### Extract Metadata
```bash
# Extract metadata from a YouTube URL
# (quote the URL so the shell does not interpret ? or &)
trax youtube extract "https://youtube.com/watch?v=dQw4w9WgXcQ"

# Force re-extraction even if video exists
trax youtube extract "https://youtube.com/watch?v=dQw4w9WgXcQ" --force
```

#### List Videos
```bash
# List recent videos (default: 10)
trax youtube list

# List more videos
trax youtube list --limit 20

# Search by title
trax youtube list --search "python tutorial"

# Filter by channel
trax youtube list --channel "Tech Channel"
```

#### Show Video Details
```bash
# Show detailed information for a video
trax youtube show dQw4w9WgXcQ
```

#### Statistics
```bash
# Show YouTube video statistics
trax youtube stats
```

#### Delete Video
```bash
# Delete a video from database
trax youtube delete dQw4w9WgXcQ
```

### Programmatic Usage

```python
import asyncio
from src.services.youtube_service import YouTubeMetadataService
from src.repositories.youtube_repository import YouTubeRepository

async def example():
    # Initialize service
    service = YouTubeMetadataService()
    await service.initialize()

    # Extract and store metadata
    video = await service.extract_and_store_metadata(
        "https://youtube.com/watch?v=dQw4w9WgXcQ"
    )

    print(f"Title: {video.title}")
    print(f"Channel: {video.channel}")
    print(f"Duration: {video.duration_seconds} seconds")

    # Use repository for database operations
    repo = YouTubeRepository()
    videos = await repo.list_all(limit=10)
    stats = await repo.get_statistics()

# Run the example
asyncio.run(example())
```

## Supported URL Formats

The service supports various YouTube URL formats:

- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://www.youtube.com/v/VIDEO_ID`
- URLs with additional parameters (e.g., `&t=30s`)
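
To make the accepted formats concrete, here is a minimal sketch of how a video ID could be pulled out of each URL shape listed above. The function name `extract_video_id` and the host sets are assumptions for illustration; the service's actual parsing logic may differ.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hosts accepted by this sketch (assumed; the real service may accept others).
_WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
_SHORT_HOSTS = {"youtu.be"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a supported YouTube URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path_parts = [part for part in parsed.path.split("/") if part]

    if host in _SHORT_HOSTS:
        # https://youtu.be/VIDEO_ID
        return path_parts[0] if path_parts else None

    if host in _WATCH_HOSTS:
        if parsed.path == "/watch":
            # https://www.youtube.com/watch?v=VIDEO_ID&t=30s
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        if len(path_parts) == 2 and path_parts[0] in {"embed", "v"}:
            # https://www.youtube.com/embed/VIDEO_ID or /v/VIDEO_ID
            return path_parts[1]

    # Anything else is treated as unsupported.
    return None
```

Parsing the query string with `parse_qs` rather than a regular expression means extra parameters such as `t=30s` are ignored regardless of their order.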

## Extracted Metadata

The service extracts the following metadata:

- **YouTube ID**: Unique identifier for the video
- **Title**: Video title
- **Channel**: Uploader/channel name
- **Description**: Video description
- **Duration**: Video length in seconds
- **URL**: Original YouTube URL
- **Metadata Extracted At**: Timestamp of extraction
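
As an illustration of where these fields come from, the sketch below maps the JSON dictionary that `yt-dlp` prints for a video onto the fields listed above. The keys `id`, `title`, `channel`, `uploader`, `description`, and `duration` are standard `yt-dlp` output fields; the `VideoMetadata` dataclass, the `metadata_from_info` helper, and the `"Unknown"` fallback are hypothetical and not taken from the service's code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


@dataclass
class VideoMetadata:
    """Hypothetical container mirroring the fields listed above."""

    youtube_id: str
    title: str
    channel: str
    description: Optional[str]
    duration_seconds: int
    url: str
    metadata_extracted_at: datetime


def metadata_from_info(info: Mapping[str, Any], url: str) -> VideoMetadata:
    """Build a VideoMetadata from a yt-dlp info dictionary."""
    return VideoMetadata(
        youtube_id=info["id"],
        title=info["title"],
        # Prefer the channel name and fall back to the uploader.
        channel=info.get("channel") or info.get("uploader") or "Unknown",
        description=info.get("description"),
        # Duration can be missing (for example, live streams); default to 0.
        duration_seconds=int(info.get("duration") or 0),
        url=url,
        metadata_extracted_at=datetime.now(timezone.utc),
    )
```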

## Database Schema

The metadata is stored in the `youtube_videos` table:

```sql
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
```

## Architecture

The service follows the protocol-based architecture pattern:

### Components

1. **YouTubeMetadataService**: Main service class
   - Manages the extraction workflow
   - Handles database operations
   - Provides health monitoring

2. **CurlYouTubeExtractor**: Metadata extraction implementation
   - Uses `yt-dlp` for metadata extraction
   - Handles various URL formats
   - Provides error handling

3. **YouTubeRepository**: Database operations
   - CRUD operations for YouTube videos
   - Search and filtering capabilities
   - Statistics generation
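
The extraction step can be sketched as follows, assuming the extractor runs `yt-dlp` as a subprocess and parses its JSON output. How `CurlYouTubeExtractor` actually invokes `yt-dlp` is not documented here, so `build_command`, `fetch_info`, and the 30-second timeout are illustrative assumptions.

```python
import asyncio
import json
from typing import Any, Dict, List


def build_command(url: str) -> List[str]:
    """Build the yt-dlp argument list (hypothetical helper)."""
    # --dump-json prints the video's metadata as JSON without downloading media;
    # --no-playlist restricts the run to a single video.
    return ["yt-dlp", "--dump-json", "--no-playlist", url]


async def fetch_info(url: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Run yt-dlp and return the parsed metadata dictionary."""
    # Passing an argument list (no shell) keeps the URL from being
    # interpreted by a shell.
    proc = await asyncio.create_subprocess_exec(
        *build_command(url),
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave a stuck yt-dlp process behind.
        proc.kill()
        await proc.wait()
        raise
    if proc.returncode != 0:
        message = stderr.decode(errors="replace").strip()
        raise RuntimeError(f"yt-dlp failed: {message}")
    return json.loads(stdout)
```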

### Protocols

- `YouTubeMetadataExtractor`: Protocol for metadata extraction
- `YouTubeRepositoryProtocol`: Protocol for repository operations

Because callers depend on these protocols rather than on concrete classes, implementations can be swapped or replaced with test doubles.
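
A minimal sketch of the idea, using `typing.Protocol`. The method name `extract_metadata`, its signature, and the `MetadataExtractor` and `FakeExtractor` classes are assumptions for illustration; check the actual `YouTubeMetadataExtractor` protocol in the source for the real interface.

```python
import asyncio
from typing import Any, Dict, Protocol


class MetadataExtractor(Protocol):
    """Hypothetical stand-in for the YouTubeMetadataExtractor protocol."""

    async def extract_metadata(self, url: str) -> Dict[str, Any]: ...


class FakeExtractor:
    """Test double: returns canned metadata without touching the network."""

    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        return {"youtube_id": "dQw4w9WgXcQ", "title": "Example", "url": url}


async def describe(extractor: MetadataExtractor, url: str) -> str:
    # Depends only on the protocol, so any conforming object works here.
    data = await extractor.extract_metadata(url)
    return f"{data['youtube_id']}: {data['title']}"


summary = asyncio.run(describe(FakeExtractor(), "https://youtu.be/dQw4w9WgXcQ"))
print(summary)
```

`FakeExtractor` never inherits from the protocol; it conforms structurally by having a method with a matching signature, which is what makes such doubles cheap to write in tests.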

## Error Handling

The service includes comprehensive error handling:

- **Invalid URLs**: Validates YouTube URL format
- **Network Issues**: Handles connection timeouts
- **yt-dlp Errors**: Captures and logs extraction failures
- **Database Errors**: Handles database connection issues
- **Missing Dependencies**: Checks for required tools
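
One way to organise these categories is a small exception hierarchy plus a function that translates low-level failures into it. The class names and `translate_error` below are hypothetical and only sketch the approach; the service's real exception types are not shown in this document.

```python
import asyncio


class YouTubeServiceError(Exception):
    """Base class so callers can catch every service failure at once."""


class InvalidURLError(YouTubeServiceError):
    """The input is not a recognised YouTube URL."""


class NetworkError(YouTubeServiceError):
    """Connection failure or timeout."""


class ExtractionError(YouTubeServiceError):
    """yt-dlp ran but could not extract metadata."""


class MissingDependencyError(YouTubeServiceError):
    """A required external tool is not installed."""


def translate_error(exc: Exception) -> YouTubeServiceError:
    """Map a low-level exception onto the service's error categories."""
    if isinstance(exc, YouTubeServiceError):
        return exc
    if isinstance(exc, FileNotFoundError):
        # Raised when the yt-dlp executable cannot be launched.
        return MissingDependencyError("yt-dlp not available")
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError, ConnectionError)):
        return NetworkError(f"network problem: {exc!r}")
    return ExtractionError(str(exc))
```

A caller can then wrap extraction in `try` / `except Exception as exc: raise translate_error(exc) from exc` and handle a single family of exceptions.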

## Testing

Run the tests with:

```bash
# Run all YouTube service tests
uv run pytest tests/test_youtube_service.py -v

# Run specific test class
uv run pytest tests/test_youtube_service.py::TestCurlYouTubeExtractor -v

# Run with coverage
uv run pytest tests/test_youtube_service.py --cov=src.services.youtube_service --cov=src.repositories.youtube_repository
```

## Example Script

Run the example script to see the service in action:

```bash
uv run python examples/youtube_metadata_example.py
```

## Troubleshooting

### Common Issues

1. **yt-dlp not found**
   ```
   Error: yt-dlp not available
   ```
   **Solution**: Install yt-dlp using pip or your system package manager

2. **Database connection error**
   ```
   Error: Could not connect to database
   ```
   **Solution**: Ensure PostgreSQL is running and `DATABASE_URL` is correct

3. **Video not found**
   ```
   Error: Failed to extract metadata: Video not found
   ```
   **Solution**: Check if the YouTube URL is valid and accessible

4. **Permission denied**
   ```
   Error: Permission denied when running yt-dlp
   ```
   **Solution**: Ensure yt-dlp has execute permissions

### Health Check

Check service health:

```python
from src.services.youtube_service import YouTubeMetadataService

service = YouTubeMetadataService()
health = service.get_health_status()
print(health)
```

This will show:
- Service status
- yt-dlp availability
- Cache directory location
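
For orientation, a health check covering these three items could be assembled roughly as below. The dictionary keys, the `"healthy"` / `"degraded"` values, and the default cache path are assumptions, not the actual return value of `get_health_status()`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def health_status(cache_dir: Optional[Path] = None) -> Dict[str, Any]:
    """Hypothetical health report: status, yt-dlp availability, cache dir."""
    # Assumed default location; the real service may use a different path.
    cache_dir = cache_dir or Path(tempfile.gettempdir()) / "trax-youtube"
    # shutil.which returns None when yt-dlp is not on PATH.
    yt_dlp_path = shutil.which("yt-dlp")
    return {
        "status": "healthy" if yt_dlp_path else "degraded",
        "yt_dlp_available": yt_dlp_path is not None,
        "cache_dir": str(cache_dir),
    }
```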

## Performance

- **Extraction Time**: ~2-5 seconds per video (depends on network)
- **Database Operations**: <100ms for most operations
- **Memory Usage**: <50MB for typical usage
- **Concurrent Requests**: Limited by yt-dlp and database connections
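
If you call the service for many URLs at once, capping the number of in-flight extractions keeps `yt-dlp` processes and database connections within bounds. The sketch below uses `asyncio.Semaphore`; `gather_limited`, the `worker` parameter, and the default limit of 3 are illustrative and not part of the service.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def gather_limited(
    urls: List[str],
    worker: Callable[[str], Awaitable[T]],
    max_concurrent: int = 3,
) -> List[T]:
    """Run worker(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(url: str) -> T:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await worker(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run_one(url) for url in urls))
```

With the service from the programmatic example, `worker` could be `service.extract_and_store_metadata`.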

## Security Considerations

- No API keys required (uses public YouTube data)
- Local caching for performance
- Input validation for URLs
- SQL injection protection via parameterized queries
- No sensitive data stored
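
To show what parameterized queries protect against, here is a small self-contained sketch. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL so that it runs anywhere; the table is a cut-down version of `youtube_videos`, and `insert_video` and `find_by_title` are hypothetical helpers rather than the repository's real methods.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE youtube_videos (youtube_id TEXT PRIMARY KEY, title TEXT NOT NULL)"
)


def insert_video(db: sqlite3.Connection, youtube_id: str, title: str) -> None:
    # Values are passed separately from the SQL text, so they are never
    # interpreted as SQL, whatever characters they contain.
    db.execute(
        "INSERT INTO youtube_videos (youtube_id, title) VALUES (?, ?)",
        (youtube_id, title),
    )


def find_by_title(db: sqlite3.Connection, term: str) -> List[str]:
    rows = db.execute(
        "SELECT youtube_id FROM youtube_videos WHERE title LIKE ?",
        (f"%{term}%",),
    ).fetchall()
    return [row[0] for row in rows]


# A hostile-looking title is stored as plain text and the table survives.
insert_video(conn, "dQw4w9WgXcQ", "x'); DROP TABLE youtube_videos;--")
```

PostgreSQL drivers use the same principle with a different placeholder syntax (for example `%s` or `$1`, depending on the driver).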

## Future Enhancements

- [ ] Batch processing for multiple URLs
- [ ] Caching extracted metadata
- [ ] Support for playlists
- [ ] Video thumbnail extraction
- [ ] Automatic metadata refresh
- [ ] Integration with transcription pipeline