trax/docs/DATABASE.md

# Database Schema Documentation
Complete reference for Trax PostgreSQL database schema with JSONB support.
## Overview
Trax uses PostgreSQL 15+ with the following key features:
- **JSONB columns** for flexible metadata storage
- **UUID primary keys** for distributed system compatibility
- **Registry pattern** to prevent SQLAlchemy "multiple classes" errors
- **Timestamp mixins** for automatic created_at/updated_at tracking
- **Version tracking** for iterative pipeline results
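The timestamp mixin's behaviour can be pictured in plain Python. This is a sketch of the semantics only; the real mixin would declare SQLAlchemy `Column` values with `default`/`onupdate` rather than setting attributes by hand:

```python
from datetime import datetime, timezone

class TimestampedMixin:
    """Plain-Python sketch of created_at/updated_at tracking.

    Stands in for the ORM mixin to show intended semantics; the real
    implementation uses SQLAlchemy Column defaults, not __init__.
    """

    def __init__(self) -> None:
        now = datetime.now(timezone.utc)
        self.created_at = now
        self.updated_at = now

    def touch(self) -> None:
        # Mirrors the ON UPDATE behaviour: only updated_at moves.
        self.updated_at = datetime.now(timezone.utc)
```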
## Core Tables
### youtube_videos
Stores YouTube video metadata extracted via curl (no API required).
```sql
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_youtube_videos_youtube_id ON youtube_videos(youtube_id);
```
**Key Features:**
- **No API dependency** - Metadata extracted via curl scraping
- **Unique constraint** on youtube_id prevents duplicates
- **Flexible description** - `TEXT` column stores descriptions of any length
### media_files
Central table for all media files requiring transcription.
```sql
CREATE TABLE media_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(255) NOT NULL,
    file_size BIGINT NOT NULL,
    duration FLOAT,
    mime_type VARCHAR(100),
    source_path TEXT NOT NULL,
    local_path TEXT,
    file_hash VARCHAR(64) UNIQUE,
    file_metadata JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    youtube_video_id UUID REFERENCES youtube_videos(id),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_media_files_status ON media_files(status);
CREATE INDEX idx_media_files_hash ON media_files(file_hash);
CREATE INDEX idx_media_files_youtube_video_id ON media_files(youtube_video_id);
```
**Status Values:**
- `pending` - File identified, not yet processed
- `downloading` - Currently downloading from source
- `processing` - Audio preprocessing in progress
- `ready` - Ready for transcription
- `failed` - Processing failed
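These states imply an ordering. A hedged sketch of one plausible transition map follows; the allowed edges are an assumption for illustration, not taken from the Trax source:

```python
# Assumed transition map for media_files.status; edges are illustrative.
VALID_TRANSITIONS = {
    "pending": {"downloading", "processing", "failed"},
    "downloading": {"processing", "failed"},
    "processing": {"ready", "failed"},
    "ready": set(),   # terminal: handed off to transcription
    "failed": set(),  # terminal: requires manual intervention
}

def can_transition(current: str, target: str) -> bool:
    """Return True if a status change is allowed under the map above."""
    return target in VALID_TRANSITIONS.get(current, set())
```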
**JSONB file_metadata Example:**
```json
{
  "format": "mp4",
  "codec": "aac",
  "bitrate": 128000,
  "sample_rate": 44100,
  "channels": 2,
  "ffmpeg_info": {...}
}
```
### transcription_jobs
Tracks individual transcription requests with retry logic.
```sql
CREATE TABLE transcription_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    priority INTEGER DEFAULT 0,
    model_config JSONB DEFAULT '{}',
    processing_options JSONB DEFAULT '{}',
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    processing_time FLOAT,
    error_message TEXT,
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_jobs_status ON transcription_jobs(status);
CREATE INDEX idx_transcription_jobs_priority ON transcription_jobs(priority);
CREATE INDEX idx_transcription_jobs_media_file_id ON transcription_jobs(media_file_id);
```
**Status Values:**
- `pending` - Queued for processing
- `processing` - Currently being transcribed
- `completed` - Successfully completed
- `failed` - Failed after max retries
**model_config Example:**
```json
{
  "model": "distil-large-v3",
  "language": "en",
  "temperature": 0.0,
  "response_format": "verbose_json"
}
```
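The `retry_count`/`max_retries` columns support a simple requeue policy. A sketch of the decision, with an illustrative function name (not from the Trax source); the default mirrors the schema's `max_retries`:

```python
def next_status_after_error(retry_count: int, max_retries: int = 3) -> str:
    """Decide a job's status after a failed attempt.

    While retries remain, the job is re-queued as 'pending'; once
    retry_count reaches max_retries it is marked 'failed', matching
    the status list above.
    """
    return "pending" if retry_count < max_retries else "failed"
```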
### transcription_results
Stores actual transcription outputs with versioning support.
```sql
CREATE TABLE transcription_results (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES transcription_jobs(id),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    pipeline_version VARCHAR(10) DEFAULT 'v1' NOT NULL,
    content JSONB NOT NULL,
    segments JSONB,
    confidence_scores JSONB,
    speaker_info JSONB,
    accuracy FLOAT,
    word_count INTEGER,
    processing_time FLOAT,
    model_used VARCHAR(100),
    model_config JSONB,
    parent_result_id UUID REFERENCES transcription_results(id),
    version INTEGER DEFAULT 1 NOT NULL,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_results_pipeline_version ON transcription_results(pipeline_version);
CREATE INDEX idx_transcription_results_media_file_id ON transcription_results(media_file_id);
CREATE INDEX idx_transcription_results_parent_result_id ON transcription_results(parent_result_id);
```
**Pipeline Versions:**
- `v1` - Whisper distil-large-v3 only
- `v2` - Whisper + DeepSeek enhancement
- `v3` - Multi-pass accuracy optimization
- `v4` - Speaker diarization support
**content JSONB Example (v1):**
```json
{
  "text": "Complete transcript text here...",
  "language": "en",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 1.5,
      "text": "Hello world",
      "confidence": 0.95
    }
  ]
}
```
**content JSONB Example (v2 - Enhanced):**
```json
{
  "original_text": "Original Whisper output...",
  "enhanced_text": "Enhanced and corrected text...",
  "improvements": [
    "Fixed grammar in sentence 3",
    "Corrected technical terms",
    "Added punctuation"
  ],
  "enhancement_metadata": {
    "model": "deepseek-chat",
    "confidence": 0.98,
    "processing_time_ms": 2500
  }
}
```
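`parent_result_id` and `version` chain re-processed results back to the original. A sketch of walking that chain, with plain dicts standing in for ORM rows:

```python
def version_chain(results_by_id: dict, leaf_id) -> list:
    """Return result rows from the original (version 1) down to `leaf_id`.

    `results_by_id` maps result id -> dict with a 'parent_result_id' key;
    plain dicts stand in for ORM rows in this sketch.
    """
    chain = []
    current = results_by_id.get(leaf_id)
    while current is not None:
        chain.append(current)
        current = results_by_id.get(current.get("parent_result_id"))
    return list(reversed(chain))  # oldest first
```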
### processing_jobs
Tracks batch processing operations.
```sql
CREATE TABLE processing_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    config JSONB DEFAULT '{}',
    file_patterns JSONB,
    total_items INTEGER DEFAULT 0,
    processed_items INTEGER DEFAULT 0,
    successful_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_processing_jobs_type ON processing_jobs(job_type);
CREATE INDEX idx_processing_jobs_status ON processing_jobs(status);
```
## Relationships
```
youtube_videos (1) ──→ (0..n) media_files
media_files (1) ──→ (0..n) transcription_jobs
transcription_jobs (1) ──→ (0..n) transcription_results
transcription_results (0..1) ──→ (0..n) transcription_results (parent/child versioning)
```
## Registry Pattern Implementation
Trax uses SQLAlchemy's registry pattern to prevent "multiple classes" errors:
```python
# src/database/__init__.py
from sqlalchemy.orm import declarative_base, registry

mapper_registry = registry()
Base = declarative_base(registry=mapper_registry)

def register_model(cls):
    """Register a model class to prevent multiple registration errors."""
    return cls

# src/database/models.py
@register_model
class YouTubeVideo(Base, TimestampedMixin):
    __tablename__ = "youtube_videos"
    # ... model definition
```
## JSONB Query Examples
### Search transcript content
```sql
-- Find transcripts containing specific text
SELECT tr.id, mf.filename, tr.content->>'text'
FROM transcription_results tr
JOIN media_files mf ON tr.media_file_id = mf.id
WHERE tr.content->>'text' ILIKE '%machine learning%';

-- Find high-confidence segments
SELECT tr.id, segment->>'text', segment->>'confidence'
FROM transcription_results tr,
     jsonb_array_elements(tr.content->'segments') segment
WHERE (segment->>'confidence')::float > 0.95;
```
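The segment filter above can also run client-side once a `content` document has been fetched. A stdlib-only sketch:

```python
import json

def high_confidence_segments(content_json: str, threshold: float = 0.95) -> list:
    """Mirror of the jsonb_array_elements query: keep segments whose
    confidence exceeds `threshold`. Input is the raw JSONB content string."""
    content = json.loads(content_json)
    return [
        seg for seg in content.get("segments", [])
        if float(seg.get("confidence", 0.0)) > threshold
    ]
```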
### Query file metadata
```sql
-- Find files by format
SELECT filename, file_metadata->>'format', duration
FROM media_files
WHERE file_metadata->>'format' = 'mp4';

-- Files larger than 100 MiB
SELECT filename, file_size, file_metadata->>'bitrate'
FROM media_files
WHERE file_size > 104857600;
```
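The `104857600` literal is 100 MiB. A tiny helper (name is illustrative, not from the Trax source) keeps such thresholds readable:

```python
def mib(n: int) -> int:
    """Convert mebibytes to bytes; mib(100) matches the literal above."""
    return n * 1024 * 1024
```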
### Performance analytics
```sql
-- Average processing time by pipeline version
SELECT
    pipeline_version,
    AVG(processing_time) AS avg_time,
    AVG(accuracy) AS avg_accuracy,
    COUNT(*) AS total_transcripts
FROM transcription_results
GROUP BY pipeline_version;

-- Failed jobs analysis
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS failed_jobs,
    COUNT(DISTINCT media_file_id) AS unique_files
FROM transcription_jobs
WHERE status = 'failed'
GROUP BY DATE(created_at)
ORDER BY date DESC;
```
## Migrations
Database schema is managed via Alembic migrations:
```bash
# Create new migration
uv run alembic revision -m "Add new feature"

# Apply migrations
uv run alembic upgrade head

# Downgrade one revision
uv run alembic downgrade -1

# Show current revision
uv run alembic current
```
## Performance Considerations
### Indexes
- All foreign keys are indexed
- Status columns are indexed for filtering
- JSONB columns can be given GIN indexes for containment and text-search queries (not part of the base schema above; add per workload)
### Partitioning (Future)
For high-volume usage, consider partitioning large tables:
- `transcription_results` by `created_at` (monthly partitions)
- `media_files` by `status` and `created_at`
### Connection Pooling
PostgreSQL connection pooling is configured as follows:
- Pool size: 20 connections
- Max overflow: 30 connections
- Pool timeout: 30 seconds
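A sketch of how these pool figures might be passed to SQLAlchemy. The DSN is a placeholder and the psycopg driver is an assumption, not the project's actual configuration:

```python
from sqlalchemy import create_engine

# Placeholder DSN; pool settings mirror the figures above.
engine = create_engine(
    "postgresql+psycopg://user:password@localhost:5432/trax",
    pool_size=20,      # baseline connections held open
    max_overflow=30,   # extra connections allowed under load
    pool_timeout=30,   # seconds to wait for a free connection
)
```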
For complete schema updates, see migration files in `migrations/versions/`.