
Database Schema Documentation

Complete reference for Trax PostgreSQL database schema with JSONB support.

Overview

Trax uses PostgreSQL 15+ with the following key features:

  • JSONB columns for flexible metadata storage
  • UUID primary keys for distributed system compatibility
  • Registry pattern to prevent SQLAlchemy "multiple classes" errors
  • Timestamp mixins for automatic created_at/updated_at tracking
  • Version tracking for iterative pipeline results

Core Tables

youtube_videos

Stores YouTube video metadata extracted via curl (no API required).

CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_youtube_videos_youtube_id ON youtube_videos(youtube_id);

Key Features:

  • No API dependency - Metadata extracted via curl scraping
  • Unique constraint on youtube_id prevents duplicates
  • TEXT description column stores content of any length

media_files

Central table for all media files requiring transcription.

CREATE TABLE media_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(255) NOT NULL,
    file_size BIGINT NOT NULL,
    duration FLOAT,
    mime_type VARCHAR(100),
    source_path TEXT NOT NULL,
    local_path TEXT,
    file_hash VARCHAR(64) UNIQUE,
    file_metadata JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    youtube_video_id UUID REFERENCES youtube_videos(id),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_media_files_status ON media_files(status);
CREATE INDEX idx_media_files_hash ON media_files(file_hash);
CREATE INDEX idx_media_files_youtube_video_id ON media_files(youtube_video_id);

Status Values:

  • pending - File identified, not yet processed
  • downloading - Currently downloading from source
  • processing - Audio preprocessing in progress
  • ready - Ready for transcription
  • failed - Processing failed
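The lifecycle above can be sketched as a transition map. This helper is hypothetical (not part of the Trax codebase), and the allowed transitions are an assumption inferred from the documented status values:

```python
# Assumed status lifecycle for media_files; the transition map is inferred
# from the documented status values, not taken from the Trax source.
ALLOWED_TRANSITIONS = {
    "pending": {"downloading", "processing", "failed"},
    "downloading": {"processing", "failed"},
    "processing": {"ready", "failed"},
    "ready": set(),            # terminal: hand-off to transcription
    "failed": {"pending"},     # assumption: failed files may be re-queued
}

def can_transition(current: str, new: str) -> bool:
    """Return True if a media file may move from `current` to `new` status."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

A worker would check `can_transition(row.status, "processing")` before claiming a file.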

JSONB file_metadata Example:

{
    "format": "mp4",
    "codec": "aac",
    "bitrate": 128000,
    "sample_rate": 44100,
    "channels": 2,
    "ffmpeg_info": {...}
}

transcription_jobs

Tracks individual transcription requests with retry logic.

CREATE TABLE transcription_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    priority INTEGER DEFAULT 0,
    model_config JSONB DEFAULT '{}',
    processing_options JSONB DEFAULT '{}',
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    processing_time FLOAT,
    error_message TEXT,
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_jobs_status ON transcription_jobs(status);
CREATE INDEX idx_transcription_jobs_priority ON transcription_jobs(priority);
CREATE INDEX idx_transcription_jobs_media_file_id ON transcription_jobs(media_file_id);

Status Values:

  • pending - Queued for processing
  • processing - Currently being transcribed
  • completed - Successfully completed
  • failed - Failed after max retries
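The `retry_count`/`max_retries` columns support retry logic along these lines. This is a hypothetical sketch: the `max_retries` default matches the schema, but the exponential backoff is an assumption, not documented behaviour:

```python
def should_retry(retry_count: int, max_retries: int = 3) -> bool:
    """True while the job has retries left (schema default: max_retries = 3)."""
    return retry_count < max_retries

def backoff_seconds(retry_count: int, base: float = 2.0) -> float:
    """Delay before the next attempt. Exponential backoff is an assumption."""
    return base ** retry_count
```

A job moves to `failed` only once `should_retry` returns False.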

model_config Example:

{
    "model": "distil-large-v3",
    "language": "en",
    "temperature": 0.0,
    "response_format": "verbose_json"
}
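Since `model_config` defaults to `'{}'`, per-job options are naturally merged over application defaults. A minimal sketch (the helper name and defaults dict are assumptions; the default values mirror the example above):

```python
# Defaults mirror the model_config example above; the merge helper itself
# is hypothetical, not part of the documented Trax API.
DEFAULT_MODEL_CONFIG = {
    "model": "distil-large-v3",
    "language": "en",
    "temperature": 0.0,
    "response_format": "verbose_json",
}

def resolve_model_config(overrides=None):
    """Merge per-job overrides over the defaults; overrides win."""
    return {**DEFAULT_MODEL_CONFIG, **(overrides or {})}
```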

transcription_results

Stores actual transcription outputs with versioning support.

CREATE TABLE transcription_results (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES transcription_jobs(id),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    pipeline_version VARCHAR(10) DEFAULT 'v1' NOT NULL,
    content JSONB NOT NULL,
    segments JSONB,
    confidence_scores JSONB,
    speaker_info JSONB,
    accuracy FLOAT,
    word_count INTEGER,
    processing_time FLOAT,
    model_used VARCHAR(100),
    model_config JSONB,
    parent_result_id UUID REFERENCES transcription_results(id),
    version INTEGER DEFAULT 1 NOT NULL,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_results_pipeline_version ON transcription_results(pipeline_version);
CREATE INDEX idx_transcription_results_media_file_id ON transcription_results(media_file_id);
CREATE INDEX idx_transcription_results_parent_result_id ON transcription_results(parent_result_id);

Pipeline Versions:

  • v1 - Whisper distil-large-v3 only
  • v2 - Whisper + DeepSeek enhancement
  • v3 - Multi-pass accuracy optimization
  • v4 - Speaker diarization support

content JSONB Example (v1):

{
    "text": "Complete transcript text here...",
    "language": "en",
    "segments": [
        {
            "id": 0,
            "start": 0.0,
            "end": 1.5,
            "text": "Hello world",
            "confidence": 0.95
        }
    ]
}
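The denormalized `word_count` and `accuracy` columns can be derived from a v1 content document like the one above. Averaging segment confidences into `accuracy` is an assumption for illustration, not a documented formula:

```python
def summarize_content(content: dict) -> dict:
    """Derive word_count and accuracy from a v1 content document.

    Treating accuracy as the mean segment confidence is an assumption,
    not the documented Trax computation.
    """
    segments = content.get("segments", [])
    confidences = [s["confidence"] for s in segments if "confidence" in s]
    return {
        "word_count": len(content.get("text", "").split()),
        "accuracy": sum(confidences) / len(confidences) if confidences else None,
    }
```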

content JSONB Example (v2 - Enhanced):

{
    "original_text": "Original Whisper output...",
    "enhanced_text": "Enhanced and corrected text...",
    "improvements": [
        "Fixed grammar in sentence 3",
        "Corrected technical terms",
        "Added punctuation"
    ],
    "enhancement_metadata": {
        "model": "deepseek-chat",
        "confidence": 0.98,
        "processing_time_ms": 2500
    }
}

processing_jobs

Tracks batch processing operations.

CREATE TABLE processing_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    config JSONB DEFAULT '{}',
    file_patterns JSONB,
    total_items INTEGER DEFAULT 0,
    processed_items INTEGER DEFAULT 0,
    successful_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_processing_jobs_type ON processing_jobs(job_type);
CREATE INDEX idx_processing_jobs_status ON processing_jobs(status);

Relationships

youtube_videos (1) ──→ (0..n) media_files
media_files (1) ──→ (0..n) transcription_jobs  
transcription_jobs (1) ──→ (0..n) transcription_results
transcription_results (0..1) ──→ (0..n) transcription_results (parent/child versioning)
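The self-referencing `parent_result_id` column forms a version chain that can be walked back to the root result. A minimal sketch over in-memory rows (the helper is hypothetical; `results_by_id` stands in for rows fetched from `transcription_results`):

```python
def version_chain(results_by_id: dict, result_id):
    """Walk parent_result_id links from a result back to its root version.

    results_by_id maps id -> {"parent_result_id": ...}; a stand-in for
    rows fetched from transcription_results. Returns ids newest-first.
    """
    chain = []
    current = result_id
    while current is not None:
        chain.append(current)
        current = results_by_id[current]["parent_result_id"]
    return chain
```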

Registry Pattern Implementation

Trax uses SQLAlchemy's registry pattern to prevent "multiple classes" errors:

# src/database/__init__.py
from sqlalchemy.orm import registry

mapper_registry = registry()
Base = mapper_registry.generate_base()

_registered = {}

def register_model(cls):
    """Return the previously registered class instead of mapping a duplicate.

    Guards against SQLAlchemy "multiple classes" errors when a models
    module is imported under more than one path.
    """
    return _registered.setdefault(cls.__name__, cls)

# src/database/models.py
@register_model
class YouTubeVideo(Base, TimestampedMixin):
    __tablename__ = "youtube_videos"
    # ... model definition

JSONB Query Examples

Search transcript content

-- Find transcripts containing specific text
SELECT tr.id, mf.filename, tr.content->>'text'
FROM transcription_results tr
JOIN media_files mf ON tr.media_file_id = mf.id  
WHERE tr.content->>'text' ILIKE '%machine learning%';

-- Find high-confidence segments
SELECT tr.id, segment->>'text', segment->>'confidence'
FROM transcription_results tr,
     jsonb_array_elements(tr.content->'segments') segment
WHERE (segment->>'confidence')::float > 0.95;

Query file metadata

-- Find files by format
SELECT filename, file_metadata->>'format', duration
FROM media_files
WHERE file_metadata->>'format' = 'mp4';

-- Files larger than 100MB
SELECT filename, file_size, file_metadata->>'bitrate'
FROM media_files
WHERE file_size > 104857600;

Performance analytics

-- Average processing time by pipeline version
SELECT 
    pipeline_version,
    AVG(processing_time) as avg_time,
    AVG(accuracy) as avg_accuracy,
    COUNT(*) as total_transcripts
FROM transcription_results
GROUP BY pipeline_version;

-- Failed jobs analysis
SELECT 
    DATE(created_at) as date,
    COUNT(*) as failed_jobs,
    COUNT(DISTINCT media_file_id) as unique_files
FROM transcription_jobs
WHERE status = 'failed'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Migrations

Database schema is managed via Alembic migrations:

# Create new migration (add --autogenerate to diff against the models)
uv run alembic revision -m "Add new feature"

# Apply migrations
uv run alembic upgrade head

# Downgrade one revision
uv run alembic downgrade -1

# Show current revision
uv run alembic current

Performance Considerations

Indexes

  • All foreign keys are indexed
  • Status columns are indexed for filtering
  • JSONB columns can be given GIN indexes (e.g. CREATE INDEX ... USING GIN (file_metadata)) for containment and search queries; these are not part of the DDL above

Partitioning (Future)

For high-volume usage, consider partitioning large tables:

  • transcription_results by created_at (monthly partitions)
  • media_files by status and created_at

Connection Pooling

The PostgreSQL connection pool is configured as follows:

  • Pool size: 20 connections
  • Max overflow: 30 connections
  • Pool timeout: 30 seconds

For complete schema updates, see migration files in migrations/versions/.