
Database Schema Documentation

Complete reference for Trax PostgreSQL database schema with JSONB support.

Overview

Trax uses PostgreSQL 15+ with the following key features:

  • JSONB columns for flexible metadata storage
  • UUID primary keys for distributed system compatibility
  • Registry pattern to prevent SQLAlchemy "multiple classes" errors
  • Timestamp mixins for automatic created_at/updated_at tracking
  • Version tracking for iterative pipeline results

Core Tables

youtube_videos

Stores YouTube video metadata extracted via curl (no API required).

CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_youtube_videos_youtube_id ON youtube_videos(youtube_id);

Key Features:

  • No API dependency - Metadata extracted via curl scraping
  • Unique constraint on youtube_id prevents duplicates
  • TEXT description column stores content of any length

media_files

Central table for all media files requiring transcription.

CREATE TABLE media_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(255) NOT NULL,
    file_size BIGINT NOT NULL,
    duration FLOAT,
    mime_type VARCHAR(100),
    source_path TEXT NOT NULL,
    local_path TEXT,
    file_hash VARCHAR(64) UNIQUE,
    file_metadata JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    youtube_video_id UUID REFERENCES youtube_videos(id),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_media_files_status ON media_files(status);
CREATE INDEX idx_media_files_hash ON media_files(file_hash);
CREATE INDEX idx_media_files_youtube_video_id ON media_files(youtube_video_id);

Status Values:

  • pending - File identified, not yet processed
  • downloading - Currently downloading from source
  • processing - Audio preprocessing in progress
  • ready - Ready for transcription
  • failed - Processing failed
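The lifecycle above can be sketched as a transition map. This helper is hypothetical (not part of the Trax codebase), and the allowed transitions are an assumption inferred from the documented status values:

```python
# Assumed status lifecycle for media_files; the transition map is inferred
# from the documented status values, not taken from the Trax source.
ALLOWED_TRANSITIONS = {
    "pending": {"downloading", "processing", "failed"},
    "downloading": {"processing", "failed"},
    "processing": {"ready", "failed"},
    "ready": set(),            # terminal: hand-off to transcription
    "failed": {"pending"},     # assumption: failed files may be re-queued
}

def can_transition(current: str, new: str) -> bool:
    """Return True if a media file may move from `current` to `new` status."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

A worker would check `can_transition(row.status, "processing")` before claiming a file.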

JSONB file_metadata Example:

{
    "format": "mp4",
    "codec": "aac",
    "bitrate": 128000,
    "sample_rate": 44100,
    "channels": 2,
    "ffmpeg_info": {...}
}

transcription_jobs

Tracks individual transcription requests with retry logic.

CREATE TABLE transcription_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    priority INTEGER DEFAULT 0,
    model_config JSONB DEFAULT '{}',
    processing_options JSONB DEFAULT '{}',
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    processing_time FLOAT,
    error_message TEXT,
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_jobs_status ON transcription_jobs(status);
CREATE INDEX idx_transcription_jobs_priority ON transcription_jobs(priority);
CREATE INDEX idx_transcription_jobs_media_file_id ON transcription_jobs(media_file_id);

Status Values:

  • pending - Queued for processing
  • processing - Currently being transcribed
  • completed - Successfully completed
  • failed - Failed after max retries
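The `retry_count`/`max_retries` columns support retry logic along these lines. This is a hypothetical sketch: the `max_retries` default matches the schema, but the exponential backoff is an assumption, not documented behaviour:

```python
def should_retry(retry_count: int, max_retries: int = 3) -> bool:
    """True while the job has retries left (schema default: max_retries = 3)."""
    return retry_count < max_retries

def backoff_seconds(retry_count: int, base: float = 2.0) -> float:
    """Delay before the next attempt. Exponential backoff is an assumption."""
    return base ** retry_count
```

A job moves to `failed` only once `should_retry` returns False.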

model_config Example:

{
    "model": "distil-large-v3",
    "language": "en",
    "temperature": 0.0,
    "response_format": "verbose_json"
}
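Since `model_config` defaults to `'{}'`, per-job options are naturally merged over application defaults. A minimal sketch (the helper name and defaults dict are assumptions; the default values mirror the example above):

```python
# Defaults mirror the model_config example above; the merge helper itself
# is hypothetical, not part of the documented Trax API.
DEFAULT_MODEL_CONFIG = {
    "model": "distil-large-v3",
    "language": "en",
    "temperature": 0.0,
    "response_format": "verbose_json",
}

def resolve_model_config(overrides=None):
    """Merge per-job overrides over the defaults; overrides win."""
    return {**DEFAULT_MODEL_CONFIG, **(overrides or {})}
```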

transcription_results

Stores actual transcription outputs with versioning support.

CREATE TABLE transcription_results (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES transcription_jobs(id),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    pipeline_version VARCHAR(10) DEFAULT 'v1' NOT NULL,
    content JSONB NOT NULL,
    segments JSONB,
    confidence_scores JSONB,
    speaker_info JSONB,
    accuracy FLOAT,
    word_count INTEGER,
    processing_time FLOAT,
    model_used VARCHAR(100),
    model_config JSONB,
    parent_result_id UUID REFERENCES transcription_results(id),
    version INTEGER DEFAULT 1 NOT NULL,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_results_pipeline_version ON transcription_results(pipeline_version);
CREATE INDEX idx_transcription_results_media_file_id ON transcription_results(media_file_id);
CREATE INDEX idx_transcription_results_parent_result_id ON transcription_results(parent_result_id);

Pipeline Versions:

  • v1 - Whisper distil-large-v3 only
  • v2 - Whisper + DeepSeek enhancement
  • v3 - Multi-pass accuracy optimization
  • v4 - Speaker diarization support

content JSONB Example (v1):

{
    "text": "Complete transcript text here...",
    "language": "en",
    "segments": [
        {
            "id": 0,
            "start": 0.0,
            "end": 1.5,
            "text": "Hello world",
            "confidence": 0.95
        }
    ]
}
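The denormalized `word_count` and `accuracy` columns can be derived from a v1 content document like the one above. Averaging segment confidences into `accuracy` is an assumption for illustration, not a documented formula:

```python
def summarize_content(content: dict) -> dict:
    """Derive word_count and accuracy from a v1 content document.

    Treating accuracy as the mean segment confidence is an assumption,
    not the documented Trax computation.
    """
    segments = content.get("segments", [])
    confidences = [s["confidence"] for s in segments if "confidence" in s]
    return {
        "word_count": len(content.get("text", "").split()),
        "accuracy": sum(confidences) / len(confidences) if confidences else None,
    }
```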

content JSONB Example (v2 - Enhanced):

{
    "original_text": "Original Whisper output...",
    "enhanced_text": "Enhanced and corrected text...",
    "improvements": [
        "Fixed grammar in sentence 3",
        "Corrected technical terms",
        "Added punctuation"
    ],
    "enhancement_metadata": {
        "model": "deepseek-chat",
        "confidence": 0.98,
        "processing_time_ms": 2500
    }
}

processing_jobs

Tracks batch processing operations.

CREATE TABLE processing_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    config JSONB DEFAULT '{}',
    file_patterns JSONB,
    total_items INTEGER DEFAULT 0,
    processed_items INTEGER DEFAULT 0,
    successful_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_processing_jobs_type ON processing_jobs(job_type);
CREATE INDEX idx_processing_jobs_status ON processing_jobs(status);

Relationships

youtube_videos (1) ──→ (0..n) media_files
media_files (1) ──→ (0..n) transcription_jobs  
transcription_jobs (1) ──→ (0..n) transcription_results
transcription_results (0..1) ──→ (0..n) transcription_results (parent/child versioning)
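The self-referencing `parent_result_id` column forms a version chain that can be walked back to the root result. A minimal sketch over in-memory rows (the helper is hypothetical; `results_by_id` stands in for rows fetched from `transcription_results`):

```python
def version_chain(results_by_id: dict, result_id):
    """Walk parent_result_id links from a result back to its root version.

    results_by_id maps id -> {"parent_result_id": ...}; a stand-in for
    rows fetched from transcription_results. Returns ids newest-first.
    """
    chain = []
    current = result_id
    while current is not None:
        chain.append(current)
        current = results_by_id[current]["parent_result_id"]
    return chain
```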

Registry Pattern Implementation

Trax uses SQLAlchemy's registry pattern to prevent "multiple classes" errors:

# src/database/__init__.py
from sqlalchemy.orm import registry

mapper_registry = registry()
Base = mapper_registry.generate_base()

_registered = {}

def register_model(cls):
    """Return the previously registered class instead of mapping a duplicate.

    Guards against SQLAlchemy "multiple classes" errors when a models
    module is imported under more than one path.
    """
    return _registered.setdefault(cls.__name__, cls)

# src/database/models.py
@register_model
class YouTubeVideo(Base, TimestampedMixin):
    __tablename__ = "youtube_videos"
    # ... model definition

JSONB Query Examples

Search transcript content

-- Find transcripts containing specific text
SELECT tr.id, mf.filename, tr.content->>'text'
FROM transcription_results tr
JOIN media_files mf ON tr.media_file_id = mf.id  
WHERE tr.content->>'text' ILIKE '%machine learning%';

-- Find high-confidence segments
SELECT tr.id, segment->>'text', segment->>'confidence'
FROM transcription_results tr,
     jsonb_array_elements(tr.content->'segments') segment
WHERE (segment->>'confidence')::float > 0.95;

Query file metadata

-- Find files by format
SELECT filename, file_metadata->>'format', duration
FROM media_files
WHERE file_metadata->>'format' = 'mp4';

-- Files larger than 100MB
SELECT filename, file_size, file_metadata->>'bitrate'
FROM media_files
WHERE file_size > 104857600;

Performance analytics

-- Average processing time by pipeline version
SELECT 
    pipeline_version,
    AVG(processing_time) as avg_time,
    AVG(accuracy) as avg_accuracy,
    COUNT(*) as total_transcripts
FROM transcription_results
GROUP BY pipeline_version;

-- Failed jobs analysis
SELECT 
    DATE(created_at) as date,
    COUNT(*) as failed_jobs,
    COUNT(DISTINCT media_file_id) as unique_files
FROM transcription_jobs
WHERE status = 'failed'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Migrations

Database schema is managed via Alembic migrations:

# Create new migration (add --autogenerate to diff against the models)
uv run alembic revision -m "Add new feature"

# Apply migrations
uv run alembic upgrade head

# Downgrade one revision
uv run alembic downgrade -1

# Show current revision
uv run alembic current

Performance Considerations

Indexes

  • All foreign keys are indexed
  • Status columns are indexed for filtering
  • JSONB columns can be given GIN indexes (e.g. CREATE INDEX ... USING GIN (file_metadata)) for containment and search queries; these are not part of the DDL above

Partitioning (Future)

For high-volume usage, consider partitioning large tables:

  • transcription_results by created_at (monthly partitions)
  • media_files by status and created_at

Connection Pooling

The PostgreSQL connection pool is configured as follows:

  • Pool size: 20 connections
  • Max overflow: 30 connections
  • Pool timeout: 30 seconds

For complete schema updates, see migration files in migrations/versions/.