# Database Schema Documentation Complete reference for Trax PostgreSQL database schema with JSONB support. ## Overview Trax uses PostgreSQL 15+ with the following key features: - **JSONB columns** for flexible metadata storage - **UUID primary keys** for distributed system compatibility - **Registry pattern** to prevent SQLAlchemy "multiple classes" errors - **Timestamp mixins** for automatic created_at/updated_at tracking - **Version tracking** for iterative pipeline results ## Core Tables ### youtube_videos Stores YouTube video metadata extracted via curl (no API required). ```sql CREATE TABLE youtube_videos ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), youtube_id VARCHAR(20) NOT NULL UNIQUE, title VARCHAR(500) NOT NULL, channel VARCHAR(200) NOT NULL, description TEXT, duration_seconds INTEGER NOT NULL, url VARCHAR(500) NOT NULL, metadata_extracted_at TIMESTAMP DEFAULT NOW(), created_at TIMESTAMP DEFAULT NOW() NOT NULL, updated_at TIMESTAMP DEFAULT NOW() NOT NULL ); CREATE INDEX idx_youtube_videos_youtube_id ON youtube_videos(youtube_id); ``` **Key Features:** - **No API dependency** - Metadata extracted via curl scraping - **Unique constraint** on youtube_id prevents duplicates - **Flexible description** storage for any length content ### media_files Central table for all media files requiring transcription. ```sql CREATE TABLE media_files ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), filename VARCHAR(255) NOT NULL, file_size BIGINT NOT NULL, duration FLOAT, mime_type VARCHAR(100), source_path TEXT NOT NULL, local_path TEXT, file_hash VARCHAR(64) UNIQUE, file_metadata JSONB DEFAULT '{}', status VARCHAR(20) DEFAULT 'pending' NOT NULL, youtube_video_id UUID REFERENCES youtube_videos(id), created_at TIMESTAMP DEFAULT NOW() NOT NULL, updated_at TIMESTAMP DEFAULT NOW() NOT NULL ); CREATE INDEX idx_media_files_status ON media_files(status); CREATE INDEX idx_media_files_hash ON media_files(file_hash); CREATE INDEX idx_media_files_youtube_video_id ON media_files(youtube_video_id); ``` **Status Values:** - `pending` - File identified, not yet processed - `downloading` - Currently downloading from source - `processing` - Audio preprocessing in progress - `ready` - Ready for transcription - `failed` - Processing failed **JSONB file_metadata Example:** ```json { "format": "mp4", "codec": "aac", "bitrate": 128000, "sample_rate": 44100, "channels": 2, "ffmpeg_info": {...} } ``` ### transcription_jobs Tracks individual transcription requests with retry logic. ```sql CREATE TABLE transcription_jobs ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), media_file_id UUID NOT NULL REFERENCES media_files(id), status VARCHAR(20) DEFAULT 'pending' NOT NULL, priority INTEGER DEFAULT 0, model_config JSONB DEFAULT '{}', processing_options JSONB DEFAULT '{}', started_at TIMESTAMP, completed_at TIMESTAMP, processing_time FLOAT, error_message TEXT, retry_count INTEGER DEFAULT 0, max_retries INTEGER DEFAULT 3, created_at TIMESTAMP DEFAULT NOW() NOT NULL, updated_at TIMESTAMP DEFAULT NOW() NOT NULL ); CREATE INDEX idx_transcription_jobs_status ON transcription_jobs(status); CREATE INDEX idx_transcription_jobs_priority ON transcription_jobs(priority); CREATE INDEX idx_transcription_jobs_media_file_id ON transcription_jobs(media_file_id); ``` **Status Values:** - `pending` - Queued for processing - `processing` - Currently being transcribed - `completed` - Successfully completed - `failed` - Failed after max retries **model_config Example:** ```json { "model": "distil-large-v3", "language": "en", "temperature": 0.0, "response_format": "verbose_json" } ``` ### transcription_results Stores actual transcription outputs with versioning support. ```sql CREATE TABLE transcription_results ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), job_id UUID NOT NULL REFERENCES transcription_jobs(id), media_file_id UUID NOT NULL REFERENCES media_files(id), pipeline_version VARCHAR(10) DEFAULT 'v1' NOT NULL, content JSONB NOT NULL, segments JSONB, confidence_scores JSONB, speaker_info JSONB, accuracy FLOAT, word_count INTEGER, processing_time FLOAT, model_used VARCHAR(100), model_config JSONB, parent_result_id UUID REFERENCES transcription_results(id), version INTEGER DEFAULT 1 NOT NULL, created_at TIMESTAMP DEFAULT NOW() NOT NULL, updated_at TIMESTAMP DEFAULT NOW() NOT NULL ); CREATE INDEX idx_transcription_results_pipeline_version ON transcription_results(pipeline_version); CREATE INDEX idx_transcription_results_media_file_id ON transcription_results(media_file_id); CREATE INDEX idx_transcription_results_parent_result_id ON transcription_results(parent_result_id); ``` **Pipeline Versions:** - `v1` - Whisper distil-large-v3 only - `v2` - Whisper + DeepSeek enhancement - `v3` - Multi-pass accuracy optimization - `v4` - Speaker diarization support **content JSONB Example (v1):** ```json { "text": "Complete transcript text here...", "language": "en", "segments": [ { "id": 0, "start": 0.0, "end": 1.5, "text": "Hello world", "confidence": 0.95 } ] } ``` **content JSONB Example (v2 - Enhanced):** ```json { "original_text": "Original Whisper output...", "enhanced_text": "Enhanced and corrected text...", "improvements": [ "Fixed grammar in sentence 3", "Corrected technical terms", "Added punctuation" ], "enhancement_metadata": { "model": "deepseek-chat", "confidence": 0.98, "processing_time_ms": 2500 } } ``` ### processing_jobs Tracks batch processing operations. ```sql CREATE TABLE processing_jobs ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), job_type VARCHAR(50) NOT NULL, status VARCHAR(20) DEFAULT 'pending' NOT NULL, config JSONB DEFAULT '{}', file_patterns JSONB, total_items INTEGER DEFAULT 0, processed_items INTEGER DEFAULT 0, successful_items INTEGER DEFAULT 0, failed_items INTEGER DEFAULT 0, started_at TIMESTAMP, completed_at TIMESTAMP, error_message TEXT, created_at TIMESTAMP DEFAULT NOW() NOT NULL, updated_at TIMESTAMP DEFAULT NOW() NOT NULL ); CREATE INDEX idx_processing_jobs_type ON processing_jobs(job_type); CREATE INDEX idx_processing_jobs_status ON processing_jobs(status); ``` ## Relationships ``` youtube_videos (1) ──→ (0..n) media_files media_files (1) ──→ (0..n) transcription_jobs transcription_jobs (1) ──→ (0..n) transcription_results transcription_results (0..1) ──→ (0..n) transcription_results (parent/child versioning) ``` ## Registry Pattern Implementation Trax uses SQLAlchemy's registry pattern to prevent "multiple classes" errors: ```python # src/database/__init__.py from sqlalchemy.orm import declarative_base, registry mapper_registry = registry() Base = declarative_base(registry=mapper_registry) def register_model(cls): """Register a model class to prevent multiple registration errors.""" return cls # src/database/models.py @register_model class YouTubeVideo(Base, TimestampedMixin): __tablename__ = "youtube_videos" # ... model definition ``` ## JSONB Query Examples ### Search transcript content ```sql -- Find transcripts containing specific text SELECT tr.id, mf.filename, tr.content->>'text' FROM transcription_results tr JOIN media_files mf ON tr.media_file_id = mf.id WHERE tr.content->>'text' ILIKE '%machine learning%'; -- Find high-confidence segments SELECT tr.id, segment->>'text', segment->>'confidence' FROM transcription_results tr, jsonb_array_elements(tr.content->'segments') segment WHERE (segment->>'confidence')::float > 0.95; ``` ### Query file metadata ```sql -- Find files by format SELECT filename, file_metadata->>'format', file_metadata->>'duration' FROM media_files WHERE file_metadata->>'format' = 'mp4'; -- Files larger than 100MB SELECT filename, file_size, file_metadata->>'bitrate' FROM media_files WHERE file_size > 104857600; ``` ### Performance analytics ```sql -- Average processing time by pipeline version SELECT pipeline_version, AVG(processing_time) as avg_time, AVG(accuracy) as avg_accuracy, COUNT(*) as total_transcripts FROM transcription_results GROUP BY pipeline_version; -- Failed jobs analysis SELECT DATE(created_at) as date, COUNT(*) as failed_jobs, COUNT(DISTINCT media_file_id) as unique_files FROM transcription_jobs WHERE status = 'failed' GROUP BY DATE(created_at) ORDER BY date DESC; ``` ## Migrations Database schema is managed via Alembic migrations: ```bash # Create new migration uv run alembic revision -m "Add new feature" # Apply migrations uv run alembic upgrade head # Downgrade one revision uv run alembic downgrade -1 # Show current revision uv run alembic current ``` ## Performance Considerations ### Indexes - All foreign keys are indexed - Status columns are indexed for filtering - JSONB columns use GIN indexes for text search ### Partitioning (Future) For high-volume usage, consider partitioning large tables: - `transcription_results` by `created_at` (monthly partitions) - `media_files` by `status` and `created_at` ### Connection Pooling PostgreSQL connection pooling is configured for optimal performance: - Pool size: 20 connections - Max overflow: 30 connections - Pool timeout: 30 seconds For complete schema updates, see migration files in `migrations/versions/`.