Database Schema & Migration History

This document tracks all database schema changes, migrations, and evolution for the Trax Media Processing Platform.

Database Overview

  • Database: PostgreSQL with JSONB support
  • ORM: SQLAlchemy 2.0+ with async support
  • Migrations: Alembic for schema versioning
  • Registry Pattern: Prevents SQLAlchemy "multiple classes" errors

Current Schema (Version 2.0.0)

Core Tables

-- YouTube videos metadata
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) UNIQUE NOT NULL,
    title TEXT NOT NULL,
    channel TEXT NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url TEXT NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Media files (downloaded from YouTube or local)
CREATE TABLE media_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_video_id UUID REFERENCES youtube_videos(id),
    file_path TEXT NOT NULL,
    file_name TEXT NOT NULL,
    file_size BIGINT NOT NULL,
    duration_seconds FLOAT,
    format_info JSONB,
    download_status VARCHAR(20) DEFAULT 'pending', -- pending, downloading, completed, failed
    download_error TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Transcription results (Enhanced for v2)
CREATE TABLE transcription_results (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID REFERENCES media_files(id),
    version VARCHAR(10) NOT NULL DEFAULT 'v1', -- v1, v2, v3, v4
    text_content TEXT,
    enhanced_content TEXT,
    segments JSONB, -- Whisper segments with timestamps
    raw_content JSONB, -- Complete Whisper API response
    accuracy FLOAT,
    processing_time_seconds FLOAT,
    quality_warnings TEXT[],
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    
    -- V2-specific columns (nullable for backward compatibility)
    pipeline_version VARCHAR(20), -- v1, v2, v3, v4
    enhanced_content_v2 JSONB, -- Enhanced transcription content
    diarization_content JSONB, -- Speaker diarization data
    merged_content JSONB, -- Merged content from multiple sources
    domain_used VARCHAR(100), -- Domain-specific processing
    accuracy_estimate FLOAT, -- Estimated accuracy for v2
    speaker_count INTEGER, -- Number of speakers detected
    quality_warnings_v2 JSONB, -- Quality warnings and issues
    processing_metadata JSONB -- Additional processing metadata
);
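As an illustration of the `segments` JSONB payload, here is a minimal Whisper-style segment list and a helper that reassembles the full text (field names follow the Whisper response format; the helper is a sketch, not Trax code):

```python
# Example contents of the `segments` JSONB column: Whisper-style segments
# carrying start/end timestamps in seconds plus the recognized text.
segments = [
    {"id": 0, "start": 0.0, "end": 4.2, "text": " Welcome to the show."},
    {"id": 1, "start": 4.2, "end": 9.8, "text": " Today: schema design."},
]

def full_text(segs: list[dict]) -> str:
    """Rebuild the transcript text by joining segments in time order."""
    ordered = sorted(segs, key=lambda s: s["start"])
    return " ".join(s["text"].strip() for s in ordered)
```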

-- Speaker profiles (NEW in v2)
CREATE TABLE speaker_profiles (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    characteristics JSONB, -- Voice characteristics
    embedding TEXT, -- Speaker embedding (base64 encoded)
    sample_count INTEGER DEFAULT 0, -- Number of samples for this speaker
    user_id INTEGER -- Associated user (optional)
);
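The `embedding` column holds a base64-encoded speaker embedding; a plausible round trip looks like this (packing as little-endian float32 is an assumption — the actual byte layout is not specified in this document):

```python
import base64
import struct

def encode_embedding(vec: list[float]) -> str:
    """Pack floats as little-endian float32, then base64-encode for TEXT storage."""
    raw = struct.pack(f"<{len(vec)}f", *vec)
    return base64.b64encode(raw).decode("ascii")

def decode_embedding(text: str) -> list[float]:
    """Reverse of encode_embedding: base64-decode, then unpack float32 values."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```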

-- V2 Processing jobs (NEW in v2)
CREATE TABLE v2_processing_jobs (
    id SERIAL PRIMARY KEY,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP WITH TIME ZONE,
    transcript_id UUID REFERENCES transcription_results(id) ON DELETE CASCADE,
    job_type VARCHAR(50) NOT NULL, -- enhancement, diarization, etc.
    parameters JSONB, -- Job parameters
    progress FLOAT DEFAULT 0, -- Progress fraction (0.0 to 1.0)
    error_message TEXT, -- Error message if failed
    result_data JSONB -- Job result data
);

-- Batch processing jobs
CREATE TABLE batch_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, running, completed, failed, paused
    total_files INTEGER NOT NULL,
    processed_files INTEGER DEFAULT 0,
    failed_files INTEGER DEFAULT 0,
    worker_count INTEGER DEFAULT 8,
    memory_limit_mb INTEGER DEFAULT 2048,
    cpu_limit_percent INTEGER DEFAULT 90,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Batch processing items
CREATE TABLE batch_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    batch_job_id UUID REFERENCES batch_jobs(id),
    task_type VARCHAR(20) NOT NULL, -- transcribe, enhance, youtube, download, preprocess
    task_data JSONB NOT NULL, -- Task-specific data
    priority INTEGER DEFAULT 0,
    status VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, running, completed, failed, retrying
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    error_message TEXT,
    result_data JSONB, -- Task result data
    processing_time_seconds FLOAT,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Audio processing metadata
CREATE TABLE audio_processing_metadata (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID REFERENCES media_files(id),
    original_format VARCHAR(20),
    original_sample_rate INTEGER,
    original_channels INTEGER,
    processed_format VARCHAR(20) DEFAULT 'wav',
    processed_sample_rate INTEGER DEFAULT 16000,
    processed_channels INTEGER DEFAULT 1,
    preprocessing_time_seconds FLOAT,
    chunk_count INTEGER,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Export history
CREATE TABLE exports (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transcript_id UUID REFERENCES transcription_results(id),
    format VARCHAR(10) NOT NULL, -- 'json', 'txt', 'srt'
    file_path TEXT NOT NULL,
    file_size BIGINT,
    created_at TIMESTAMP DEFAULT NOW()
);

Schema Evolution Timeline

Version 2.0.0 - V2 Schema Migration (2024-12-30)

Status: COMPLETED

New Features:

  • Speaker profiles for speaker diarization and identification
  • V2 processing jobs for individual transcript processing
  • Enhanced transcription results with v2-specific columns
  • Backward compatibility layer for v1 clients
  • Comprehensive data migration utilities

Schema Changes:

-- New tables for v2 features (full definitions in Migration 004 below)
CREATE TABLE speaker_profiles (...);
CREATE TABLE v2_processing_jobs (...);

-- Enhanced transcription_results table with v2 columns
ALTER TABLE transcription_results
ADD COLUMN pipeline_version VARCHAR(20),
ADD COLUMN enhanced_content_v2 JSONB,
ADD COLUMN diarization_content JSONB,
ADD COLUMN merged_content JSONB,
ADD COLUMN domain_used VARCHAR(100),
ADD COLUMN accuracy_estimate FLOAT,
ADD COLUMN speaker_count INTEGER,
ADD COLUMN quality_warnings_v2 JSONB,
ADD COLUMN processing_metadata JSONB;

Key Improvements:

  • Backward Compatibility: All v2 columns are nullable, allowing v1 clients to continue working
  • Speaker Diarization Support: Speaker profiles and diarization content storage
  • Enhanced Processing: Individual processing jobs with progress tracking
  • Domain-Specific Processing: Support for different processing domains
  • Quality Metrics: Enhanced accuracy estimation and quality warnings
  • Metadata Tracking: Comprehensive processing metadata storage
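The compatibility layer can be pictured as a simple projection from a v2 row onto the v1 response shape (a sketch; the function name and row dict are illustrative, the field names come from the schema above):

```python
def to_v1_payload(row: dict) -> dict:
    """Project a v2 transcription_results row onto the v1 fields,
    dropping v2-only columns (all of which are nullable by design)."""
    v1_fields = ("text_content", "enhanced_content", "segments",
                 "accuracy", "processing_time_seconds", "quality_warnings")
    return {field: row.get(field) for field in v1_fields}
```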

Version 0.2.0 - Batch Processing System (2024-12-30)

Status: COMPLETED

New Features:

  • Batch processing jobs and items tracking
  • Task type support (transcribe, enhance, youtube, download, preprocess)
  • Priority-based task processing
  • Retry mechanism with configurable limits
  • Processing time and error tracking
  • Resource monitoring integration
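The retry mechanism implied by the retry_count / max_retries columns can be sketched as follows (the status names come from the schema comments; the function itself is illustrative):

```python
def next_status_after_failure(item: dict) -> str:
    """Decide a failed batch_items row's next status: keep retrying
    while retry_count is below max_retries, then fail permanently."""
    if item["retry_count"] < item["max_retries"]:
        return "retrying"
    return "failed"
```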

Schema Changes:

-- Added batch processing tables
CREATE TABLE batch_jobs (...);
CREATE TABLE batch_items (...);

-- Task-specific columns (task_type, task_data, priority, retry_count,
-- max_retries, result_data) are part of the batch_items definition;
-- see Migration 003 below for the full CREATE statements

Version 0.1.1 - AI Enhancement Service (2024-12-25)

Status: COMPLETED

New Features:

  • Enhanced content storage in transcripts table
  • Quality validation and accuracy tracking
  • Processing time and quality warnings
  • Caching support for enhancement results

Schema Changes:

-- Enhanced transcripts table
ALTER TABLE transcripts ADD COLUMN enhanced_content TEXT;
ALTER TABLE transcripts ADD COLUMN accuracy FLOAT;
ALTER TABLE transcripts ADD COLUMN quality_warnings TEXT[];

Version 0.1.0 - Core Services (2024-12-19)

Status: COMPLETED

Initial Schema:

  • YouTube videos metadata extraction
  • Media files download and storage
  • Basic transcription with Whisper API
  • Audio processing metadata tracking
  • Export functionality

Core Tables Created:

  • youtube_videos - YouTube metadata storage
  • media_files - Downloaded media file tracking
  • transcripts - Transcription results storage
  • audio_processing_metadata - Audio processing details
  • exports - Export history tracking

Migration History

Migration 004 - V2 Schema Migration (2024-12-30)

Status: COMPLETED

Implementation Notes:

  • Test Database Setup: Required explicit creation of trax_test database
  • Foreign Key Dependencies: Tests must create records in dependency order (media_files → transcription_jobs → transcription_results)
  • Schema Validation: All 15 schema tests pass, validating v2 structure
  • Backward Compatibility: v1 clients can continue working with v2 schema
  • Helper Methods: Created _create_test_transcript() helper for test data creation

Lessons for Next Time:

  1. Test Database Isolation: Always create separate test database before running schema tests
  2. Dependency Order: When testing foreign key relationships, create parent records first
  3. Schema Matching: Ensure test expectations match actual database schema (column types, nullability)
  4. Helper Functions: Create reusable test helpers for complex data setup
  5. Migration Testing: Test both upgrade and downgrade paths for migrations

Migration SQL:

-- Create speaker_profiles table
CREATE TABLE speaker_profiles (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    characteristics JSONB,
    embedding TEXT,
    sample_count INTEGER DEFAULT 0,
    user_id INTEGER
);

-- Create v2_processing_jobs table
CREATE TABLE v2_processing_jobs (
    id SERIAL PRIMARY KEY,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP WITH TIME ZONE,
    transcript_id UUID REFERENCES transcription_results(id) ON DELETE CASCADE,
    job_type VARCHAR(50) NOT NULL,
    parameters JSONB,
    progress FLOAT DEFAULT 0,
    error_message TEXT,
    result_data JSONB
);

-- Add v2 columns to transcription_results table
ALTER TABLE transcription_results
ADD COLUMN pipeline_version VARCHAR(20),
ADD COLUMN enhanced_content_v2 JSONB,
ADD COLUMN diarization_content JSONB,
ADD COLUMN merged_content JSONB,
ADD COLUMN domain_used VARCHAR(100),
ADD COLUMN accuracy_estimate FLOAT,
ADD COLUMN speaker_count INTEGER,
ADD COLUMN quality_warnings_v2 JSONB,
ADD COLUMN processing_metadata JSONB;

-- Create indexes for performance
CREATE INDEX ix_speaker_profiles_name ON speaker_profiles(name);
CREATE INDEX ix_speaker_profiles_user_id ON speaker_profiles(user_id);
CREATE INDEX ix_v2_processing_jobs_status ON v2_processing_jobs(status);
CREATE INDEX ix_v2_processing_jobs_transcript_id ON v2_processing_jobs(transcript_id);
CREATE INDEX ix_v2_processing_jobs_job_type ON v2_processing_jobs(job_type);
CREATE INDEX ix_transcription_results_pipeline_version ON transcription_results(pipeline_version);
CREATE INDEX ix_transcription_results_domain_used ON transcription_results(domain_used);
CREATE INDEX ix_transcription_results_speaker_count ON transcription_results(speaker_count);

-- Update existing transcripts to v1
UPDATE transcription_results SET pipeline_version = 'v1' WHERE pipeline_version IS NULL;

Migration 003 - Batch Processing Support (2024-12-30)

-- Add batch processing tables
CREATE TABLE batch_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    total_files INTEGER NOT NULL,
    processed_files INTEGER DEFAULT 0,
    failed_files INTEGER DEFAULT 0,
    worker_count INTEGER DEFAULT 8,
    memory_limit_mb INTEGER DEFAULT 2048,
    cpu_limit_percent INTEGER DEFAULT 90,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE batch_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    batch_job_id UUID REFERENCES batch_jobs(id),
    task_type VARCHAR(20) NOT NULL,
    task_data JSONB NOT NULL,
    priority INTEGER DEFAULT 0,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    error_message TEXT,
    result_data JSONB,
    processing_time_seconds FLOAT,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

Migration 002 - AI Enhancement Support (2024-12-25)

-- Add enhancement support to transcripts
ALTER TABLE transcripts ADD COLUMN enhanced_content TEXT;
ALTER TABLE transcripts ADD COLUMN accuracy FLOAT;
ALTER TABLE transcripts ADD COLUMN quality_warnings TEXT[];

Migration 001 - Initial Schema (2024-12-19)

-- Create core tables
CREATE TABLE youtube_videos (...);
CREATE TABLE media_files (...);
CREATE TABLE transcripts (...);
CREATE TABLE audio_processing_metadata (...);
CREATE TABLE exports (...);

Data Relationships

Core Relationships

youtube_videos (1) ←→ (many) media_files
media_files (1) ←→ (many) transcription_results
transcription_results (1) ←→ (many) exports
transcription_results (1) ←→ (many) v2_processing_jobs
batch_jobs (1) ←→ (many) batch_items
media_files (1) ←→ (1) audio_processing_metadata

V2 Relationships

speaker_profiles (many) ←→ (1) users (future)
transcription_results (1) ←→ (many) v2_processing_jobs
transcription_results (1) ←→ (many) speaker_profiles (via diarization_content)

Task Processing Flow

batch_jobs → batch_items → task processing → results
v2_processing_jobs → individual processing → enhanced results

Indexes and Performance

Primary Indexes

  • Core tables use UUID primary keys with a gen_random_uuid() default; the v2 tables (speaker_profiles, v2_processing_jobs) use SERIAL keys
  • Foreign key columns used in joins carry explicit indexes (PostgreSQL does not index foreign keys automatically)
  • JSONB columns support flexible querying; add GIN indexes for frequently filtered JSONB paths

V2 Performance Optimizations

  • Speaker profile name and user_id indexes for quick lookups
  • Processing job status and type indexes for job management
  • Transcript pipeline version and domain indexes for filtering
  • Speaker count index for analytics queries

Performance Guidelines

  • Connection pooling with appropriate timeouts
  • Async/await pattern throughout data access layer
  • Efficient JSONB queries for flexible data storage
  • Proper indexing for common query patterns
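By way of example, connection pooling for an async SQLAlchemy engine might be configured like this (the keys are real SQLAlchemy engine arguments; the values are assumptions to tune per deployment, not Trax's actual configuration):

```python
# Illustrative pool settings for sqlalchemy.ext.asyncio.create_async_engine.
ENGINE_KWARGS = {
    "pool_size": 10,        # steady-state connections held open
    "max_overflow": 5,      # extra connections allowed under burst load
    "pool_timeout": 30,     # seconds to wait for a free connection
    "pool_recycle": 1800,   # drop connections older than 30 minutes
    "pool_pre_ping": True,  # validate a connection before handing it out
}
# engine = create_async_engine(DATABASE_URL, **ENGINE_KWARGS)
```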

Data Validation

Constraints

  • All timestamps are stored in UTC (v2 tables use TIMESTAMP WITH TIME ZONE; core tables use plain TIMESTAMP)
  • File paths are validated and normalized
  • JSONB data is validated before storage
  • Status values are constrained to valid options

V2 Business Rules

  • Speaker profiles must have unique names per user
  • Processing jobs must have valid status transitions
  • V2 transcripts maintain backward compatibility with v1
  • Speaker count must be non-negative
  • Progress values must be between 0.0 and 1.0
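The last two rules can be checked with a few lines of validation code (the allowed-transition map is an assumption inferred from the status values used elsewhere in the schema):

```python
# Assumed status machine for processing job rows; adjust to the real rules.
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "failed": {"retrying"},
    "retrying": {"running"},
    "completed": set(),
}

def valid_transition(old: str, new: str) -> bool:
    """True if a job may move from status `old` to status `new`."""
    return new in ALLOWED_TRANSITIONS.get(old, set())

def valid_progress(p: float) -> bool:
    """Progress is a fraction, 0.0 through 1.0 inclusive."""
    return 0.0 <= p <= 1.0
```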

Backup and Recovery

Backup Strategy

  • Regular PostgreSQL backups
  • JSONB data integrity checks
  • Migration rollback procedures
  • Data export capabilities

V2 Recovery Procedures

  • Point-in-time recovery support
  • Migration rollback scripts with data preservation
  • Data validation and repair tools
  • Integrity check procedures
  • Backward compatibility verification

Future Schema Plans

Version 2.1.0 - Advanced Speaker Features

  • Speaker clustering and identification
  • Voice fingerprinting
  • Speaker confidence scoring
  • Multi-language speaker support

Version 2.2.0 - Enhanced Processing

  • Real-time processing capabilities
  • Advanced quality metrics
  • Processing pipeline optimization
  • Performance monitoring

Version 3.0.0 - Enterprise Features

  • Multi-tenant support
  • Advanced analytics and reporting
  • API rate limiting and quotas
  • Enterprise integration features