trax/DB-SCHEMA.md

# Database Schema & Migration History

This document tracks all database schema changes, migrations, and evolution for the Trax Media Processing Platform.

## Database Overview

- **Database**: PostgreSQL with JSONB support
- **ORM**: SQLAlchemy 2.0+ with async support
- **Migrations**: Alembic for schema versioning
- **Registry Pattern**: Prevents SQLAlchemy "multiple classes" errors

## Current Schema (Version 2.0.0)

### Core Tables

```sql
-- YouTube videos metadata
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) UNIQUE NOT NULL,
    title TEXT NOT NULL,
    channel TEXT NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url TEXT NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Media files (downloaded from YouTube or local)
CREATE TABLE media_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_video_id UUID REFERENCES youtube_videos(id),
    file_path TEXT NOT NULL,
    file_name TEXT NOT NULL,
    file_size BIGINT NOT NULL,
    duration_seconds FLOAT,
    format_info JSONB,
    download_status VARCHAR(20) DEFAULT 'pending', -- pending, downloading, completed, failed
    download_error TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Transcription results (Enhanced for v2)
CREATE TABLE transcription_results (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID REFERENCES media_files(id),
    version VARCHAR(10) NOT NULL DEFAULT 'v1', -- v1, v2, v3, v4
    text_content TEXT,
    enhanced_content TEXT,
    segments JSONB, -- Whisper segments with timestamps
    raw_content JSONB, -- Complete Whisper API response
    accuracy FLOAT,
    processing_time_seconds FLOAT,
    quality_warnings TEXT[],
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),

    -- V2-specific columns (nullable for backward compatibility)
    pipeline_version VARCHAR(20), -- v1, v2, v3, v4
    enhanced_content_v2 JSONB, -- Enhanced transcription content
    diarization_content JSONB, -- Speaker diarization data
    merged_content JSONB, -- Merged content from multiple sources
    domain_used VARCHAR(100), -- Domain-specific processing
    accuracy_estimate FLOAT, -- Estimated accuracy for v2
    speaker_count INTEGER, -- Number of speakers detected
    quality_warnings_v2 JSONB, -- Quality warnings and issues
    processing_metadata JSONB -- Additional processing metadata
);

-- Speaker profiles (NEW in v2)
CREATE TABLE speaker_profiles (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    characteristics JSONB, -- Voice characteristics
    embedding TEXT, -- Speaker embedding (base64 encoded)
    sample_count INTEGER DEFAULT 0, -- Number of samples for this speaker
    user_id INTEGER -- Associated user (optional)
);

-- V2 Processing jobs (NEW in v2)
CREATE TABLE v2_processing_jobs (
    id SERIAL PRIMARY KEY,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP WITH TIME ZONE,
    transcript_id UUID REFERENCES transcription_results(id) ON DELETE CASCADE,
    job_type VARCHAR(50) NOT NULL, -- enhancement, diarization, etc.
    parameters JSONB, -- Job parameters
    progress FLOAT DEFAULT 0, -- Progress percentage (0.0 to 1.0)
    error_message TEXT, -- Error message if failed
    result_data JSONB -- Job result data
);

-- Batch processing jobs
CREATE TABLE batch_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, running, completed, failed, paused
    total_files INTEGER NOT NULL,
    processed_files INTEGER DEFAULT 0,
    failed_files INTEGER DEFAULT 0,
    worker_count INTEGER DEFAULT 8,
    memory_limit_mb INTEGER DEFAULT 2048,
    cpu_limit_percent INTEGER DEFAULT 90,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Batch processing items
CREATE TABLE batch_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    batch_job_id UUID REFERENCES batch_jobs(id),
    task_type VARCHAR(20) NOT NULL, -- transcribe, enhance, youtube, download, preprocess
    task_data JSONB NOT NULL, -- Task-specific data
    priority INTEGER DEFAULT 0,
    status VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, running, completed, failed, retrying
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    error_message TEXT,
    result_data JSONB, -- Task result data
    processing_time_seconds FLOAT,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Audio processing metadata
CREATE TABLE audio_processing_metadata (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID REFERENCES media_files(id),
    original_format VARCHAR(20),
    original_sample_rate INTEGER,
    original_channels INTEGER,
    processed_format VARCHAR(20) DEFAULT 'wav',
    processed_sample_rate INTEGER DEFAULT 16000,
    processed_channels INTEGER DEFAULT 1,
    preprocessing_time_seconds FLOAT,
    chunk_count INTEGER,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Export history
CREATE TABLE exports (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transcript_id UUID REFERENCES transcripts(id),
    format VARCHAR(10) NOT NULL, -- 'json', 'txt', 'srt'
    file_path TEXT NOT NULL,
    file_size BIGINT,
    created_at TIMESTAMP DEFAULT NOW()
);
```

## Schema Evolution Timeline

### Version 2.0.0 - V2 Schema Migration (2024-12-30)
**Status**: ✅ COMPLETED

**New Features**:
- Speaker profiles for speaker diarization and identification
- V2 processing jobs for individual transcript processing
- Enhanced transcription results with v2-specific columns
- Backward compatibility layer for v1 clients
- Comprehensive data migration utilities

**Schema Changes**:
```sql
-- New tables for v2 features
CREATE TABLE speaker_profiles (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    characteristics JSONB,
    embedding TEXT,
    sample_count INTEGER DEFAULT 0,
    user_id INTEGER
);

CREATE TABLE v2_processing_jobs (
    id SERIAL PRIMARY KEY,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP WITH TIME ZONE,
    transcript_id UUID REFERENCES transcription_results(id) ON DELETE CASCADE,
    job_type VARCHAR(50) NOT NULL,
    parameters JSONB,
    progress FLOAT DEFAULT 0,
    error_message TEXT,
    result_data JSONB
);

-- Enhanced transcription_results table with v2 columns
ALTER TABLE transcription_results
ADD COLUMN pipeline_version VARCHAR(20),
ADD COLUMN enhanced_content_v2 JSONB,
ADD COLUMN diarization_content JSONB,
ADD COLUMN merged_content JSONB,
ADD COLUMN domain_used VARCHAR(100),
ADD COLUMN accuracy_estimate FLOAT,
ADD COLUMN speaker_count INTEGER,
ADD COLUMN quality_warnings_v2 JSONB,
ADD COLUMN processing_metadata JSONB;
```

**Key Improvements**:
- **Backward Compatibility**: All v2 columns are nullable, allowing v1 clients to continue working
- **Speaker Diarization Support**: Speaker profiles and diarization content storage
- **Enhanced Processing**: Individual processing jobs with progress tracking
- **Domain-Specific Processing**: Support for different processing domains
- **Quality Metrics**: Enhanced accuracy estimation and quality warnings
- **Metadata Tracking**: Comprehensive processing metadata storage

### Version 0.2.0 - Batch Processing System (2024-12-30)
**Status**: ✅ COMPLETED

**New Features**:
- Batch processing jobs and items tracking
- Task type support (transcribe, enhance, youtube, download, preprocess)
- Priority-based task processing
- Retry mechanism with configurable limits
- Processing time and error tracking
- Resource monitoring integration

**Schema Changes**:
```sql
-- Added batch processing tables
CREATE TABLE batch_jobs (...);
CREATE TABLE batch_items (...);

-- Enhanced existing tables
ALTER TABLE batch_items ADD COLUMN task_type VARCHAR(20) NOT NULL;
ALTER TABLE batch_items ADD COLUMN task_data JSONB NOT NULL;
ALTER TABLE batch_items ADD COLUMN priority INTEGER DEFAULT 0;
ALTER TABLE batch_items ADD COLUMN retry_count INTEGER DEFAULT 0;
ALTER TABLE batch_items ADD COLUMN max_retries INTEGER DEFAULT 3;
ALTER TABLE batch_items ADD COLUMN result_data JSONB;
```

### Version 0.1.1 - AI Enhancement Service (2024-12-25)
**Status**: ✅ COMPLETED

**New Features**:
- Enhanced content storage in transcripts table
- Quality validation and accuracy tracking
- Processing time and quality warnings
- Caching support for enhancement results

**Schema Changes**:
```sql
-- Enhanced transcripts table
ALTER TABLE transcripts ADD COLUMN enhanced_content TEXT;
ALTER TABLE transcripts ADD COLUMN accuracy FLOAT;
ALTER TABLE transcripts ADD COLUMN quality_warnings TEXT[];
```

### Version 0.1.0 - Core Services (2024-12-19)
**Status**: ✅ COMPLETED

**Initial Schema**:
- YouTube videos metadata extraction
- Media files download and storage
- Basic transcription with Whisper API
- Audio processing metadata tracking
- Export functionality

**Core Tables Created**:
- `youtube_videos` - YouTube metadata storage
- `media_files` - Downloaded media file tracking
- `transcripts` - Transcription results storage
- `audio_processing_metadata` - Audio processing details
- `exports` - Export history tracking

## Migration History

### Migration 004 - V2 Schema Migration (2024-12-30)
**Status**: ✅ COMPLETED

**Implementation Notes**:
- **Test Database Setup**: Required explicit creation of `trax_test` database
- **Foreign Key Dependencies**: Tests must create records in dependency order (media_files → transcription_jobs → transcription_results)
- **Schema Validation**: All 15 schema tests pass, validating v2 structure
- **Backward Compatibility**: v1 clients can continue working with v2 schema
- **Helper Methods**: Created `_create_test_transcript()` helper for test data creation

**Lessons for Next Time**:
1. **Test Database Isolation**: Always create separate test database before running schema tests
2. **Dependency Order**: When testing foreign key relationships, create parent records first
3. **Schema Matching**: Ensure test expectations match actual database schema (column types, nullability)
4. **Helper Functions**: Create reusable test helpers for complex data setup
5. **Migration Testing**: Test both upgrade and downgrade paths for migrations

```sql
-- Create speaker_profiles table
CREATE TABLE speaker_profiles (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    characteristics JSONB,
    embedding TEXT,
    sample_count INTEGER DEFAULT 0,
    user_id INTEGER
);

-- Create v2_processing_jobs table
CREATE TABLE v2_processing_jobs (
    id SERIAL PRIMARY KEY,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP WITH TIME ZONE,
    transcript_id UUID REFERENCES transcription_results(id) ON DELETE CASCADE,
    job_type VARCHAR(50) NOT NULL,
    parameters JSONB,
    progress FLOAT DEFAULT 0,
    error_message TEXT,
    result_data JSONB
);

-- Add v2 columns to transcription_results table
ALTER TABLE transcription_results
ADD COLUMN pipeline_version VARCHAR(20),
ADD COLUMN enhanced_content_v2 JSONB,
ADD COLUMN diarization_content JSONB,
ADD COLUMN merged_content JSONB,
ADD COLUMN domain_used VARCHAR(100),
ADD COLUMN accuracy_estimate FLOAT,
ADD COLUMN speaker_count INTEGER,
ADD COLUMN quality_warnings_v2 JSONB,
ADD COLUMN processing_metadata JSONB;

-- Create indexes for performance
CREATE INDEX ix_speaker_profiles_name ON speaker_profiles(name);
CREATE INDEX ix_speaker_profiles_user_id ON speaker_profiles(user_id);
CREATE INDEX ix_v2_processing_jobs_status ON v2_processing_jobs(status);
CREATE INDEX ix_v2_processing_jobs_transcript_id ON v2_processing_jobs(transcript_id);
CREATE INDEX ix_v2_processing_jobs_job_type ON v2_processing_jobs(job_type);
CREATE INDEX ix_transcription_results_pipeline_version ON transcription_results(pipeline_version);
CREATE INDEX ix_transcription_results_domain_used ON transcription_results(domain_used);
CREATE INDEX ix_transcription_results_speaker_count ON transcription_results(speaker_count);

-- Update existing transcripts to v1
UPDATE transcription_results SET pipeline_version = 'v1' WHERE pipeline_version IS NULL;
```

### Migration 003 - Batch Processing Support (2024-12-30)
```sql
-- Add batch processing tables
CREATE TABLE batch_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    total_files INTEGER NOT NULL,
    processed_files INTEGER DEFAULT 0,
    failed_files INTEGER DEFAULT 0,
    worker_count INTEGER DEFAULT 8,
    memory_limit_mb INTEGER DEFAULT 2048,
    cpu_limit_percent INTEGER DEFAULT 90,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE batch_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    batch_job_id UUID REFERENCES batch_jobs(id),
    task_type VARCHAR(20) NOT NULL,
    task_data JSONB NOT NULL,
    priority INTEGER DEFAULT 0,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    error_message TEXT,
    result_data JSONB,
    processing_time_seconds FLOAT,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
```

### Migration 002 - AI Enhancement Support (2024-12-25)
```sql
-- Add enhancement support to transcripts
ALTER TABLE transcripts ADD COLUMN enhanced_content TEXT;
ALTER TABLE transcripts ADD COLUMN accuracy FLOAT;
ALTER TABLE transcripts ADD COLUMN quality_warnings TEXT[];
```

### Migration 001 - Initial Schema (2024-12-19)
```sql
-- Create core tables
CREATE TABLE youtube_videos (...);
CREATE TABLE media_files (...);
CREATE TABLE transcripts (...);
CREATE TABLE audio_processing_metadata (...);
CREATE TABLE exports (...);
```

## Data Relationships

### Core Relationships
```
youtube_videos (1) ←→ (many) media_files
media_files (1) ←→ (many) transcription_results
transcription_results (1) ←→ (many) exports
transcription_results (1) ←→ (many) v2_processing_jobs
batch_jobs (1) ←→ (many) batch_items
media_files (1) ←→ (1) audio_processing_metadata
```

### V2 Relationships
```
speaker_profiles (many) ←→ (1) users (future)
transcription_results (1) ←→ (many) v2_processing_jobs
transcription_results (1) ←→ (many) speaker_profiles (via diarization_content)
```

### Task Processing Flow
```
batch_jobs → batch_items → task processing → results
v2_processing_jobs → individual processing → enhanced results
```

## Indexes and Performance

### Primary Indexes
- All tables have UUID primary keys with `gen_random_uuid()` default
- Foreign key relationships are properly indexed
- JSONB columns support efficient querying

### V2 Performance Optimizations
- Speaker profile name and user_id indexes for quick lookups
- Processing job status and type indexes for job management
- Transcript pipeline version and domain indexes for filtering
- Speaker count index for analytics queries

### Performance Guidelines
- Connection pooling with appropriate timeouts
- Async/await pattern throughout data access layer
- Efficient JSONB queries for flexible data storage
- Proper indexing for common query patterns

## Data Validation

### Constraints
- All timestamps use UTC timezone
- File paths are validated and normalized
- JSONB data is validated before storage
- Status values are constrained to valid options

### V2 Business Rules
- Speaker profiles must have unique names per user
- Processing jobs must have valid status transitions
- V2 transcripts maintain backward compatibility with v1
- Speaker count must be non-negative
- Progress values must be between 0.0 and 1.0

## Backup and Recovery

### Backup Strategy
- Regular PostgreSQL backups
- JSONB data integrity checks
- Migration rollback procedures
- Data export capabilities

### V2 Recovery Procedures
- Point-in-time recovery support
- Migration rollback scripts with data preservation
- Data validation and repair tools
- Integrity check procedures
- Backward compatibility verification

## Future Schema Plans

### Version 2.1.0 - Advanced Speaker Features
- Speaker clustering and identification
- Voice fingerprinting
- Speaker confidence scoring
- Multi-language speaker support

### Version 2.2.0 - Enhanced Processing
- Real-time processing capabilities
- Advanced quality metrics
- Processing pipeline optimization
- Performance monitoring

### Version 3.0.0 - Enterprise Features
- Multi-tenant support
- Advanced analytics and reporting
- API rate limiting and quotas
- Enterprise integration features