trax/docs/DATABASE.md

# Database Schema Documentation
Complete reference for Trax PostgreSQL database schema with JSONB support.
## Overview
Trax uses PostgreSQL 15+ with the following key features:
- **JSONB columns** for flexible metadata storage
- **UUID primary keys** for distributed system compatibility
- **Registry pattern** to prevent SQLAlchemy "multiple classes" errors
- **Timestamp mixins** for automatic created_at/updated_at tracking
- **Version tracking** for iterative pipeline results
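The timestamp mixin's behaviour can be pictured in plain Python. This is a sketch of the semantics only; the real mixin would declare SQLAlchemy `Column` values with `default`/`onupdate` rather than setting attributes by hand:

```python
from datetime import datetime, timezone

class TimestampedMixin:
    """Plain-Python sketch of created_at/updated_at tracking.

    Stands in for the ORM mixin to show intended semantics; the real
    implementation uses SQLAlchemy Column defaults, not __init__.
    """

    def __init__(self) -> None:
        now = datetime.now(timezone.utc)
        self.created_at = now
        self.updated_at = now

    def touch(self) -> None:
        # Mirrors the ON UPDATE behaviour: only updated_at moves.
        self.updated_at = datetime.now(timezone.utc)
```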
## Core Tables
### youtube_videos
Stores YouTube video metadata extracted via curl (no API required).
```sql
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_youtube_videos_youtube_id ON youtube_videos(youtube_id);
```
**Key Features:**
- **No API dependency** - Metadata extracted via curl scraping
- **Unique constraint** on youtube_id prevents duplicates
- **Flexible description** - `TEXT` column stores descriptions of any length
### media_files
Central table for all media files requiring transcription.
```sql
CREATE TABLE media_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(255) NOT NULL,
    file_size BIGINT NOT NULL,
    duration FLOAT,
    mime_type VARCHAR(100),
    source_path TEXT NOT NULL,
    local_path TEXT,
    file_hash VARCHAR(64) UNIQUE,
    file_metadata JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    youtube_video_id UUID REFERENCES youtube_videos(id),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_media_files_status ON media_files(status);
CREATE INDEX idx_media_files_hash ON media_files(file_hash);
CREATE INDEX idx_media_files_youtube_video_id ON media_files(youtube_video_id);
```
**Status Values:**
- `pending` - File identified, not yet processed
- `downloading` - Currently downloading from source
- `processing` - Audio preprocessing in progress
- `ready` - Ready for transcription
- `failed` - Processing failed
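These states imply an ordering. A hedged sketch of one plausible transition map follows; the allowed edges are an assumption for illustration, not taken from the Trax source:

```python
# Assumed transition map for media_files.status; edges are illustrative.
VALID_TRANSITIONS = {
    "pending": {"downloading", "processing", "failed"},
    "downloading": {"processing", "failed"},
    "processing": {"ready", "failed"},
    "ready": set(),   # terminal: handed off to transcription
    "failed": set(),  # terminal: requires manual intervention
}

def can_transition(current: str, target: str) -> bool:
    """Return True if a status change is allowed under the map above."""
    return target in VALID_TRANSITIONS.get(current, set())
```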
**JSONB file_metadata Example:**
```json
{
  "format": "mp4",
  "codec": "aac",
  "bitrate": 128000,
  "sample_rate": 44100,
  "channels": 2,
  "ffmpeg_info": {...}
}
```
### transcription_jobs
Tracks individual transcription requests with retry logic.
```sql
CREATE TABLE transcription_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    priority INTEGER DEFAULT 0,
    model_config JSONB DEFAULT '{}',
    processing_options JSONB DEFAULT '{}',
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    processing_time FLOAT,
    error_message TEXT,
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_jobs_status ON transcription_jobs(status);
CREATE INDEX idx_transcription_jobs_priority ON transcription_jobs(priority);
CREATE INDEX idx_transcription_jobs_media_file_id ON transcription_jobs(media_file_id);
```
**Status Values:**
- `pending` - Queued for processing
- `processing` - Currently being transcribed
- `completed` - Successfully completed
- `failed` - Failed after max retries
**model_config Example:**
```json
{
  "model": "distil-large-v3",
  "language": "en",
  "temperature": 0.0,
  "response_format": "verbose_json"
}
```
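The `retry_count`/`max_retries` columns support a simple requeue policy. A sketch of the decision, with an illustrative function name (not from the Trax source); the default mirrors the schema's `max_retries`:

```python
def next_status_after_error(retry_count: int, max_retries: int = 3) -> str:
    """Decide a job's status after a failed attempt.

    While retries remain, the job is re-queued as 'pending'; once
    retry_count reaches max_retries it is marked 'failed', matching
    the status list above.
    """
    return "pending" if retry_count < max_retries else "failed"
```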
### transcription_results
Stores actual transcription outputs with versioning support.
```sql
CREATE TABLE transcription_results (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES transcription_jobs(id),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    pipeline_version VARCHAR(10) DEFAULT 'v1' NOT NULL,
    content JSONB NOT NULL,
    segments JSONB,
    confidence_scores JSONB,
    speaker_info JSONB,
    accuracy FLOAT,
    word_count INTEGER,
    processing_time FLOAT,
    model_used VARCHAR(100),
    model_config JSONB,
    parent_result_id UUID REFERENCES transcription_results(id),
    version INTEGER DEFAULT 1 NOT NULL,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_results_pipeline_version ON transcription_results(pipeline_version);
CREATE INDEX idx_transcription_results_media_file_id ON transcription_results(media_file_id);
CREATE INDEX idx_transcription_results_parent_result_id ON transcription_results(parent_result_id);
```
**Pipeline Versions:**
- `v1` - Whisper distil-large-v3 only
- `v2` - Whisper + DeepSeek enhancement
- `v3` - Multi-pass accuracy optimization
- `v4` - Speaker diarization support
**content JSONB Example (v1):**
```json
{
  "text": "Complete transcript text here...",
  "language": "en",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 1.5,
      "text": "Hello world",
      "confidence": 0.95
    }
  ]
}
```
**content JSONB Example (v2 - Enhanced):**
```json
{
  "original_text": "Original Whisper output...",
  "enhanced_text": "Enhanced and corrected text...",
  "improvements": [
    "Fixed grammar in sentence 3",
    "Corrected technical terms",
    "Added punctuation"
  ],
  "enhancement_metadata": {
    "model": "deepseek-chat",
    "confidence": 0.98,
    "processing_time_ms": 2500
  }
}
```
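`parent_result_id` and `version` chain re-processed results back to the original. A sketch of walking that chain, with plain dicts standing in for ORM rows:

```python
def version_chain(results_by_id: dict, leaf_id) -> list:
    """Return result rows from the original (version 1) down to `leaf_id`.

    `results_by_id` maps result id -> dict with a 'parent_result_id' key;
    plain dicts stand in for ORM rows in this sketch.
    """
    chain = []
    current = results_by_id.get(leaf_id)
    while current is not None:
        chain.append(current)
        current = results_by_id.get(current.get("parent_result_id"))
    return list(reversed(chain))  # oldest first
```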
### processing_jobs
Tracks batch processing operations.
```sql
CREATE TABLE processing_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    config JSONB DEFAULT '{}',
    file_patterns JSONB,
    total_items INTEGER DEFAULT 0,
    processed_items INTEGER DEFAULT 0,
    successful_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_processing_jobs_type ON processing_jobs(job_type);
CREATE INDEX idx_processing_jobs_status ON processing_jobs(status);
```
## Relationships
```
youtube_videos (1) ──→ (0..n) media_files
media_files (1) ──→ (0..n) transcription_jobs
transcription_jobs (1) ──→ (0..n) transcription_results
transcription_results (0..1) ──→ (0..n) transcription_results (parent/child versioning)
```
## Registry Pattern Implementation
Trax uses SQLAlchemy's registry pattern to prevent "multiple classes" errors:
```python
# src/database/__init__.py
from sqlalchemy.orm import declarative_base, registry

mapper_registry = registry()
Base = declarative_base(registry=mapper_registry)

def register_model(cls):
    """Register a model class to prevent multiple registration errors."""
    return cls

# src/database/models.py
@register_model
class YouTubeVideo(Base, TimestampedMixin):
    __tablename__ = "youtube_videos"
    # ... model definition
```
## JSONB Query Examples
### Search transcript content
```sql
-- Find transcripts containing specific text
SELECT tr.id, mf.filename, tr.content->>'text'
FROM transcription_results tr
JOIN media_files mf ON tr.media_file_id = mf.id
WHERE tr.content->>'text' ILIKE '%machine learning%';

-- Find high-confidence segments
SELECT tr.id, segment->>'text', segment->>'confidence'
FROM transcription_results tr,
     jsonb_array_elements(tr.content->'segments') segment
WHERE (segment->>'confidence')::float > 0.95;
```
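The segment filter above can also run client-side once a `content` document has been fetched. A stdlib-only sketch:

```python
import json

def high_confidence_segments(content_json: str, threshold: float = 0.95) -> list:
    """Mirror of the jsonb_array_elements query: keep segments whose
    confidence exceeds `threshold`. Input is the raw JSONB content string."""
    content = json.loads(content_json)
    return [
        seg for seg in content.get("segments", [])
        if float(seg.get("confidence", 0.0)) > threshold
    ]
```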
### Query file metadata
```sql
-- Find files by format
SELECT filename, file_metadata->>'format', duration
FROM media_files
WHERE file_metadata->>'format' = 'mp4';

-- Files larger than 100 MiB
SELECT filename, file_size, file_metadata->>'bitrate'
FROM media_files
WHERE file_size > 104857600;
```
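The `104857600` literal is 100 MiB. A tiny helper (name is illustrative, not from the Trax source) keeps such thresholds readable:

```python
def mib(n: int) -> int:
    """Convert mebibytes to bytes; mib(100) matches the literal above."""
    return n * 1024 * 1024
```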
### Performance analytics
```sql
-- Average processing time by pipeline version
SELECT
    pipeline_version,
    AVG(processing_time) AS avg_time,
    AVG(accuracy) AS avg_accuracy,
    COUNT(*) AS total_transcripts
FROM transcription_results
GROUP BY pipeline_version;

-- Failed jobs analysis
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS failed_jobs,
    COUNT(DISTINCT media_file_id) AS unique_files
FROM transcription_jobs
WHERE status = 'failed'
GROUP BY DATE(created_at)
ORDER BY date DESC;
```
## Migrations
Database schema is managed via Alembic migrations:
```bash
# Create new migration
uv run alembic revision -m "Add new feature"

# Apply migrations
uv run alembic upgrade head

# Downgrade one revision
uv run alembic downgrade -1

# Show current revision
uv run alembic current
```
## Performance Considerations
### Indexes
- All foreign keys are indexed
- Status columns are indexed for filtering
- JSONB columns can be given GIN indexes for containment and text-search queries (not part of the base schema above; add per workload)
### Partitioning (Future)
For high-volume usage, consider partitioning large tables:
- `transcription_results` by `created_at` (monthly partitions)
- `media_files` by `status` and `created_at`
### Connection Pooling
PostgreSQL connection pooling is configured as follows:
- Pool size: 20 connections
- Max overflow: 30 connections
- Pool timeout: 30 seconds
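A sketch of how these pool figures might be passed to SQLAlchemy. The DSN is a placeholder and the psycopg driver is an assumption, not the project's actual configuration:

```python
from sqlalchemy import create_engine

# Placeholder DSN; pool settings mirror the figures above.
engine = create_engine(
    "postgresql+psycopg://user:password@localhost:5432/trax",
    pool_size=20,      # baseline connections held open
    max_overflow=30,   # extra connections allowed under load
    pool_timeout=30,   # seconds to wait for a free connection
)
```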
For complete schema updates, see migration files in `migrations/versions/`.