# Database Schema Documentation

Complete reference for the Trax PostgreSQL database schema with JSONB support.

## Overview

Trax uses PostgreSQL 15+ with the following key features:

- **JSONB columns** for flexible metadata storage
- **UUID primary keys** for distributed-system compatibility
- **Registry pattern** to prevent SQLAlchemy "multiple classes" errors
- **Timestamp mixins** for automatic created_at/updated_at tracking (a database-level alternative is sketched below)
- **Version tracking** for iterative pipeline results
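
The mixins refresh `updated_at` in the ORM layer; at the SQL level, the `DEFAULT NOW()` clauses in the tables below fire only on INSERT. If the timestamp should also be maintained inside the database, a trigger is one option. A minimal sketch, with illustrative names that are not part of the shipped schema:

```sql
-- Illustrative sketch: keep updated_at current on every UPDATE,
-- independent of the application layer.
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS trigger AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_media_files_updated_at
    BEFORE UPDATE ON media_files
    FOR EACH ROW
    EXECUTE FUNCTION set_updated_at();
```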

## Core Tables

### youtube_videos

Stores YouTube video metadata extracted via curl (no API required).

```sql
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    youtube_id VARCHAR(20) NOT NULL UNIQUE,
    title VARCHAR(500) NOT NULL,
    channel VARCHAR(200) NOT NULL,
    description TEXT,
    duration_seconds INTEGER NOT NULL,
    url VARCHAR(500) NOT NULL,
    metadata_extracted_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_youtube_videos_youtube_id ON youtube_videos(youtube_id);
```

**Key Features:**

- **No API dependency** - metadata is extracted via curl scraping
- **Unique constraint** on youtube_id prevents duplicates (see the upsert sketch below)
- **Flexible description** - the TEXT column stores descriptions of any length
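
The unique constraint also makes ingestion idempotent when paired with an upsert. A hedged sketch; the values are placeholders and the column list is trimmed to the NOT NULL fields:

```sql
-- Sketch: idempotent ingest keyed on the youtube_id unique constraint.
INSERT INTO youtube_videos (youtube_id, title, channel, duration_seconds, url)
VALUES ('dQw4w9WgXcQ', 'Example title', 'Example channel', 212,
        'https://www.youtube.com/watch?v=dQw4w9WgXcQ')
ON CONFLICT (youtube_id) DO UPDATE
SET title = EXCLUDED.title,
    channel = EXCLUDED.channel,
    updated_at = NOW();
```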

### media_files

Central table for all media files requiring transcription.

```sql
CREATE TABLE media_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(255) NOT NULL,
    file_size BIGINT NOT NULL,
    duration FLOAT,
    mime_type VARCHAR(100),
    source_path TEXT NOT NULL,
    local_path TEXT,
    file_hash VARCHAR(64) UNIQUE,
    file_metadata JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    youtube_video_id UUID REFERENCES youtube_videos(id),
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_media_files_status ON media_files(status);
CREATE INDEX idx_media_files_hash ON media_files(file_hash);
CREATE INDEX idx_media_files_youtube_video_id ON media_files(youtube_video_id);
```

**Status Values:**

- `pending` - File identified, not yet processed (see the transition sketch below)
- `downloading` - Currently downloading from source
- `processing` - Audio preprocessing in progress
- `ready` - Ready for transcription
- `failed` - Processing failed
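
Because the status column drives the pipeline, transitions should be guarded so two workers cannot claim the same file. A minimal compare-and-set sketch (the UUID is a placeholder):

```sql
-- Sketch: only succeeds if the file is still pending.
UPDATE media_files
SET status = 'downloading', updated_at = NOW()
WHERE id = '00000000-0000-0000-0000-000000000000'
  AND status = 'pending'
RETURNING id;
```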

**JSONB file_metadata Example:**

```json
{
  "format": "mp4",
  "codec": "aac",
  "bitrate": 128000,
  "sample_rate": 44100,
  "channels": 2,
  "ffmpeg_info": {...}
}
```
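
Individual keys can be filtered with JSONB containment, which stays index-friendly (see the GIN note under Performance Considerations). A sketch:

```sql
-- Sketch: find stereo MP4 files via JSONB containment.
SELECT filename
FROM media_files
WHERE file_metadata @> '{"format": "mp4", "channels": 2}';
```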

### transcription_jobs

Tracks individual transcription requests with retry logic.

```sql
CREATE TABLE transcription_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    priority INTEGER DEFAULT 0,
    model_config JSONB DEFAULT '{}',
    processing_options JSONB DEFAULT '{}',
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    processing_time FLOAT,
    error_message TEXT,
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_jobs_status ON transcription_jobs(status);
CREATE INDEX idx_transcription_jobs_priority ON transcription_jobs(priority);
CREATE INDEX idx_transcription_jobs_media_file_id ON transcription_jobs(media_file_id);
```

**Status Values:**

- `pending` - Queued for processing (claimed as sketched below)
- `processing` - Currently being transcribed
- `completed` - Successfully completed
- `failed` - Failed after max retries
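
Workers need to take the highest-priority pending job without racing each other; `FOR UPDATE SKIP LOCKED` is the standard pattern for this. A sketch, under the assumption that `retry_count` is incremented when a job is claimed:

```sql
-- Sketch: atomically claim the next runnable job.
UPDATE transcription_jobs
SET status = 'processing', started_at = NOW(), retry_count = retry_count + 1
WHERE id = (
    SELECT id
    FROM transcription_jobs
    WHERE status = 'pending'
      AND retry_count < max_retries
    ORDER BY priority DESC, created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, media_file_id;
```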

**model_config Example:**

```json
{
  "model": "distil-large-v3",
  "language": "en",
  "temperature": 0.0,
  "response_format": "verbose_json"
}
```

### transcription_results

Stores the actual transcription outputs with versioning support.

```sql
CREATE TABLE transcription_results (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES transcription_jobs(id),
    media_file_id UUID NOT NULL REFERENCES media_files(id),
    pipeline_version VARCHAR(10) DEFAULT 'v1' NOT NULL,
    content JSONB NOT NULL,
    segments JSONB,
    confidence_scores JSONB,
    speaker_info JSONB,
    accuracy FLOAT,
    word_count INTEGER,
    processing_time FLOAT,
    model_used VARCHAR(100),
    model_config JSONB,
    parent_result_id UUID REFERENCES transcription_results(id),
    version INTEGER DEFAULT 1 NOT NULL,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_transcription_results_pipeline_version ON transcription_results(pipeline_version);
CREATE INDEX idx_transcription_results_media_file_id ON transcription_results(media_file_id);
CREATE INDEX idx_transcription_results_parent_result_id ON transcription_results(parent_result_id);
```

**Pipeline Versions:**

- `v1` - Whisper distil-large-v3 only
- `v2` - Whisper + DeepSeek enhancement
- `v3` - Multi-pass accuracy optimization
- `v4` - Speaker diarization support

**content JSONB Example (v1):**

```json
{
  "text": "Complete transcript text here...",
  "language": "en",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 1.5,
      "text": "Hello world",
      "confidence": 0.95
    }
  ]
}
```

**content JSONB Example (v2 - Enhanced):**

```json
{
  "original_text": "Original Whisper output...",
  "enhanced_text": "Enhanced and corrected text...",
  "improvements": [
    "Fixed grammar in sentence 3",
    "Corrected technical terms",
    "Added punctuation"
  ],
  "enhancement_metadata": {
    "model": "deepseek-chat",
    "confidence": 0.98,
    "processing_time_ms": 2500
  }
}
```
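
Re-runs link back through `parent_result_id` and bump `version`, so "the current transcript" is a query rather than a flag. A sketch that picks the newest result per media file:

```sql
-- Sketch: latest result per media file across version chains.
SELECT DISTINCT ON (media_file_id) *
FROM transcription_results
ORDER BY media_file_id, version DESC, created_at DESC;
```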

### processing_jobs

Tracks batch processing operations.

```sql
CREATE TABLE processing_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) DEFAULT 'pending' NOT NULL,
    config JSONB DEFAULT '{}',
    file_patterns JSONB,
    total_items INTEGER DEFAULT 0,
    processed_items INTEGER DEFAULT 0,
    successful_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW() NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW() NOT NULL
);

CREATE INDEX idx_processing_jobs_type ON processing_jobs(job_type);
CREATE INDEX idx_processing_jobs_status ON processing_jobs(status);
```

## Relationships

```
youtube_videos (1) ──→ (0..n) media_files
media_files (1) ──→ (0..n) transcription_jobs
transcription_jobs (1) ──→ (0..n) transcription_results
transcription_results (0..1) ──→ (0..n) transcription_results (parent/child versioning)
```
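
Walking the whole chain is a three-join query. A sketch that pulls the transcript text for every completed job:

```sql
-- Sketch: from source video down to transcript text.
SELECT yv.title, mf.filename, tr.pipeline_version,
       tr.content->>'text' AS transcript
FROM youtube_videos yv
JOIN media_files mf ON mf.youtube_video_id = yv.id
JOIN transcription_jobs tj ON tj.media_file_id = mf.id
JOIN transcription_results tr ON tr.job_id = tj.id
WHERE tj.status = 'completed';
```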

## Registry Pattern Implementation

Trax uses SQLAlchemy's registry pattern to prevent "multiple classes" errors:

```python
# src/database/__init__.py
from sqlalchemy.orm import declarative_base, registry

# A single shared registry (and a single Base built on it) is what
# prevents the "multiple classes found" mapper errors: every model
# module must import Base from here rather than creating its own
# declarative base.
mapper_registry = registry()
Base = declarative_base(registry=mapper_registry)

def register_model(cls):
    """Mark a model as registered through the shared registry.

    A marker/hook around the shared Base above; applying it documents
    that the class is mapped exactly once via that registry.
    """
    return cls

# src/database/models.py
@register_model
class YouTubeVideo(Base, TimestampedMixin):
    __tablename__ = "youtube_videos"
    # ... model definition
```

## JSONB Query Examples

### Search transcript content

```sql
-- Find transcripts containing specific text
SELECT tr.id, mf.filename, tr.content->>'text'
FROM transcription_results tr
JOIN media_files mf ON tr.media_file_id = mf.id
WHERE tr.content->>'text' ILIKE '%machine learning%';

-- Find high-confidence segments
SELECT tr.id, segment->>'text', segment->>'confidence'
FROM transcription_results tr,
     jsonb_array_elements(tr.content->'segments') segment
WHERE (segment->>'confidence')::float > 0.95;
```

### Query file metadata

```sql
-- Find files by format (duration is a regular column, not a JSONB key)
SELECT filename, file_metadata->>'format', duration
FROM media_files
WHERE file_metadata->>'format' = 'mp4';

-- Files larger than 100 MiB (104857600 bytes)
SELECT filename, file_size, file_metadata->>'bitrate'
FROM media_files
WHERE file_size > 104857600;
```

### Performance analytics

```sql
-- Average processing time by pipeline version
SELECT
    pipeline_version,
    AVG(processing_time) AS avg_time,
    AVG(accuracy) AS avg_accuracy,
    COUNT(*) AS total_transcripts
FROM transcription_results
GROUP BY pipeline_version;

-- Failed jobs analysis
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS failed_jobs,
    COUNT(DISTINCT media_file_id) AS unique_files
FROM transcription_jobs
WHERE status = 'failed'
GROUP BY DATE(created_at)
ORDER BY date DESC;
```

## Migrations

The database schema is managed via Alembic migrations:

```bash
# Create a new migration
uv run alembic revision -m "Add new feature"

# Apply migrations
uv run alembic upgrade head

# Downgrade one revision
uv run alembic downgrade -1

# Show the current revision
uv run alembic current
```

## Performance Considerations

### Indexes

- All foreign keys are indexed
- Status columns are indexed for filtering
- JSONB columns benefit from GIN indexes for structured search (sketch below)
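
The table definitions above do not include GIN indexes, so treat the following as a sketch. A default (`jsonb_ops`) GIN index accelerates containment queries like the `@>` example earlier; substring searches with `ILIKE` would need a `pg_trgm` index instead:

```sql
-- Sketch: GIN indexes backing JSONB containment queries (@>).
CREATE INDEX idx_media_files_metadata_gin
    ON media_files USING GIN (file_metadata);
CREATE INDEX idx_transcription_results_content_gin
    ON transcription_results USING GIN (content);
```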

### Partitioning (Future)

For high-volume usage, consider partitioning large tables (a DDL sketch follows the list):

- `transcription_results` by `created_at` (monthly partitions)
- `media_files` by `status` and `created_at`
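
A sketch of what monthly range partitioning could look like. The table name is hypothetical, and note that PostgreSQL requires the partition key to be part of the primary key:

```sql
-- Sketch: monthly range partitions for transcription results.
CREATE TABLE transcription_results_part (
    LIKE transcription_results INCLUDING DEFAULTS,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE transcription_results_2025_01
    PARTITION OF transcription_results_part
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```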

### Connection Pooling

The PostgreSQL connection pool is configured as follows:

- Pool size: 20 connections
- Max overflow: 30 connections
- Pool timeout: 30 seconds
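
With these settings the pool can open up to 50 connections (pool size plus overflow), so the server's `max_connections` needs headroom above that for other clients. A quick sanity check:

```sql
-- The pool may open up to 20 + 30 = 50 connections;
-- confirm the server limit leaves headroom.
SHOW max_connections;

SELECT count(*) AS open_connections
FROM pg_stat_activity;
```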

For complete schema updates, see the migration files in `migrations/versions/`.