# Checkpoint 3: Architecture Design Report
## Modular Backend Architecture for Media Processing
### 1. Database Architecture (PostgreSQL)
#### Why PostgreSQL
- Multiple services planned (summarizer, frontend server)
- Concurrent access requirements
- Better testing tools (pg_tap, factory patterns)
- Professional migration tools with Alembic
- JSON/JSONB support for transcript data
- Scales better than SQLite for production use
#### Database Schema Design
```sql
-- Core Tables
CREATE TABLE media_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_url TEXT,                        -- YouTube URL or null for uploads
    local_path TEXT NOT NULL,               -- Where file is stored
    media_type VARCHAR(10),                 -- mp3, mp4, wav, etc.
    duration_seconds INTEGER,
    file_size_bytes BIGINT,
    download_status VARCHAR(20),            -- pending, downloading, completed, failed
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE transcripts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID REFERENCES media_files(id),
    pipeline_version VARCHAR(10),           -- 'v1', 'v2', 'v3', 'v4'
    raw_content JSONB NOT NULL,             -- Original Whisper output
    enhanced_content JSONB,                 -- AI-enhanced version (v2+)
    multipass_content JSONB,                -- Multi-pass merged (v3+)
    diarized_content JSONB,                 -- Speaker separated (v4+)
    text_content TEXT,                      -- Plain text for search
    model_used VARCHAR(50),                 -- whisper model version
    processing_time_ms INTEGER,
    word_count INTEGER,
    processing_metadata JSONB,              -- Version-specific metadata
    created_at TIMESTAMP DEFAULT NOW(),
    enhanced_at TIMESTAMP,
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE exports (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transcript_id UUID REFERENCES transcripts(id),
    format VARCHAR(10),                     -- json, txt
    file_path TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE cache_entries (
    cache_key VARCHAR(255) PRIMARY KEY,
    cache_type VARCHAR(50),                 -- embedding, query, etc.
    value JSONB,
    compressed BOOLEAN DEFAULT FALSE,
    ttl_seconds INTEGER,
    expires_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Batch Processing Tables
CREATE TABLE batch_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status VARCHAR(20),                     -- pending, processing, completed, failed
    total_items INTEGER,
    completed_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    completed_at TIMESTAMP
);

CREATE TABLE batch_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    batch_job_id UUID REFERENCES batch_jobs(id),
    media_file_id UUID REFERENCES media_files(id),
    status VARCHAR(20),
    error_message TEXT,
    processing_order INTEGER,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Audio Processing Metadata
CREATE TABLE audio_processing_metadata (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    media_file_id UUID REFERENCES media_files(id),
    original_format VARCHAR(10),
    original_sample_rate INTEGER,
    original_channels INTEGER,
    processed_sample_rate INTEGER,          -- Should be 16000
    processed_channels INTEGER,             -- Should be 1 (mono)
    noise_level_db FLOAT,
    preprocessing_steps JSONB,              -- Array of applied steps
    quality_score FLOAT,                    -- 0-1 quality assessment
    created_at TIMESTAMP DEFAULT NOW()
);

-- Version-specific tables (added incrementally)
-- Phase 3 adds:
CREATE TABLE multipass_runs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transcript_id UUID REFERENCES transcripts(id),
    pass_number INTEGER,
    model_used VARCHAR(50),
    confidence_scores JSONB,
    segment_variations JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Phase 4 adds:
CREATE TABLE speaker_profiles (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transcript_id UUID REFERENCES transcripts(id),
    speaker_label VARCHAR(50),
    voice_embedding BYTEA,
    total_speaking_time FLOAT,
    created_at TIMESTAMP DEFAULT NOW()
);
```
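
For orientation, here is a minimal sketch of how the `transcripts` table could map into the SQLAlchemy models planned under `src/models/`. The column subset, local `Base` definition, and defaults shown are illustrative; the real `models/base.py` would own the shared registry.

```python
# src/models/transcript.py -- illustrative sketch, assuming a shared Base in models/base.py
import uuid
from datetime import datetime

from sqlalchemy import ForeignKey, Integer, String, Text
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Transcript(Base):
    __tablename__ = "transcripts"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    media_file_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("media_files.id"))
    pipeline_version: Mapped[str] = mapped_column(String(10))
    raw_content: Mapped[dict] = mapped_column(JSONB)                          # Original Whisper output
    enhanced_content: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
    text_content: Mapped[str | None] = mapped_column(Text, nullable=True)     # Plain text for search
    model_used: Mapped[str | None] = mapped_column(String(50), nullable=True)
    word_count: Mapped[int | None] = mapped_column(Integer, nullable=True)
    created_at: Mapped[datetime] = mapped_column(default=datetime.utcnow)
```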
### 2. Service Layer Architecture (Iterative Design)
#### Modular, Refactorable Structure
```
trax/
├── src/
│   ├── core/
│   │   ├── config.py              # Existing configuration
│   │   ├── database.py            # PostgreSQL + Alembic setup
│   │   ├── exceptions.py          # Custom exceptions
│   │   └── protocols.py           # Abstract protocols for services
│   │
│   ├── models/
│   │   ├── __init__.py
│   │   ├── base.py                # SQLAlchemy base with registry pattern
│   │   ├── media.py               # MediaFile model
│   │   ├── transcript.py          # Transcript model
│   │   ├── batch.py               # Batch processing models
│   │   └── cache.py               # CacheEntry model
│   │
│   ├── services/
│   │   ├── batch/                 # PRIORITY: Batch processing first
│   │   │   ├── __init__.py
│   │   │   ├── queue.py           # Batch queue management
│   │   │   ├── processor.py       # Parallel batch processor
│   │   │   └── monitor.py         # Progress tracking
│   │   │
│   │   ├── media/
│   │   │   ├── __init__.py
│   │   │   ├── downloader.py      # Generic downloader protocol
│   │   │   ├── youtube.py         # YouTube implementation (yt-dlp)
│   │   │   ├── local.py           # Local file handler
│   │   │   └── converter.py       # FFmpeg service
│   │   │
│   │   ├── audio/                 # Pre/post-processing
│   │   │   ├── __init__.py
│   │   │   ├── preprocessor.py    # Audio optimization
│   │   │   ├── postprocessor.py   # Transcript enhancement
│   │   │   ├── analyzer.py        # Quality assessment
│   │   │   └── enhancer.py        # Noise reduction
│   │   │
│   │   ├── transcription/         # Iterative versions
│   │   │   ├── v1_basic/          # Phase 1: Single pass
│   │   │   │   ├── whisper.py     # Basic transcription
│   │   │   │   └── optimizer.py   # M3 optimizations
│   │   │   │
│   │   │   ├── v2_enhanced/       # Phase 2: + AI enhancement
│   │   │   │   ├── enhancer.py    # DeepSeek enhancement
│   │   │   │   └── validator.py   # Quality checks
│   │   │   │
│   │   │   ├── v3_multipass/      # Phase 3: + Multiple passes
│   │   │   │   ├── multipass.py   # Compare multiple runs
│   │   │   │   ├── merger.py      # Merge best segments
│   │   │   │   └── confidence.py  # Confidence scoring
│   │   │   │
│   │   │   ├── v4_diarization/    # Phase 4: + Speaker identification
│   │   │   │   ├── diarizer.py    # Speaker separation
│   │   │   │   ├── voice_db.py    # Voice embeddings
│   │   │   │   └── labeler.py     # Speaker labels
│   │   │   │
│   │   │   └── pipeline.py        # Orchestrates current version
│   │   │
│   │   ├── enhancement/           # AI enhancement layer
│   │   │   ├── __init__.py
│   │   │   ├── protocol.py        # Enhancement protocol
│   │   │   ├── deepseek.py        # DeepSeek enhancer
│   │   │   ├── enhancer_rules.py  # Enhancement rules
│   │   │   └── templates/
│   │   │       ├── enhancement_prompt.txt
│   │   │       └── structured_output.json
│   │   │
│   │   ├── cache/                 # Later priority
│   │   │   ├── __init__.py
│   │   │   ├── base.py            # Cache protocol
│   │   │   ├── embedding.py       # Embedding cache
│   │   │   └── manager.py         # Cache orchestrator
│   │   │
│   │   └── export/
│   │       ├── __init__.py
│   │       ├── json_export.py     # JSON exporter
│   │       ├── text_backup.py     # TXT backup
│   │       └── batch_export.py    # Bulk export handling
│   │
│   ├── agents/
│   │   ├── rules/                 # Consistency rules
│   │   │   ├── TRANSCRIPTION_RULES.md
│   │   │   ├── BATCH_PROCESSING_RULES.md
│   │   │   ├── CACHING_RULES.md
│   │   │   ├── EXPORT_RULES.md
│   │   │   └── DATABASE_RULES.md
│   │   │
│   │   └── templates/             # Structured outputs
│   │       ├── transcript_output.json
│   │       ├── error_response.json
│   │       └── batch_status.json
│   │
│   └── cli/
│       ├── __init__.py
│       ├── main.py                # CLI entry point
│       └── commands/
│           ├── transcribe.py      # Transcribe command
│           ├── batch.py           # Batch processing
│           ├── export.py          # Export command
│           └── cache.py           # Cache management
```
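
As a companion sketch, `core/database.py` could expose the engine and session plumbing roughly as follows. This assumes SQLAlchemy's async engine with the `asyncpg` driver; the `get_session` helper is an illustrative name, and Alembic would manage schema migrations separately.

```python
# src/core/database.py -- illustrative sketch, not the final implementation
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine

from .config import Config

# Assumes DATABASE_URL uses the asyncpg driver, e.g. postgresql+asyncpg://localhost/trax
engine = create_async_engine(Config.DATABASE_URL, pool_size=Config.DATABASE_POOL_SIZE)
SessionFactory = async_sessionmaker(engine, expire_on_commit=False)


@asynccontextmanager
async def get_session() -> AsyncIterator[AsyncSession]:
    """Yield a session, commit on success, roll back on error (fail-fast)."""
    async with SessionFactory() as session:
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise
```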
### 3. Protocol-Based Design for Maximum Refactorability
#### Core Protocols (Abstract Base Classes)
```python
# src/core/protocols.py
from abc import abstractmethod
from pathlib import Path
from typing import Any, List, Protocol


class MediaDownloader(Protocol):
    """Protocol for media downloaders"""

    @abstractmethod
    async def download(self, source: str, destination: Path) -> Path:
        """Download media from source to destination"""
        ...

    @abstractmethod
    def can_handle(self, source: str) -> bool:
        """Check if this downloader can handle the source"""
        ...


class Transcriber(Protocol):
    """Protocol for transcription services"""

    @abstractmethod
    async def transcribe(self, audio_path: Path) -> dict:
        """Transcribe audio file to structured format"""
        ...

    @abstractmethod
    def get_optimal_settings(self, file_size: int) -> dict:
        """Get optimal settings based on file characteristics"""
        ...


class BatchProcessor(Protocol):
    """Protocol for batch processing"""

    @abstractmethod
    async def process_batch(self, items: List[Path]) -> List[dict]:
        """Process multiple items in parallel"""
        ...

    @abstractmethod
    async def get_progress(self, batch_id: str) -> dict:
        """Get progress of batch processing"""
        ...


class Enhancer(Protocol):
    """Protocol for AI enhancement"""

    @abstractmethod
    async def enhance(self, transcript: dict) -> dict:
        """Enhance transcript with AI"""
        ...


class AudioProcessor(Protocol):
    """Protocol for audio processing"""

    @abstractmethod
    async def preprocess(self, audio_path: Path) -> Path:
        """Preprocess audio for optimal transcription"""
        ...

    @abstractmethod
    async def analyze_quality(self, audio_path: Path) -> float:
        """Analyze audio quality (0-1 score)"""
        ...


class CacheService(Protocol):
    """Protocol for cache services"""

    @abstractmethod
    async def get(self, key: str) -> Any:
        """Get value from cache"""
        ...

    @abstractmethod
    async def set(self, key: str, value: Any, ttl: int) -> None:
        """Set value in cache with TTL"""
        ...


class Exporter(Protocol):
    """Protocol for export services"""

    @abstractmethod
    async def export(self, transcript: dict, path: Path) -> Path:
        """Export transcript to specified format"""
        ...

    @abstractmethod
    def get_supported_formats(self) -> List[str]:
        """Get list of supported export formats"""
        ...
```
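
Because `typing.Protocol` relies on structural typing, concrete services only need matching method signatures; they never import or subclass the protocol. As a hedged sketch (not the actual `services/media/youtube.py`), a yt-dlp-backed downloader satisfying `MediaDownloader` might look like this; the option names come from yt-dlp's documented `YoutubeDL` parameters, while the destination handling is an assumption:

```python
# Illustrative sketch of a MediaDownloader implementation backed by yt-dlp
import asyncio
from pathlib import Path

import yt_dlp


class YouTubeDownloader:
    """Satisfies the MediaDownloader protocol structurally (no inheritance needed)."""

    def can_handle(self, source: str) -> bool:
        return "youtube.com" in source or "youtu.be" in source

    async def download(self, source: str, destination: Path) -> Path:
        destination.mkdir(parents=True, exist_ok=True)
        opts = {
            "format": "bestaudio/best",                      # Prefer audio-only streams
            "outtmpl": str(destination / "%(id)s.%(ext)s"),  # e.g. data/media/<video-id>.webm
            "quiet": True,
        }

        def _run() -> Path:
            with yt_dlp.YoutubeDL(opts) as ydl:
                info = ydl.extract_info(source, download=True)
                return Path(ydl.prepare_filename(info))

        # yt-dlp is blocking, so keep it off the event loop
        return await asyncio.to_thread(_run)
```

Swapping in another source (e.g. the local file handler) then only requires another class with the same two methods.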
### 4. Clean Iteration Strategy
#### Phase-Based Pipeline Evolution
```python
# Phase 1: MVP (Week 1-2)
async def transcribe_v1(audio_path: Path) -> dict:
    """Basic transcription with optimizations"""
    audio = await preprocess(audio_path)  # 16kHz mono
    transcript = await whisper.transcribe(audio)
    return format_json(transcript)


# Phase 2: Enhanced (Week 3)
async def transcribe_v2(audio_path: Path) -> dict:
    """v1 + AI enhancement"""
    transcript = await transcribe_v1(audio_path)
    enhanced = await deepseek.enhance(transcript)
    return enhanced


# Phase 3: Multi-pass (Week 4-5)
async def transcribe_v3(audio_path: Path) -> dict:
    """v2 + multiple passes for accuracy"""
    passes = []
    for i in range(3):
        transcript = await transcribe_v1_with_params(
            audio_path,
            temperature=0.1 * i  # Vary parameters
        )
        passes.append(transcript)
    merged = merge_best_segments(passes)
    enhanced = await deepseek.enhance(merged)
    return enhanced


# Phase 4: Diarization (Week 6+)
async def transcribe_v4(audio_path: Path) -> dict:
    """v3 + speaker diarization"""
    transcript = await transcribe_v3(audio_path)
    # Add speaker identification
    speakers = await diarize_audio(audio_path)
    labeled = assign_speakers(transcript, speakers)
    return labeled
```
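
The `merge_best_segments` step above is only named, not defined. A minimal sketch of the idea, assuming each pass yields Whisper-style segments with an `avg_logprob` confidence field and that all passes produce the same number of segments (the real `merger.py` would need to align differing segment boundaries):

```python
# Hypothetical sketch of merge_best_segments: per segment position, keep the
# candidate with the highest average log-probability across passes.
from typing import List


def merge_best_segments(passes: List[dict]) -> dict:
    """Merge multiple transcription passes by keeping the most confident segment at each index."""
    # Assumes every pass has the shape {"segments": [{"start", "end", "text", "avg_logprob"}, ...]}
    merged = []
    for candidates in zip(*(p["segments"] for p in passes)):
        best = max(candidates, key=lambda seg: seg.get("avg_logprob", float("-inf")))
        merged.append(best)
    return {
        "segments": merged,
        "text": " ".join(seg["text"].strip() for seg in merged),
    }
```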
### 5. Configuration Management
#### Extended Configuration for Services
```python
# src/core/config.py (extended)
import os
from pathlib import Path


class Config:
    # ... existing configuration ...

    # Database
    DATABASE_URL = os.getenv("DATABASE_URL", "postgresql://localhost/trax")
    DATABASE_POOL_SIZE = int(os.getenv("DATABASE_POOL_SIZE", "10"))

    # Media Processing
    MEDIA_STORAGE_PATH = Path(os.getenv("MEDIA_STORAGE_PATH", "./data/media"))
    MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "500"))
    SUPPORTED_FORMATS = ["mp3", "mp4", "wav", "m4a", "webm"]

    # Transcription
    WHISPER_MODEL = os.getenv("WHISPER_MODEL", "distil-large-v3")
    WHISPER_DEVICE = os.getenv("WHISPER_DEVICE", "cpu")
    WHISPER_COMPUTE_TYPE = os.getenv("WHISPER_COMPUTE_TYPE", "int8_float32")
    CHUNK_LENGTH_SECONDS = int(os.getenv("CHUNK_LENGTH_SECONDS", "600"))  # 10 minutes

    # Pipeline Versioning
    PIPELINE_VERSION = os.getenv("PIPELINE_VERSION", "v1")
    ENABLE_ENHANCEMENT = os.getenv("ENABLE_ENHANCEMENT", "false") == "true"
    ENABLE_MULTIPASS = os.getenv("ENABLE_MULTIPASS", "false") == "true"
    ENABLE_DIARIZATION = os.getenv("ENABLE_DIARIZATION", "false") == "true"

    # Batch Processing
    BATCH_SIZE = int(os.getenv("BATCH_SIZE", "10"))
    BATCH_TIMEOUT_SECONDS = int(os.getenv("BATCH_TIMEOUT_SECONDS", "3600"))
    MAX_PARALLEL_JOBS = int(os.getenv("MAX_PARALLEL_JOBS", "4"))

    # AI Enhancement
    ENHANCEMENT_MODEL = os.getenv("ENHANCEMENT_MODEL", "deepseek-chat")
    ENHANCEMENT_MAX_RETRIES = int(os.getenv("ENHANCEMENT_MAX_RETRIES", "3"))

    # Caching
    CACHE_TTL_EMBEDDING = int(os.getenv("CACHE_TTL_EMBEDDING", "86400"))     # 24h
    CACHE_TTL_TRANSCRIPT = int(os.getenv("CACHE_TTL_TRANSCRIPT", "604800"))  # 7d
    CACHE_BACKEND = os.getenv("CACHE_BACKEND", "sqlite")  # sqlite or redis

    # Export
    EXPORT_PATH = Path(os.getenv("EXPORT_PATH", "./data/exports"))
    DEFAULT_EXPORT_FORMAT = os.getenv("DEFAULT_EXPORT_FORMAT", "json")

    # Audio Processing
    AUDIO_SAMPLE_RATE = int(os.getenv("AUDIO_SAMPLE_RATE", "16000"))
    AUDIO_CHANNELS = int(os.getenv("AUDIO_CHANNELS", "1"))  # Mono
    AUDIO_NORMALIZE_DB = float(os.getenv("AUDIO_NORMALIZE_DB", "-3.0"))

    # Multi-pass Settings
    MULTIPASS_RUNS = int(os.getenv("MULTIPASS_RUNS", "3"))
    MULTIPASS_MERGE_STRATEGY = os.getenv("MULTIPASS_MERGE_STRATEGY", "confidence")

    # Diarization Settings
    MAX_SPEAKERS = int(os.getenv("MAX_SPEAKERS", "10"))
    MIN_SPEAKER_DURATION = float(os.getenv("MIN_SPEAKER_DURATION", "1.0"))
```
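
Since `PIPELINE_VERSION` and the three `ENABLE_*` flags encode overlapping state, one possible convention (an assumption, not something specified above) is to derive the flags from the version string so a single environment variable can select a pipeline:

```python
# Hypothetical helper: derive feature flags from PIPELINE_VERSION ("v1"-"v4")
def flags_for_version(version: str) -> dict:
    level = int(version.lstrip("v"))  # "v3" -> 3
    return {
        "ENABLE_ENHANCEMENT": level >= 2,   # v2 adds AI enhancement
        "ENABLE_MULTIPASS": level >= 3,     # v3 adds multi-pass
        "ENABLE_DIARIZATION": level >= 4,   # v4 adds diarization
    }


assert flags_for_version("v3") == {
    "ENABLE_ENHANCEMENT": True,
    "ENABLE_MULTIPASS": True,
    "ENABLE_DIARIZATION": False,
}
```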
### 6. Testing Architecture
#### Test Structure for Easy Refactoring
```
tests/
├── conftest.py                    # Shared fixtures
├── factories/                     # Test data factories
│   ├── media_factory.py
│   ├── transcript_factory.py
│   └── batch_factory.py
├── fixtures/
│   ├── audio/
│   │   ├── sample_5s.wav          # 5-second test file
│   │   ├── sample_30s.mp3         # 30-second test file
│   │   ├── sample_2m.mp4          # 2-minute test file
│   │   └── sample_noisy.wav       # Noisy audio for testing
│   └── transcripts/
│       ├── expected_v1.json       # Expected output for v1
│       ├── expected_v2.json       # Expected output for v2
│       └── expected_v3.json       # Expected output for v3
├── unit/
│   ├── services/
│   │   ├── test_batch_processor.py
│   │   ├── test_downloader.py
│   │   ├── test_transcriber.py
│   │   ├── test_enhancer.py
│   │   └── test_cache.py
│   ├── models/
│   │   └── test_models.py
│   └── test_protocols.py          # Protocol compliance tests
├── integration/
│   ├── test_pipeline_v1.py        # Basic pipeline
│   ├── test_pipeline_v2.py        # With enhancement
│   ├── test_pipeline_v3.py        # With multi-pass
│   ├── test_batch_processing.py   # Batch operations
│   └── test_database.py           # Database operations
└── performance/
    ├── test_speed.py              # Speed benchmarks
    └── test_accuracy.py           # Accuracy measurements
```
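
To make the real-test-files approach concrete, `conftest.py` could expose the bundled fixtures directly to pytest. The sketch below is illustrative and assumes the fixture filenames listed above:

```python
# tests/conftest.py -- illustrative sketch wiring the real audio fixtures into pytest
from pathlib import Path

import pytest

FIXTURES_DIR = Path(__file__).parent / "fixtures"


@pytest.fixture
def short_sample() -> Path:
    """5-second clip for fast unit tests."""
    return FIXTURES_DIR / "audio" / "sample_5s.wav"


@pytest.fixture
def noisy_sample() -> Path:
    """Noisy clip for audio preprocessing and quality-score tests."""
    return FIXTURES_DIR / "audio" / "sample_noisy.wav"


@pytest.fixture
def expected_v1() -> Path:
    """Golden output for the v1 pipeline."""
    return FIXTURES_DIR / "transcripts" / "expected_v1.json"
```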
### 7. Pipeline Orchestration
#### Smart Pipeline Selection
```python
class PipelineOrchestrator:
    """Orchestrates pipeline based on configuration"""

    async def process(self, audio_path: Path) -> dict:
        """Process audio through appropriate pipeline version"""
        # Start with v1 (always available)
        result = await self.transcribe_v1(audio_path)

        # Progressively add features based on config
        if config.ENABLE_MULTIPASS:
            result = await self.add_multipass(audio_path, result)
        if config.ENABLE_ENHANCEMENT:
            result = await self.add_enhancement(result)
        if config.ENABLE_DIARIZATION:
            result = await self.add_diarization(audio_path, result)

        return result

    async def process_batch(self, paths: List[Path]) -> List[dict]:
        """Process multiple files efficiently"""
        tasks = [self.process(path) for path in paths]
        return await asyncio.gather(*tasks)
```
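
One caveat on the sketch above: a bare `asyncio.gather` launches every file at once. The batch processor would more likely bound concurrency with `MAX_PARALLEL_JOBS` from the configuration; a hedged sketch of that variant:

```python
# Illustrative variant of process_batch bounded by MAX_PARALLEL_JOBS
import asyncio
from pathlib import Path
from typing import List


async def process_batch_bounded(orchestrator, paths: List[Path], max_jobs: int = 4) -> List[dict]:
    semaphore = asyncio.Semaphore(max_jobs)

    async def run(path: Path) -> dict:
        async with semaphore:  # At most max_jobs files in flight
            return await orchestrator.process(path)

    return await asyncio.gather(*(run(p) for p in paths))
```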
### Summary
This architecture provides:
1. **Clean iterations** through versioned pipelines (v1→v4)
2. **Protocol-based design** for easy component swapping
3. **Batch-first approach** as requested
4. **Real test files** instead of mocks
5. **PostgreSQL** for multi-service support
6. **Fail-fast** error handling
7. **CLI-first** interface
The design prioritizes **iterability** and **batch processing** over caching, with clear upgrade paths between versions.
---
*Generated: 2024*
*Status: COMPLETE*
*Next: Team Structure Report*