trax/docs/architecture/development-patterns.md

Development Patterns & Historical Learnings

Overview

This document captures the key architectural patterns and lessons learned from the YouTube Summarizer project that should be applied to Trax development.

Successful Patterns to Implement

1. Download-First Architecture

Why it worked: Prevents streaming failures, enables retry without re-downloading, allows offline processing.

# ALWAYS download media before processing - never stream
from pathlib import Path

async def acquire_media(source: str) -> Path:
    if source.startswith(('http://', 'https://')):
        return await download_media(source)  # Download to temp file
    elif Path(source).exists():
        return Path(source)
    else:
        raise ValueError(f"Invalid source: {source}")
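`download_media` is left undefined in the snippet above; here is a minimal stdlib-only sketch of it. A production version would more likely use aiohttp or yt-dlp and stream the body in chunks rather than reading it whole, so treat this purely as an illustration of "download to a temp file, return the path":

```python
import asyncio
import tempfile
import urllib.parse
import urllib.request
from pathlib import Path

async def download_media(url: str) -> Path:
    """Fetch a URL to a local temp file and return its path."""
    def _fetch() -> Path:
        # Preserve the source extension so downstream tools can sniff format
        suffix = Path(urllib.parse.urlparse(url).path).suffix or ".bin"
        with urllib.request.urlopen(url) as resp, tempfile.NamedTemporaryFile(
            suffix=suffix, delete=False
        ) as tmp:
            tmp.write(resp.read())
            return Path(tmp.name)

    # Run the blocking fetch off the event loop
    return await asyncio.to_thread(_fetch)
```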

2. Multi-Layer Caching System (90% cost reduction)

Why it worked: Different data has different lifespans, and aggressive caching reduces API costs dramatically.

# Different TTLs for different data types
cache_layers = {
    'embedding': 86400,      # 24h - embeddings are stable
    'analysis': 604800,      # 7d - multi-agent results
    'query': 21600,          # 6h - RAG query results
    'prompt': 2592000        # 30d - prompt complexity
}
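The TTL map above can be wrapped in a small layered cache. This is an in-memory sketch (the `LayeredCache` name and dict storage are illustrative; the real system would more plausibly sit on Redis or disk, with LZ4 compression as noted later):

```python
import time
from typing import Any, Optional

class LayeredCache:
    """Per-layer TTL cache: each layer expires on its own schedule."""

    def __init__(self, ttls: dict):
        self._ttls = ttls                      # layer name -> TTL in seconds
        self._store = {}                       # (layer, key) -> (expiry, value)

    def set(self, layer: str, key: str, value: Any) -> None:
        expiry = time.monotonic() + self._ttls[layer]
        self._store[(layer, key)] = (expiry, value)

    def get(self, layer: str, key: str) -> Optional[Any]:
        entry = self._store.get((layer, key))
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() > expiry:
            del self._store[(layer, key)]      # lazily evict expired entries
            return None
        return value
```

A call site would look like `cache.set('embedding', text_hash, vector)` followed later by `cache.get('embedding', text_hash)`, falling through to the API only on a miss.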

3. Protocol-Based Services

Why it worked: Easy swapping of implementations, maximum refactorability.

from pathlib import Path
from typing import Protocol

class TranscriptionProtocol(Protocol):
    async def transcribe(self, audio: Path) -> dict:
        ...

# Easy swapping of implementations - no inheritance required,
# FasterWhisperService satisfies the protocol structurally
class FasterWhisperService:
    async def transcribe(self, audio: Path) -> dict:
        # Implementation
        pass

4. Database Registry Pattern

Why it worked: Prevents SQLAlchemy "multiple classes" errors, centralized model registration.

# Prevents SQLAlchemy "multiple classes" errors
from sqlalchemy.orm import declarative_base

class DatabaseRegistry:
    _instance = None
    _base = None
    _models = {}

    @classmethod
    def get_base(cls):
        if cls._base is None:
            cls._base = declarative_base()
        return cls._base

5. Real Files Testing

Why it worked: Caught actual edge cases, more reliable than mocks.

# Use actual audio files in tests - no mocks
from pathlib import Path

import pytest

@pytest.fixture
def sample_audio_5s():
    return Path("tests/fixtures/audio/sample_5s.wav")  # Real file

@pytest.fixture
def sample_video_2m():
    return Path("tests/fixtures/audio/sample_2m.mp4")  # Real file

6. JSON + TXT Export Strategy

Why it worked: Reduced complexity by 80%; JSON carries the structure, TXT stays human-readable.

# Export strategy
import json

def export_transcript(transcript: dict, format: str) -> str:
    if format == "json":
        return json.dumps(transcript, indent=2)
    elif format == "txt":
        return format_as_text(transcript)
    else:
        # Generate other formats from JSON
        return generate_from_json(transcript, format)

7. Enhanced CLI with Progress Reporting

Why it works: Provides real-time feedback, performance monitoring, and user-friendly error handling.

# Rich progress bars with real-time monitoring
from rich.console import Console
from rich.progress import (
    BarColumn,
    Progress,
    TaskProgressColumn,
    TextColumn,
    TimeRemainingColumn,
)

console = Console()

with Progress(
    TextColumn("[bold blue]{task.description}"),
    BarColumn(),
    TaskProgressColumn(),
    TimeRemainingColumn(),
) as progress:
    task = progress.add_task("Transcribing...", total=100)

    # Real-time performance monitoring
    stats = get_performance_stats()
    console.print(f"CPU: {stats['cpu_percent']}% | Memory: {stats['memory_used_gb']}GB")

8. Protocol-Based CLI Architecture

Why it works: Enables easy testing, modular design, and service integration.

from typing import Optional, Protocol

class TranscriptionCommandProtocol(Protocol):
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        ...

class EnhancedTranscribeCommand:
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        # Implementation with progress reporting
        pass

Failed Patterns to Avoid

1. Streaming Processing

  • Network interruptions cause failures
  • Can't retry without full re-download
  • Much slower than local processing

2. Mock-Heavy Testing

  • Mocked services behave differently
  • Don't catch real edge cases
  • False confidence in tests

3. Parallel Frontend/Backend Development

  • Caused integration issues
  • Sequential development proved superior
  • Get data layer right before UI

4. Complex Export Formats

  • Multi-format system was high maintenance
  • JSON + TXT backup strategy worked best
  • Reduced complexity by 80%

5. Multiple Transcript Sources

  • Tried to merge YouTube captions with Whisper
  • Added complexity without quality improvement
  • Single source (Whisper) proved more reliable

Performance Optimizations That Worked

M3 Optimization

  • Model: distil-large-v3 (20-70x speed improvement)
  • Compute: int8_float32 for CPU optimization
  • Chunking: 10-minute segments with overlap
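The 10-minute chunking reduces to a pure boundary computation. A sketch follows; the document doesn't state the overlap length, so the 30-second default here is an assumption, as is the helper name:

```python
def chunk_bounds(duration_s: float, chunk_s: int = 600, overlap_s: int = 30) -> list:
    """Return (start, end) second offsets for overlapping chunks.

    Each chunk starts (chunk_s - overlap_s) after the previous one, so
    consecutive chunks share overlap_s seconds of audio for stitching
    transcripts back together without losing words at the seams.
    """
    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break
        start += step
    return bounds
```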

Audio Preprocessing

  • Sample Rate: 16kHz conversion (3x data reduction)
  • Channels: Mono conversion (2x data reduction)
  • VAD: Voice Activity Detection to skip silence
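The sample-rate and channel conversions map directly onto ffmpeg's `-ar` and `-ac` flags. The helper below only builds the command line (the wrapper name is illustrative; VAD is a separate model-based step and is not shown):

```python
from pathlib import Path

def preprocess_cmd(src: Path, dst: Path) -> list:
    """Build the ffmpeg invocation for 16kHz mono conversion.

    -ar 16000  resample to 16 kHz (3x data reduction from 48 kHz)
    -ac 1      downmix to mono (2x data reduction from stereo)
    -y         overwrite the destination if it exists
    """
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]
```

Run it with e.g. `subprocess.run(preprocess_cmd(src, dst), check=True)`.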

Caching Strategy

  • Embeddings: 24h TTL (stable for long periods)
  • Analysis: 7d TTL (expensive multi-agent results)
  • Queries: 6h TTL (RAG results)
  • Compression: LZ4 for storage efficiency

Critical Success Factors

For AI Code Generation Consistency

  1. Explicit Rules File: Like DATABASE_MODIFICATION_CHECKLIST.md
  2. Approval Gates: Each major change requires permission
  3. Test-First: Write test, then implementation
  4. Single Responsibility: One task at a time
  5. Context Limits: Keep docs under 600 LOC

For Media Processing Reliability

  1. Always Download First: Never stream
  2. Standardize Early: Convert to 16kHz mono WAV
  3. Chunk Large Files: 10-minute segments with overlap
  4. Cache Aggressively: Transcriptions are expensive
  5. Simple Formats: JSON + TXT only

For Project Success

  1. Backend-First: Get data layer right
  2. CLI Before GUI: Test via command line
  3. Modular Services: Each service independent
  4. Progressive Enhancement: Start simple, add features
  5. Document Decisions: Track why choices were made

Iterative Pipeline Architecture

Version Progression (v1 → v2 → v3 → v4)

v1: Basic Transcription (Weeks 1-2)

async def transcribe_v1(audio_path: Path) -> dict:
    # Preprocess to 16kHz mono
    processed = await preprocess_audio(audio_path)
    
    # Whisper with M3 optimizations
    transcript = await whisper.transcribe(
        processed,
        model="distil-large-v3",
        compute_type="int8_float32"
    )
    
    return format_transcript_json(transcript)
  • Targets: 95% accuracy, <30s for 5min audio, <2GB memory

v2: AI Enhancement (Week 3)

async def transcribe_v2(audio_path: Path) -> dict:
    transcript = await transcribe_v1(audio_path)
    enhanced = await enhance_with_ai(transcript, model="deepseek-chat")
    
    return {
        "raw": transcript,
        "enhanced": enhanced,
        "version": "v2"
    }
  • Targets: 99% accuracy, <35s processing, <5s enhancement time

v3: Multi-Pass Accuracy (Weeks 4-5)

async def transcribe_v3(audio_path: Path) -> dict:
    passes = []
    for i in range(3):
        transcript = await transcribe_v1_with_params(
            audio_path,
            temperature=0.0 + (0.2 * i),
            beam_size=2 + i
        )
        passes.append(transcript)
    
    merged = await merge_transcripts(passes, strategy="confidence_weighted")
    enhanced = await enhance_with_ai(merged)
    
    return {
        "raw": merged,
        "enhanced": enhanced,
        "passes": passes,
        "confidence_scores": calculate_confidence(passes),
        "version": "v3"
    }
  • Targets: 99.5% accuracy, <25s processing, confidence scores
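`merge_transcripts` with `strategy="confidence_weighted"` is referenced above but not defined. A minimal synchronous sketch, under two assumptions not stated in the original: passes are segment-aligned (same segment count and order), and each segment dict carries a `confidence` field:

```python
def merge_segments(passes: list, strategy: str = "confidence_weighted") -> list:
    """Pick, per segment position, the candidate with the highest confidence."""
    if strategy != "confidence_weighted":
        raise ValueError(f"Unknown strategy: {strategy}")
    merged = []
    # zip(*passes) yields one tuple of candidate segments per position
    for candidates in zip(*passes):
        merged.append(max(candidates, key=lambda seg: seg["confidence"]))
    return merged
```

A fuller implementation would align segments by timestamp rather than index, since varying temperature and beam size across passes can shift segment boundaries.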

v4: Speaker Diarization (Week 6+)

async def transcribe_v4(audio_path: Path) -> dict:
    transcript = await transcribe_v3(audio_path)
    diarization = await diarize_audio(audio_path, max_speakers=10)
    labeled_transcript = await assign_speakers(transcript, diarization)
    
    return {
        "transcript": labeled_transcript,
        "speakers": await create_speaker_profiles(audio_path, diarization),
        "diarization": diarization,
        "version": "v4"
    }
  • Targets: 90% speaker accuracy, <30s processing, max 10 speakers
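`assign_speakers` is referenced above but not defined. A plausible sketch that labels each transcript segment with the diarization turn it overlaps most; the segment and turn schemas (`start`/`end` seconds, `speaker` labels) are assumptions modeled on typical diarization output:

```python
def assign_speakers(segments: list, turns: list) -> list:
    """Label transcript segments with the most-overlapping speaker turn."""
    labeled = []
    for seg in segments:
        best, best_overlap = "unknown", 0.0
        for turn in turns:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end];
            # negative means the intervals don't intersect
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled
```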

Success Metrics

Technical KPIs

| Metric             | v1 Target | v2 Target | v3 Target | v4 Target |
|--------------------|-----------|-----------|-----------|-----------|
| Accuracy           | 95%       | 99%       | 99.5%     | 99.5%     |
| Speed (5min audio) | <30s      | <35s      | <25s      | <30s      |
| Batch capacity     | 10 files  | 50 files  | 100 files | 100 files |
| Memory usage       | <2GB      | <2GB      | <3GB      | <4GB      |
| Error rate         | <5%       | <3%       | <1%       | <1%       |

Business KPIs

  • Adoption: Active usage by Week 4
  • Reliability: 99% success rate after v2
  • Performance: 3x faster than YouTube Summarizer
  • Cost: <$0.01 per transcript with caching
  • Scale: Handle 1000+ files/day by v3

Development Workflow

Phase 1: Foundation (Weeks 1-2)

  • PostgreSQL database operational
  • Basic Whisper transcription working
  • Batch processing for 10+ files
  • JSON/TXT export functional
  • CLI with basic commands

Phase 2: Enhancement (Week 3)

  • DeepSeek integration complete
  • Enhancement templates working
  • Progress tracking implemented
  • Quality validation checks

Phase 3: Optimization (Weeks 4-5)

  • Multi-pass implementation
  • Confidence scoring system
  • Performance metrics dashboard
  • Batch optimization

Phase 4: Advanced Features (Week 6+)

  • Speaker diarization working
  • Voice embedding database
  • Caching layer operational

Last Updated: 2024
Patterns Version: 1.0