trax/docs/architecture/development-patterns.md

Development Patterns & Historical Learnings

Overview

This document captures the key architectural patterns and lessons learned from the YouTube Summarizer project that should be applied to Trax development.

Successful Patterns to Implement

1. Download-First Architecture

Why it worked: Prevents streaming failures, enables retry without re-downloading, allows offline processing.

# ALWAYS download media before processing - never stream
from pathlib import Path

async def acquire_media(source: str) -> Path:
    if source.startswith(('http://', 'https://')):
        return await download_media(source)  # Download to temp file
    elif Path(source).exists():
        return Path(source)
    else:
        raise ValueError(f"Invalid source: {source}")
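`download_media` is left undefined in the snippet above; here is a minimal stdlib-only sketch of it. A production version would more likely use aiohttp or yt-dlp and stream the body in chunks rather than reading it whole, so treat this purely as an illustration of "download to a temp file, return the path":

```python
import asyncio
import tempfile
import urllib.parse
import urllib.request
from pathlib import Path

async def download_media(url: str) -> Path:
    """Fetch a URL to a local temp file and return its path."""
    def _fetch() -> Path:
        # Preserve the source extension so downstream tools can sniff format
        suffix = Path(urllib.parse.urlparse(url).path).suffix or ".bin"
        with urllib.request.urlopen(url) as resp, tempfile.NamedTemporaryFile(
            suffix=suffix, delete=False
        ) as tmp:
            tmp.write(resp.read())
            return Path(tmp.name)

    # Run the blocking fetch off the event loop
    return await asyncio.to_thread(_fetch)
```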

2. Multi-Layer Caching System (90% cost reduction)

Why it worked: Different data has different lifespans, and aggressive caching reduces API costs dramatically.

# Different TTLs for different data types
cache_layers = {
    'embedding': 86400,      # 24h - embeddings are stable
    'analysis': 604800,      # 7d - multi-agent results
    'query': 21600,          # 6h - RAG query results
    'prompt': 2592000        # 30d - prompt complexity
}
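The TTL map above can be wrapped in a small layered cache. This is an in-memory sketch (the `LayeredCache` name and dict storage are illustrative; the real system would more plausibly sit on Redis or disk, with LZ4 compression as noted later):

```python
import time
from typing import Any, Optional

class LayeredCache:
    """Per-layer TTL cache: each layer expires on its own schedule."""

    def __init__(self, ttls: dict):
        self._ttls = ttls                      # layer name -> TTL in seconds
        self._store = {}                       # (layer, key) -> (expiry, value)

    def set(self, layer: str, key: str, value: Any) -> None:
        expiry = time.monotonic() + self._ttls[layer]
        self._store[(layer, key)] = (expiry, value)

    def get(self, layer: str, key: str) -> Optional[Any]:
        entry = self._store.get((layer, key))
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() > expiry:
            del self._store[(layer, key)]      # lazily evict expired entries
            return None
        return value
```

A call site would look like `cache.set('embedding', text_hash, vector)` followed later by `cache.get('embedding', text_hash)`, falling through to the API only on a miss.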

3. Protocol-Based Services

Why it worked: Easy swapping of implementations, maximum refactorability.

from pathlib import Path
from typing import Protocol

class TranscriptionProtocol(Protocol):
    async def transcribe(self, audio: Path) -> dict:
        ...

# Easy swapping of implementations - no inheritance required,
# FasterWhisperService satisfies the protocol structurally
class FasterWhisperService:
    async def transcribe(self, audio: Path) -> dict:
        # Implementation
        pass

4. Database Registry Pattern

Why it worked: Prevents SQLAlchemy "multiple classes" errors, centralized model registration.

# Prevents SQLAlchemy "multiple classes" errors
from sqlalchemy.orm import declarative_base

class DatabaseRegistry:
    _instance = None
    _base = None
    _models = {}

    @classmethod
    def get_base(cls):
        if cls._base is None:
            cls._base = declarative_base()
        return cls._base

5. Real Files Testing

Why it worked: Caught actual edge cases, more reliable than mocks.

# Use actual audio files in tests - no mocks
from pathlib import Path

import pytest

@pytest.fixture
def sample_audio_5s():
    return Path("tests/fixtures/audio/sample_5s.wav")  # Real file

@pytest.fixture
def sample_video_2m():
    return Path("tests/fixtures/audio/sample_2m.mp4")  # Real file

6. JSON + TXT Export Strategy

Why it worked: Reduced complexity by 80%; JSON carries the structure, TXT stays human-readable.

# Export strategy
import json

def export_transcript(transcript: dict, format: str) -> str:
    if format == "json":
        return json.dumps(transcript, indent=2)
    elif format == "txt":
        return format_as_text(transcript)
    else:
        # Generate other formats from JSON
        return generate_from_json(transcript, format)

7. Enhanced CLI with Progress Reporting

Why it works: Provides real-time feedback, performance monitoring, and user-friendly error handling.

# Rich progress bars with real-time monitoring
from rich.console import Console
from rich.progress import (
    BarColumn,
    Progress,
    TaskProgressColumn,
    TextColumn,
    TimeRemainingColumn,
)

console = Console()

with Progress(
    TextColumn("[bold blue]{task.description}"),
    BarColumn(),
    TaskProgressColumn(),
    TimeRemainingColumn(),
) as progress:
    task = progress.add_task("Transcribing...", total=100)

    # Real-time performance monitoring
    stats = get_performance_stats()
    console.print(f"CPU: {stats['cpu_percent']}% | Memory: {stats['memory_used_gb']}GB")

8. Protocol-Based CLI Architecture

Why it works: Enables easy testing, modular design, and service integration.

from typing import Optional, Protocol

class TranscriptionCommandProtocol(Protocol):
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        ...

class EnhancedTranscribeCommand:
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        # Implementation with progress reporting
        pass

Failed Patterns to Avoid

1. Streaming Processing

  • Network interruptions cause failures
  • Can't retry without full re-download
  • Much slower than local processing

2. Mock-Heavy Testing

  • Mocked services behave differently
  • Don't catch real edge cases
  • False confidence in tests

3. Parallel Frontend/Backend Development

  • Caused integration issues
  • Sequential development proved superior
  • Get data layer right before UI

4. Complex Export Formats

  • Multi-format system was high maintenance
  • JSON + TXT backup strategy worked best
  • Reduced complexity by 80%

5. Multiple Transcript Sources

  • Tried to merge YouTube captions with Whisper
  • Added complexity without quality improvement
  • Single source (Whisper) proved more reliable

Performance Optimizations That Worked

M3 Optimization

  • Model: distil-large-v3 (20-70x speed improvement)
  • Compute: int8_float32 for CPU optimization
  • Chunking: 10-minute segments with overlap
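The 10-minute chunking reduces to a pure boundary computation. A sketch follows; the document doesn't state the overlap length, so the 30-second default here is an assumption, as is the helper name:

```python
def chunk_bounds(duration_s: float, chunk_s: int = 600, overlap_s: int = 30) -> list:
    """Return (start, end) second offsets for overlapping chunks.

    Each chunk starts (chunk_s - overlap_s) after the previous one, so
    consecutive chunks share overlap_s seconds of audio for stitching
    transcripts back together without losing words at the seams.
    """
    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break
        start += step
    return bounds
```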

Audio Preprocessing

  • Sample Rate: 16kHz conversion (3x data reduction)
  • Channels: Mono conversion (2x data reduction)
  • VAD: Voice Activity Detection to skip silence
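The sample-rate and channel conversions map directly onto ffmpeg's `-ar` and `-ac` flags. The helper below only builds the command line (the wrapper name is illustrative; VAD is a separate model-based step and is not shown):

```python
from pathlib import Path

def preprocess_cmd(src: Path, dst: Path) -> list:
    """Build the ffmpeg invocation for 16kHz mono conversion.

    -ar 16000  resample to 16 kHz (3x data reduction from 48 kHz)
    -ac 1      downmix to mono (2x data reduction from stereo)
    -y         overwrite the destination if it exists
    """
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]
```

Run it with e.g. `subprocess.run(preprocess_cmd(src, dst), check=True)`.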

Caching Strategy

  • Embeddings: 24h TTL (stable for long periods)
  • Analysis: 7d TTL (expensive multi-agent results)
  • Queries: 6h TTL (RAG results)
  • Compression: LZ4 for storage efficiency

Critical Success Factors

For AI Code Generation Consistency

  1. Explicit Rules File: Like DATABASE_MODIFICATION_CHECKLIST.md
  2. Approval Gates: Each major change requires permission
  3. Test-First: Write test, then implementation
  4. Single Responsibility: One task at a time
  5. Context Limits: Keep docs under 600 LOC

For Media Processing Reliability

  1. Always Download First: Never stream
  2. Standardize Early: Convert to 16kHz mono WAV
  3. Chunk Large Files: 10-minute segments with overlap
  4. Cache Aggressively: Transcriptions are expensive
  5. Simple Formats: JSON + TXT only

For Project Success

  1. Backend-First: Get data layer right
  2. CLI Before GUI: Test via command line
  3. Modular Services: Each service independent
  4. Progressive Enhancement: Start simple, add features
  5. Document Decisions: Track why choices were made

Iterative Pipeline Architecture

Version Progression (v1 → v2 → v3 → v4)

v1: Basic Transcription (Weeks 1-2)

async def transcribe_v1(audio_path: Path) -> dict:
    # Preprocess to 16kHz mono
    processed = await preprocess_audio(audio_path)
    
    # Whisper with M3 optimizations
    transcript = await whisper.transcribe(
        processed,
        model="distil-large-v3",
        compute_type="int8_float32"
    )
    
    return format_transcript_json(transcript)
  • Targets: 95% accuracy, <30s for 5min audio, <2GB memory

v2: AI Enhancement (Week 3)

async def transcribe_v2(audio_path: Path) -> dict:
    transcript = await transcribe_v1(audio_path)
    enhanced = await enhance_with_ai(transcript, model="deepseek-chat")
    
    return {
        "raw": transcript,
        "enhanced": enhanced,
        "version": "v2"
    }
  • Targets: 99% accuracy, <35s processing, <5s enhancement time

v3: Multi-Pass Accuracy (Weeks 4-5)

async def transcribe_v3(audio_path: Path) -> dict:
    passes = []
    for i in range(3):
        transcript = await transcribe_v1_with_params(
            audio_path,
            temperature=0.0 + (0.2 * i),
            beam_size=2 + i
        )
        passes.append(transcript)
    
    merged = await merge_transcripts(passes, strategy="confidence_weighted")
    enhanced = await enhance_with_ai(merged)
    
    return {
        "raw": merged,
        "enhanced": enhanced,
        "passes": passes,
        "confidence_scores": calculate_confidence(passes),
        "version": "v3"
    }
  • Targets: 99.5% accuracy, <25s processing, confidence scores
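`merge_transcripts` with `strategy="confidence_weighted"` is referenced above but not defined. A minimal synchronous sketch, under two assumptions not stated in the original: passes are segment-aligned (same segment count and order), and each segment dict carries a `confidence` field:

```python
def merge_segments(passes: list, strategy: str = "confidence_weighted") -> list:
    """Pick, per segment position, the candidate with the highest confidence."""
    if strategy != "confidence_weighted":
        raise ValueError(f"Unknown strategy: {strategy}")
    merged = []
    # zip(*passes) yields one tuple of candidate segments per position
    for candidates in zip(*passes):
        merged.append(max(candidates, key=lambda seg: seg["confidence"]))
    return merged
```

A fuller implementation would align segments by timestamp rather than index, since varying temperature and beam size across passes can shift segment boundaries.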

v4: Speaker Diarization (Week 6+)

async def transcribe_v4(audio_path: Path) -> dict:
    transcript = await transcribe_v3(audio_path)
    diarization = await diarize_audio(audio_path, max_speakers=10)
    labeled_transcript = await assign_speakers(transcript, diarization)
    
    return {
        "transcript": labeled_transcript,
        "speakers": await create_speaker_profiles(audio_path, diarization),
        "diarization": diarization,
        "version": "v4"
    }
  • Targets: 90% speaker accuracy, <30s processing, max 10 speakers
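`assign_speakers` is referenced above but not defined. A plausible sketch that labels each transcript segment with the diarization turn it overlaps most; the segment and turn schemas (`start`/`end` seconds, `speaker` labels) are assumptions modeled on typical diarization output:

```python
def assign_speakers(segments: list, turns: list) -> list:
    """Label transcript segments with the most-overlapping speaker turn."""
    labeled = []
    for seg in segments:
        best, best_overlap = "unknown", 0.0
        for turn in turns:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end];
            # negative means the intervals don't intersect
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled
```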

Success Metrics

Technical KPIs

| Metric             | v1 Target | v2 Target | v3 Target | v4 Target |
|--------------------|-----------|-----------|-----------|-----------|
| Accuracy           | 95%       | 99%       | 99.5%     | 99.5%     |
| Speed (5min audio) | <30s      | <35s      | <25s      | <30s      |
| Batch capacity     | 10 files  | 50 files  | 100 files | 100 files |
| Memory usage       | <2GB      | <2GB      | <3GB      | <4GB      |
| Error rate         | <5%       | <3%       | <1%       | <1%       |

Business KPIs

  • Adoption: Active usage by Week 4
  • Reliability: 99% success rate after v2
  • Performance: 3x faster than YouTube Summarizer
  • Cost: <$0.01 per transcript with caching
  • Scale: Handle 1000+ files/day by v3

Development Workflow

Phase 1: Foundation (Weeks 1-2)

  • PostgreSQL database operational
  • Basic Whisper transcription working
  • Batch processing for 10+ files
  • JSON/TXT export functional
  • CLI with basic commands

Phase 2: Enhancement (Week 3)

  • DeepSeek integration complete
  • Enhancement templates working
  • Progress tracking implemented
  • Quality validation checks

Phase 3: Optimization (Weeks 4-5)

  • Multi-pass implementation
  • Confidence scoring system
  • Performance metrics dashboard
  • Batch optimization

Phase 4: Advanced Features (Week 6+)

  • Speaker diarization working
  • Voice embedding database
  • Caching layer operational

Last Updated: 2024
Patterns Version: 1.0