# Development Patterns & Historical Learnings
## Overview
This document captures the key architectural patterns and lessons learned from the YouTube Summarizer project that should be applied to Trax development.
## Successful Patterns to Implement
### 1. Download-First Architecture
**Why it worked**: Prevents streaming failures, enables retry without re-downloading, allows offline processing.
```python
# ALWAYS download media before processing - never stream
async def acquire_media(source: str) -> Path:
    if source.startswith(('http://', 'https://')):
        return await download_media(source)  # Download to temp file
    elif Path(source).exists():
        return Path(source)
    else:
        raise ValueError(f"Invalid source: {source}")
```
### 2. Multi-Layer Caching System (90% cost reduction)
**Why it worked**: Different data has different lifespans, aggressive caching reduces API costs dramatically.
```python
# Different TTLs (in seconds) for different data types
cache_layers = {
    'embedding': 86400,   # 24h - embeddings are stable
    'analysis': 604800,   # 7d - multi-agent results
    'query': 21600,       # 6h - RAG query results
    'prompt': 2592000,    # 30d - prompt complexity
}
```
### 3. Protocol-Based Services
**Why it worked**: Easy swapping of implementations, maximum refactorability.
```python
from pathlib import Path
from typing import Protocol

class TranscriptionProtocol(Protocol):
    async def transcribe(self, audio: Path) -> dict:
        ...

# Easy swapping of implementations
class FasterWhisperService:
    async def transcribe(self, audio: Path) -> dict:
        # Implementation
        ...
```
### 4. Database Registry Pattern
**Why it worked**: Prevents SQLAlchemy "multiple classes" errors, centralized model registration.
```python
from sqlalchemy.orm import declarative_base

# Prevents SQLAlchemy "multiple classes" errors
class DatabaseRegistry:
    _instance = None
    _base = None
    _models = {}

    @classmethod
    def get_base(cls):
        if cls._base is None:
            cls._base = declarative_base()
        return cls._base
```
### 5. Real Files Testing
**Why it worked**: Caught actual edge cases, more reliable than mocks.
```python
from pathlib import Path

import pytest

# Use actual audio files in tests - no mocks
@pytest.fixture
def sample_audio_5s():
    return Path("tests/fixtures/audio/sample_5s.wav")  # Real file

@pytest.fixture
def sample_video_2m():
    return Path("tests/fixtures/audio/sample_2m.mp4")  # Real file
```
### 6. JSON + TXT Export Strategy
**Why it worked**: Reduced complexity by 80%, JSON for structure, TXT for human readability.
```python
import json

# Export strategy: JSON is canonical, other formats derive from it
def export_transcript(transcript: dict, format: str) -> str:
    if format == "json":
        return json.dumps(transcript, indent=2)
    elif format == "txt":
        return format_as_text(transcript)
    else:
        # Generate other formats from JSON
        return generate_from_json(transcript, format)
```
### 7. Enhanced CLI with Progress Reporting
**Why it works**: Provides real-time feedback, performance monitoring, and user-friendly error handling.
```python
from rich.console import Console
from rich.progress import (
    Progress, TextColumn, BarColumn,
    TaskProgressColumn, TimeRemainingColumn,
)

console = Console()

# Rich progress bars with real-time monitoring
with Progress(
    TextColumn("[bold blue]{task.description}"),
    BarColumn(),
    TaskProgressColumn(),
    TimeRemainingColumn(),
) as progress:
    task = progress.add_task("Transcribing...", total=100)
    # Real-time performance monitoring
    stats = get_performance_stats()
    console.print(f"CPU: {stats['cpu_percent']}% | Memory: {stats['memory_used_gb']}GB")
```
### 8. Protocol-Based CLI Architecture
**Why it works**: Enables easy testing, modular design, and service integration.
```python
from typing import Optional, Protocol

class TranscriptionCommandProtocol(Protocol):
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        ...

class EnhancedTranscribeCommand:
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        # Implementation with progress reporting
        ...
```
## Failed Patterns to Avoid
### 1. Streaming Processing
- Network interruptions cause failures
- Can't retry without full re-download
- Much slower than local processing
### 2. Mock-Heavy Testing
- Mocked services behave differently
- Don't catch real edge cases
- False confidence in tests
### 3. Parallel Frontend/Backend Development
- Caused integration issues
- Sequential development proved superior
- Get data layer right before UI
### 4. Complex Export Formats
- Multi-format system was high maintenance
- JSON + TXT backup strategy worked best
- Reduced complexity by 80%
### 5. Multiple Transcript Sources
- Tried to merge YouTube captions with Whisper
- Added complexity without quality improvement
- Single source (Whisper) proved more reliable
## Performance Optimizations That Worked
### M3 Optimization
- **Model**: distil-large-v3 (20-70x speed improvement)
- **Compute**: int8_float32 for CPU optimization
- **Chunking**: 10-minute segments with overlap
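The model settings above are passed straight to the Whisper loader; the chunking rule can be sketched as a pure function. This is a minimal sketch: the 10-minute window comes from the document, but the 10-second overlap is an illustrative assumption (the overlap length is not specified here).

```python
# Assumed M3 model settings (names mirror faster-whisper / CTranslate2 options)
M3_SETTINGS = {
    "model": "distil-large-v3",
    "compute_type": "int8_float32",  # int8 weights, float32 accumulation on CPU
}

def chunk_boundaries(duration_s: float, chunk_s: float = 600.0,
                     overlap_s: float = 10.0) -> list[tuple[float, float]]:
    """Split an audio duration into fixed windows that overlap their neighbors."""
    boundaries = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        boundaries.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # next chunk re-reads the overlap region
    return boundaries
```

The overlap lets segment boundaries be stitched without dropping words cut mid-sentence; downstream merging deduplicates the overlapping region.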
### Audio Preprocessing
- **Sample Rate**: 16kHz conversion (3x data reduction)
- **Channels**: Mono conversion (2x data reduction)
- **VAD**: Voice Activity Detection to skip silence
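The resample and downmix steps above map directly onto an ffmpeg invocation. A minimal sketch, assuming ffmpeg is on PATH (VAD is not shown here, since it typically runs inside the transcription engine rather than in preprocessing):

```python
from pathlib import Path

def preprocess_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that standardizes audio to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",     # overwrite output if it exists
        "-i", str(src),
        "-ar", "16000",     # resample to 16 kHz (3x data reduction)
        "-ac", "1",         # downmix to mono (2x data reduction)
        "-vn",              # drop any video stream
        str(dst),
    ]
```

Run it with `subprocess.run(preprocess_cmd(src, dst), check=True)`.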
### Caching Strategy
- **Embeddings**: 24h TTL (stable for long periods)
- **Analysis**: 7d TTL (expensive multi-agent results)
- **Queries**: 6h TTL (RAG results)
- **Compression**: LZ4 for storage efficiency
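The per-layer TTLs above can be sketched as a small in-memory cache. This is an illustrative sketch only: the real system presumably persists entries and LZ4-compresses values, neither of which is shown here.

```python
import time

class LayeredCache:
    """Minimal in-memory sketch of per-layer TTL caching (no persistence, no LZ4)."""

    TTLS = {"embedding": 86400, "analysis": 604800, "query": 21600, "prompt": 2592000}

    def __init__(self):
        self._store = {}  # (layer, key) -> (expires_at, value)

    def set(self, layer: str, key: str, value):
        self._store[(layer, key)] = (time.time() + self.TTLS[layer], value)

    def get(self, layer: str, key: str):
        entry = self._store.get((layer, key))
        if entry is None or entry[0] < time.time():
            return None  # missing or expired
        return entry[1]
```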
## Critical Success Factors
### For AI Code Generation Consistency
1. **Explicit Rules File**: Like DATABASE_MODIFICATION_CHECKLIST.md
2. **Approval Gates**: Each major change requires permission
3. **Test-First**: Write test, then implementation
4. **Single Responsibility**: One task at a time
5. **Context Limits**: Keep docs under 600 LOC
### For Media Processing Reliability
1. **Always Download First**: Never stream
2. **Standardize Early**: Convert to 16kHz mono WAV
3. **Chunk Large Files**: 10-minute segments with overlap
4. **Cache Aggressively**: Transcriptions are expensive
5. **Simple Formats**: JSON + TXT only
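The checklist above can be read as one explicit pipeline. In this sketch every callable is a hypothetical stand-in injected by the caller, not an API from this document; it only fixes the ordering and the cache-before-transcribe rule.

```python
def media_pipeline(source: str, *, download, standardize, chunk, transcribe, cache) -> list:
    """Checklist as a sequence: download -> standardize -> chunk -> cached transcribe."""
    path = download(source)        # 1. always download first, never stream
    wav = standardize(path)        # 2. convert to 16 kHz mono WAV early
    results = []
    for segment in chunk(wav):     # 3. 10-minute segments with overlap
        result = cache.get(segment)
        if result is None:         # 4. cache aggressively: transcription is expensive
            result = transcribe(segment)
            cache[segment] = result
        results.append(result)
    return results                 # 5. export stays simple: JSON + TXT only
```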
### For Project Success
1. **Backend-First**: Get data layer right
2. **CLI Before GUI**: Test via command line
3. **Modular Services**: Each service independent
4. **Progressive Enhancement**: Start simple, add features
5. **Document Decisions**: Track why choices were made
## Iterative Pipeline Architecture
### Version Progression (v1 → v2 → v3 → v4)
**v1: Basic Transcription (Weeks 1-2)**
```python
async def transcribe_v1(audio_path: Path) -> dict:
    # Preprocess to 16kHz mono
    processed = await preprocess_audio(audio_path)
    # Whisper with M3 optimizations
    transcript = await whisper.transcribe(
        processed,
        model="distil-large-v3",
        compute_type="int8_float32",
    )
    return format_transcript_json(transcript)
```
- **Targets**: 95% accuracy, <30s for 5min audio, <2GB memory
**v2: AI Enhancement (Week 3)**
```python
async def transcribe_v2(audio_path: Path) -> dict:
    transcript = await transcribe_v1(audio_path)
    enhanced = await enhance_with_ai(transcript, model="deepseek-chat")
    return {
        "raw": transcript,
        "enhanced": enhanced,
        "version": "v2",
    }
```
- **Targets**: 99% accuracy, <35s processing, <5s enhancement time
**v3: Multi-Pass Accuracy (Weeks 4-5)**
```python
async def transcribe_v3(audio_path: Path) -> dict:
    passes = []
    for i in range(3):
        transcript = await transcribe_v1_with_params(
            audio_path,
            temperature=0.0 + (0.2 * i),
            beam_size=2 + i,
        )
        passes.append(transcript)
    merged = await merge_transcripts(passes, strategy="confidence_weighted")
    enhanced = await enhance_with_ai(merged)
    return {
        "raw": merged,
        "enhanced": enhanced,
        "passes": passes,
        "confidence_scores": calculate_confidence(passes),
        "version": "v3",
    }
```
- **Targets**: 99.5% accuracy, <25s processing, confidence scores
**v4: Speaker Diarization (Week 6+)**
```python
async def transcribe_v4(audio_path: Path) -> dict:
    transcript = await transcribe_v3(audio_path)
    diarization = await diarize_audio(audio_path, max_speakers=10)
    labeled_transcript = await assign_speakers(transcript, diarization)
    return {
        "transcript": labeled_transcript,
        "speakers": await create_speaker_profiles(audio_path, diarization),
        "diarization": diarization,
        "version": "v4",
    }
```
- **Targets**: 90% speaker accuracy, <30s processing, max 10 speakers
## Success Metrics
### Technical KPIs
| Metric | v1 Target | v2 Target | v3 Target | v4 Target |
|--------|-----------|-----------|-----------|-----------|
| **Accuracy** | 95% | 99% | 99.5% | 99.5% |
| **Speed (5min audio)** | <30s | <35s | <25s | <30s |
| **Batch capacity** | 10 files | 50 files | 100 files | 100 files |
| **Memory usage** | <2GB | <2GB | <3GB | <4GB |
| **Error rate** | <5% | <3% | <1% | <1% |
### Business KPIs
- **Adoption**: Active usage by Week 4
- **Reliability**: 99% success rate after v2
- **Performance**: 3x faster than YouTube Summarizer
- **Cost**: <$0.01 per transcript with caching
- **Scale**: Handle 1000+ files/day by v3
## Development Workflow
### Phase 1: Foundation (Weeks 1-2)
- PostgreSQL database operational
- Basic Whisper transcription working
- Batch processing for 10+ files
- JSON/TXT export functional
- CLI with basic commands
### Phase 2: Enhancement (Week 3)
- DeepSeek integration complete
- Enhancement templates working
- Progress tracking implemented
- Quality validation checks
### Phase 3: Optimization (Weeks 4-5)
- Multi-pass implementation
- Confidence scoring system
- Performance metrics dashboard
- Batch optimization
### Phase 4: Advanced Features (Week 6+)
- Speaker diarization working
- Voice embedding database
- Caching layer operational
---
*Last Updated: 2024*
*Patterns Version: 1.0*