# Development Patterns & Historical Learnings
## Overview
This document captures the key architectural patterns and lessons learned from the YouTube Summarizer project that should be applied to Trax development.
## Successful Patterns to Implement
### 1. Download-First Architecture
**Why it worked**: Prevents streaming failures, enables retry without re-downloading, allows offline processing.
```python
# ALWAYS download media before processing - never stream
async def acquire_media(source: str) -> Path:
    if source.startswith(('http://', 'https://')):
        return await download_media(source)  # Download to temp file
    elif Path(source).exists():
        return Path(source)
    else:
        raise ValueError(f"Invalid source: {source}")
```
### 2. Multi-Layer Caching System (90% cost reduction)
**Why it worked**: Different data has different lifespans, aggressive caching reduces API costs dramatically.
```python
# Different TTLs (in seconds) for different data types
cache_layers = {
    'embedding': 86400,   # 24h - embeddings are stable
    'analysis': 604800,   # 7d - multi-agent results
    'query': 21600,       # 6h - RAG query results
    'prompt': 2592000,    # 30d - prompt complexity
}
```
### 3. Protocol-Based Services
**Why it worked**: Easy swapping of implementations, maximum refactorability.
```python
from pathlib import Path
from typing import Protocol

class TranscriptionProtocol(Protocol):
    async def transcribe(self, audio: Path) -> dict:
        ...

# Easy swapping of implementations
class FasterWhisperService:
    async def transcribe(self, audio: Path) -> dict:
        # Implementation
        ...
```
### 4. Database Registry Pattern
**Why it worked**: Prevents SQLAlchemy "multiple classes" errors, centralized model registration.
```python
from sqlalchemy.orm import declarative_base

# Prevents SQLAlchemy "multiple classes" errors
class DatabaseRegistry:
    _instance = None
    _base = None
    _models = {}

    @classmethod
    def get_base(cls):
        if cls._base is None:
            cls._base = declarative_base()
        return cls._base
```
### 5. Real Files Testing
**Why it worked**: Caught actual edge cases, more reliable than mocks.
```python
from pathlib import Path

import pytest

# Use actual audio files in tests - no mocks
@pytest.fixture
def sample_audio_5s():
    return Path("tests/fixtures/audio/sample_5s.wav")  # Real file

@pytest.fixture
def sample_video_2m():
    return Path("tests/fixtures/audio/sample_2m.mp4")  # Real file
```
### 6. JSON + TXT Export Strategy
**Why it worked**: Reduced complexity by 80%, JSON for structure, TXT for human readability.
```python
import json

# Export strategy: JSON is canonical, other formats derive from it
def export_transcript(transcript: dict, format: str) -> str:
    if format == "json":
        return json.dumps(transcript, indent=2)
    elif format == "txt":
        return format_as_text(transcript)
    else:
        # Generate other formats from JSON
        return generate_from_json(transcript, format)
```
### 7. Enhanced CLI with Progress Reporting
**Why it works**: Provides real-time feedback, performance monitoring, and user-friendly error handling.
```python
from rich.console import Console
from rich.progress import (
    Progress, TextColumn, BarColumn,
    TaskProgressColumn, TimeRemainingColumn,
)

console = Console()

# Rich progress bars with real-time monitoring
with Progress(
    TextColumn("[bold blue]{task.description}"),
    BarColumn(),
    TaskProgressColumn(),
    TimeRemainingColumn(),
) as progress:
    task = progress.add_task("Transcribing...", total=100)
    # Real-time performance monitoring
    stats = get_performance_stats()
    console.print(f"CPU: {stats['cpu_percent']}% | Memory: {stats['memory_used_gb']}GB")
```
### 8. Protocol-Based CLI Architecture
**Why it works**: Enables easy testing, modular design, and service integration.
```python
from typing import Optional, Protocol

class TranscriptionCommandProtocol(Protocol):
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        ...

class EnhancedTranscribeCommand:
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        # Implementation with progress reporting
        ...
```
## Failed Patterns to Avoid
### 1. Streaming Processing
- Network interruptions cause failures
- Can't retry without full re-download
- Much slower than local processing
### 2. Mock-Heavy Testing
- Mocked services behave differently
- Don't catch real edge cases
- False confidence in tests
### 3. Parallel Frontend/Backend Development
- Caused integration issues
- Sequential development proved superior
- Get data layer right before UI
### 4. Complex Export Formats
- Multi-format system was high maintenance
- JSON + TXT backup strategy worked best
- Reduced complexity by 80%
### 5. Multiple Transcript Sources
- Tried to merge YouTube captions with Whisper
- Added complexity without quality improvement
- Single source (Whisper) proved more reliable
## Performance Optimizations That Worked
### M3 Optimization
- **Model**: distil-large-v3 (20-70x speed improvement)
- **Compute**: int8_float32 for CPU optimization
- **Chunking**: 10-minute segments with overlap
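The model settings above are passed straight to the Whisper loader; the chunking rule can be sketched as a pure function. This is a minimal sketch: the 10-minute window comes from the document, but the 10-second overlap is an illustrative assumption (the overlap length is not specified here).

```python
# Assumed M3 model settings (names mirror faster-whisper / CTranslate2 options)
M3_SETTINGS = {
    "model": "distil-large-v3",
    "compute_type": "int8_float32",  # int8 weights, float32 accumulation on CPU
}

def chunk_boundaries(duration_s: float, chunk_s: float = 600.0,
                     overlap_s: float = 10.0) -> list[tuple[float, float]]:
    """Split an audio duration into fixed windows that overlap their neighbors."""
    boundaries = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        boundaries.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # next chunk re-reads the overlap region
    return boundaries
```

The overlap lets segment boundaries be stitched without dropping words cut mid-sentence; downstream merging deduplicates the overlapping region.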
### Audio Preprocessing
- **Sample Rate**: 16kHz conversion (3x data reduction)
- **Channels**: Mono conversion (2x data reduction)
- **VAD**: Voice Activity Detection to skip silence
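The resample and downmix steps above map directly onto an ffmpeg invocation. A minimal sketch, assuming ffmpeg is on PATH (VAD is not shown here, since it typically runs inside the transcription engine rather than in preprocessing):

```python
from pathlib import Path

def preprocess_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that standardizes audio to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",     # overwrite output if it exists
        "-i", str(src),
        "-ar", "16000",     # resample to 16 kHz (3x data reduction)
        "-ac", "1",         # downmix to mono (2x data reduction)
        "-vn",              # drop any video stream
        str(dst),
    ]
```

Run it with `subprocess.run(preprocess_cmd(src, dst), check=True)`.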
### Caching Strategy
- **Embeddings**: 24h TTL (stable for long periods)
- **Analysis**: 7d TTL (expensive multi-agent results)
- **Queries**: 6h TTL (RAG results)
- **Compression**: LZ4 for storage efficiency
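The per-layer TTLs above can be sketched as a small in-memory cache. This is an illustrative sketch only: the real system presumably persists entries and LZ4-compresses values, neither of which is shown here.

```python
import time

class LayeredCache:
    """Minimal in-memory sketch of per-layer TTL caching (no persistence, no LZ4)."""

    TTLS = {"embedding": 86400, "analysis": 604800, "query": 21600, "prompt": 2592000}

    def __init__(self):
        self._store = {}  # (layer, key) -> (expires_at, value)

    def set(self, layer: str, key: str, value):
        self._store[(layer, key)] = (time.time() + self.TTLS[layer], value)

    def get(self, layer: str, key: str):
        entry = self._store.get((layer, key))
        if entry is None or entry[0] < time.time():
            return None  # missing or expired
        return entry[1]
```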
## Critical Success Factors
### For AI Code Generation Consistency
1. **Explicit Rules File**: Like DATABASE_MODIFICATION_CHECKLIST.md
2. **Approval Gates**: Each major change requires permission
3. **Test-First**: Write test, then implementation
4. **Single Responsibility**: One task at a time
5. **Context Limits**: Keep docs under 600 LOC
### For Media Processing Reliability
1. **Always Download First**: Never stream
2. **Standardize Early**: Convert to 16kHz mono WAV
3. **Chunk Large Files**: 10-minute segments with overlap
4. **Cache Aggressively**: Transcriptions are expensive
5. **Simple Formats**: JSON + TXT only
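The checklist above can be read as one explicit pipeline. In this sketch every callable is a hypothetical stand-in injected by the caller, not an API from this document; it only fixes the ordering and the cache-before-transcribe rule.

```python
def media_pipeline(source: str, *, download, standardize, chunk, transcribe, cache) -> list:
    """Checklist as a sequence: download -> standardize -> chunk -> cached transcribe."""
    path = download(source)        # 1. always download first, never stream
    wav = standardize(path)        # 2. convert to 16 kHz mono WAV early
    results = []
    for segment in chunk(wav):     # 3. 10-minute segments with overlap
        result = cache.get(segment)
        if result is None:         # 4. cache aggressively: transcription is expensive
            result = transcribe(segment)
            cache[segment] = result
        results.append(result)
    return results                 # 5. export stays simple: JSON + TXT only
```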
### For Project Success
1. **Backend-First**: Get data layer right
2. **CLI Before GUI**: Test via command line
3. **Modular Services**: Each service independent
4. **Progressive Enhancement**: Start simple, add features
5. **Document Decisions**: Track why choices were made
## Iterative Pipeline Architecture
### Version Progression (v1 → v2 → v3 → v4)
**v1: Basic Transcription (Weeks 1-2)**
```python
async def transcribe_v1(audio_path: Path) -> dict:
    # Preprocess to 16kHz mono
    processed = await preprocess_audio(audio_path)
    # Whisper with M3 optimizations
    transcript = await whisper.transcribe(
        processed,
        model="distil-large-v3",
        compute_type="int8_float32",
    )
    return format_transcript_json(transcript)
```
- **Targets**: 95% accuracy, <30s for 5min audio, <2GB memory
**v2: AI Enhancement (Week 3)**
```python
async def transcribe_v2(audio_path: Path) -> dict:
    transcript = await transcribe_v1(audio_path)
    enhanced = await enhance_with_ai(transcript, model="deepseek-chat")
    return {
        "raw": transcript,
        "enhanced": enhanced,
        "version": "v2",
    }
```
- **Targets**: 99% accuracy, <35s processing, <5s enhancement time
**v3: Multi-Pass Accuracy (Weeks 4-5)**
```python
async def transcribe_v3(audio_path: Path) -> dict:
    passes = []
    for i in range(3):
        transcript = await transcribe_v1_with_params(
            audio_path,
            temperature=0.0 + (0.2 * i),
            beam_size=2 + i,
        )
        passes.append(transcript)
    merged = await merge_transcripts(passes, strategy="confidence_weighted")
    enhanced = await enhance_with_ai(merged)
    return {
        "raw": merged,
        "enhanced": enhanced,
        "passes": passes,
        "confidence_scores": calculate_confidence(passes),
        "version": "v3",
    }
```
- **Targets**: 99.5% accuracy, <25s processing, confidence scores
**v4: Speaker Diarization (Week 6+)**
```python
async def transcribe_v4(audio_path: Path) -> dict:
    transcript = await transcribe_v3(audio_path)
    diarization = await diarize_audio(audio_path, max_speakers=10)
    labeled_transcript = await assign_speakers(transcript, diarization)
    return {
        "transcript": labeled_transcript,
        "speakers": await create_speaker_profiles(audio_path, diarization),
        "diarization": diarization,
        "version": "v4",
    }
```
- **Targets**: 90% speaker accuracy, <30s processing, max 10 speakers
## Success Metrics
### Technical KPIs
| Metric | v1 Target | v2 Target | v3 Target | v4 Target |
|--------|-----------|-----------|-----------|-----------|
| **Accuracy** | 95% | 99% | 99.5% | 99.5% |
| **Speed (5min audio)** | <30s | <35s | <25s | <30s |
| **Batch capacity** | 10 files | 50 files | 100 files | 100 files |
| **Memory usage** | <2GB | <2GB | <3GB | <4GB |
| **Error rate** | <5% | <3% | <1% | <1% |
### Business KPIs
- **Adoption**: Active usage by Week 4
- **Reliability**: 99% success rate after v2
- **Performance**: 3x faster than YouTube Summarizer
- **Cost**: <$0.01 per transcript with caching
- **Scale**: Handle 1000+ files/day by v3
## Development Workflow
### Phase 1: Foundation (Weeks 1-2)
- PostgreSQL database operational
- Basic Whisper transcription working
- Batch processing for 10+ files
- JSON/TXT export functional
- CLI with basic commands
### Phase 2: Enhancement (Week 3)
- DeepSeek integration complete
- Enhancement templates working
- Progress tracking implemented
- Quality validation checks
### Phase 3: Optimization (Weeks 4-5)
- Multi-pass implementation
- Confidence scoring system
- Performance metrics dashboard
- Batch optimization
### Phase 4: Advanced Features (Week 6+)
- Speaker diarization working
- Voice embedding database
- Caching layer operational
---
*Last Updated: 2024*
*Patterns Version: 1.0*