# Development Patterns & Historical Learnings
## Overview
This document captures the key architectural patterns and lessons learned from the YouTube Summarizer project that should be applied to Trax development.
## Successful Patterns to Implement
### 1. Download-First Architecture
**Why it worked**: Prevents streaming failures, enables retry without re-downloading, allows offline processing.
```python
# ALWAYS download media before processing - never stream
async def acquire_media(source: str) -> Path:
    if source.startswith(("http://", "https://")):
        return await download_media(source)  # Download to temp file
    elif Path(source).exists():
        return Path(source)
    else:
        raise ValueError(f"Invalid source: {source}")
```
### 2. Multi-Layer Caching System (90% cost reduction)
**Why it worked**: Different data has different lifespans, aggressive caching reduces API costs dramatically.
```python
# Different TTLs for different data types
cache_layers = {
    "embedding": 86400,   # 24h - embeddings are stable
    "analysis": 604800,   # 7d - multi-agent results
    "query": 21600,       # 6h - RAG query results
    "prompt": 2592000,    # 30d - prompt complexity
}
```
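The TTL table above implies a lookup layer around whatever store backs it. A minimal in-memory sketch, assuming a simple get/set interface (`LayeredCache` and its method names are illustrative, not the project's actual API; a real deployment would presumably sit on Redis or disk):

```python
import time

CACHE_TTLS = {"embedding": 86400, "analysis": 604800, "query": 21600, "prompt": 2592000}

class LayeredCache:
    """Per-layer TTLs; expired entries are treated as misses."""

    def __init__(self, ttls: dict[str, int], clock=time.time):
        self.ttls = ttls
        self.clock = clock  # injectable for testing
        self._store: dict[tuple[str, str], tuple[float, object]] = {}

    def set(self, layer: str, key: str, value) -> None:
        # Stamp each entry with its expiry time based on the layer's TTL
        self._store[(layer, key)] = (self.clock() + self.ttls[layer], value)

    def get(self, layer: str, key: str):
        entry = self._store.get((layer, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() > expires_at:  # expired -> evict and report a miss
            del self._store[(layer, key)]
            return None
        return value
```

The injectable `clock` keeps expiry behaviour testable without sleeping.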
### 3. Protocol-Based Services
**Why it worked**: Easy swapping of implementations, maximum refactorability.
```python
from pathlib import Path
from typing import Protocol

class TranscriptionProtocol(Protocol):
    async def transcribe(self, audio: Path) -> dict:
        ...

# Easy swapping of implementations
class FasterWhisperService:
    async def transcribe(self, audio: Path) -> dict:
        # Implementation
        ...
```
### 4. Database Registry Pattern
**Why it worked**: Prevents SQLAlchemy "multiple classes" errors, centralized model registration.
```python
from sqlalchemy.orm import declarative_base

# Prevents SQLAlchemy "multiple classes" errors
class DatabaseRegistry:
    _instance = None
    _base = None
    _models = {}

    @classmethod
    def get_base(cls):
        if cls._base is None:
            cls._base = declarative_base()
        return cls._base
```
### 5. Real Files Testing
**Why it worked**: Caught actual edge cases, more reliable than mocks.
```python
# Use actual audio files in tests - no mocks
@pytest.fixture
def sample_audio_5s():
    return Path("tests/fixtures/audio/sample_5s.wav")  # Real file

@pytest.fixture
def sample_video_2m():
    return Path("tests/fixtures/audio/sample_2m.mp4")  # Real file
```
### 6. JSON + TXT Export Strategy
**Why it worked**: Reduced complexity by 80%, JSON for structure, TXT for human readability.
```python
import json

# Export strategy
def export_transcript(transcript: dict, format: str) -> str:
    if format == "json":
        return json.dumps(transcript, indent=2)
    elif format == "txt":
        return format_as_text(transcript)
    else:
        # Generate other formats from JSON
        return generate_from_json(transcript, format)
```
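The `format_as_text` helper referenced above is not shown in the original. One plausible sketch, assuming each segment carries a start time in seconds (the transcript shape here is an assumption, not the project's actual schema):

```python
def format_as_text(transcript: dict) -> str:
    # Assumed shape: {"segments": [{"start": seconds, "text": str}, ...]}
    lines = []
    for seg in transcript.get("segments", []):
        # Render the start offset as a [MM:SS] prefix for human readability
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)
```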
### 7. Enhanced CLI with Progress Reporting
**Why it works**: Provides real-time feedback, performance monitoring, and user-friendly error handling.
```python
from rich.console import Console
from rich.progress import (
    BarColumn, Progress, TaskProgressColumn, TextColumn, TimeRemainingColumn,
)

console = Console()

# Rich progress bars with real-time monitoring
with Progress(
    TextColumn("[bold blue]{task.description}"),
    BarColumn(),
    TaskProgressColumn(),
    TimeRemainingColumn(),
) as progress:
    task = progress.add_task("Transcribing...", total=100)

    # Real-time performance monitoring
    stats = get_performance_stats()
    console.print(f"CPU: {stats['cpu_percent']}% | Memory: {stats['memory_used_gb']}GB")
```
### 8. Protocol-Based CLI Architecture
**Why it works**: Enables easy testing, modular design, and service integration.
```python
from typing import Optional, Protocol

class TranscriptionCommandProtocol(Protocol):
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        ...

class EnhancedTranscribeCommand:
    async def execute_transcription(self, input_path: str, **kwargs) -> Optional[str]:
        # Implementation with progress reporting
        ...
```
## Failed Patterns to Avoid
### 1. Streaming Processing

- Network interruptions cause failures
- Can't retry without full re-download
- Much slower than local processing

### 2. Mock-Heavy Testing

- Mocked services behave differently
- Don't catch real edge cases
- False confidence in tests

### 3. Parallel Frontend/Backend Development

- Caused integration issues
- Sequential development proved superior
- Get data layer right before UI

### 4. Complex Export Formats

- Multi-format system was high maintenance
- JSON + TXT backup strategy worked best
- Reduced complexity by 80%

### 5. Multiple Transcript Sources

- Tried to merge YouTube captions with Whisper
- Added complexity without quality improvement
- Single source (Whisper) proved more reliable

## Performance Optimizations That Worked

### M3 Optimization

- **Model**: distil-large-v3 (20-70x speed improvement)
- **Compute**: int8_float32 for CPU optimization
- **Chunking**: 10-minute segments with overlap

### Audio Preprocessing

- **Sample Rate**: 16kHz conversion (3x data reduction)
- **Channels**: Mono conversion (2x data reduction)
- **VAD**: Voice Activity Detection to skip silence

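The sample-rate and channel conversions above map directly onto standard ffmpeg flags. A sketch that builds the invocation (the helper name is illustrative; `-ar`/`-ac` are real ffmpeg options):

```python
from pathlib import Path

def ffmpeg_preprocess_cmd(src: Path, dst: Path) -> list[str]:
    # Standardize to 16 kHz mono WAV before transcription
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-ar", "16000",  # resample to 16 kHz (~3x data reduction vs 48 kHz)
        "-ac", "1",      # downmix to mono (2x data reduction)
        str(dst),
    ]
```

Run it with `subprocess.run(cmd, check=True)`; silence skipping is typically applied later at transcription time (e.g. faster-whisper's `vad_filter` option) rather than in this step.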
### Caching Strategy

- **Embeddings**: 24h TTL (stable for long periods)
- **Analysis**: 7d TTL (expensive multi-agent results)
- **Queries**: 6h TTL (RAG results)
- **Compression**: LZ4 for storage efficiency

## Critical Success Factors

### For AI Code Generation Consistency

1. **Explicit Rules File**: Like DATABASE_MODIFICATION_CHECKLIST.md
2. **Approval Gates**: Each major change requires permission
3. **Test-First**: Write test, then implementation
4. **Single Responsibility**: One task at a time
5. **Context Limits**: Keep docs under 600 LOC

### For Media Processing Reliability

1. **Always Download First**: Never stream
2. **Standardize Early**: Convert to 16kHz mono WAV
3. **Chunk Large Files**: 10-minute segments with overlap
4. **Cache Aggressively**: Transcriptions are expensive
5. **Simple Formats**: JSON + TXT only

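Chunking with overlap (point 3) reduces to a boundary calculation. A sketch; the document fixes the 10-minute segment size, while the 15-second overlap default is an assumption:

```python
def chunk_spans(duration_s: float, chunk_s: float = 600.0,
                overlap_s: float = 15.0) -> list[tuple[float, float]]:
    # Each window after the first re-covers the last `overlap_s` seconds,
    # so words cut at a boundary appear in two chunks and can be
    # deduplicated when the chunk transcripts are merged.
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s
    return spans
```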
### For Project Success

1. **Backend-First**: Get data layer right
2. **CLI Before GUI**: Test via command line
3. **Modular Services**: Each service independent
4. **Progressive Enhancement**: Start simple, add features
5. **Document Decisions**: Track why choices were made

## Iterative Pipeline Architecture

### Version Progression (v1 → v2 → v3 → v4)

**v1: Basic Transcription (Weeks 1-2)**

```python
async def transcribe_v1(audio_path: Path) -> dict:
    # Preprocess to 16kHz mono
    processed = await preprocess_audio(audio_path)

    # Whisper with M3 optimizations
    transcript = await whisper.transcribe(
        processed,
        model="distil-large-v3",
        compute_type="int8_float32",
    )

    return format_transcript_json(transcript)
```

- **Targets**: 95% accuracy, <30s for 5min audio, <2GB memory

**v2: AI Enhancement (Week 3)**

```python
async def transcribe_v2(audio_path: Path) -> dict:
    transcript = await transcribe_v1(audio_path)
    enhanced = await enhance_with_ai(transcript, model="deepseek-chat")

    return {
        "raw": transcript,
        "enhanced": enhanced,
        "version": "v2",
    }
```

- **Targets**: 99% accuracy, <35s processing, <5s enhancement time

**v3: Multi-Pass Accuracy (Weeks 4-5)**

```python
async def transcribe_v3(audio_path: Path) -> dict:
    passes = []
    for i in range(3):
        transcript = await transcribe_v1_with_params(
            audio_path,
            temperature=0.0 + (0.2 * i),
            beam_size=2 + i,
        )
        passes.append(transcript)

    merged = await merge_transcripts(passes, strategy="confidence_weighted")
    enhanced = await enhance_with_ai(merged)

    return {
        "raw": merged,
        "enhanced": enhanced,
        "passes": passes,
        "confidence_scores": calculate_confidence(passes),
        "version": "v3",
    }
```

- **Targets**: 99.5% accuracy, <25s processing, confidence scores

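`merge_transcripts` with the `confidence_weighted` strategy is not defined in the original. A heavily simplified sketch, assuming the passes return already-aligned segment lists with per-segment confidences (real implementations must first align segments across passes, which is the hard part):

```python
def merge_confidence_weighted(passes: list[list[dict]]) -> list[dict]:
    # For each aligned segment slot, keep the hypothesis with the
    # highest confidence across the three passes.
    return [max(candidates, key=lambda s: s["confidence"])
            for candidates in zip(*passes)]
```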
**v4: Speaker Diarization (Week 6+)**

```python
async def transcribe_v4(audio_path: Path) -> dict:
    transcript = await transcribe_v3(audio_path)
    diarization = await diarize_audio(audio_path, max_speakers=10)
    labeled_transcript = await assign_speakers(transcript, diarization)

    return {
        "transcript": labeled_transcript,
        "speakers": await create_speaker_profiles(audio_path, diarization),
        "diarization": diarization,
        "version": "v4",
    }
```

- **Targets**: 90% speaker accuracy, <30s processing, max 10 speakers

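`assign_speakers` is likewise undefined in the original. One common approach is maximum-overlap assignment between transcript segments and diarization turns; a sketch under that assumption, not the project's implementation:

```python
def assign_speakers(segments: list[dict],
                    turns: list[tuple[float, float, str]]) -> list[dict]:
    # Label each transcript segment with the diarization turn that
    # overlaps it the most; turns are (start_s, end_s, speaker_id).
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: min(seg["end"], t[1]) - max(seg["start"], t[0]),
            default=None,
        )
        labeled.append({**seg, "speaker": best[2] if best else "unknown"})
    return labeled
```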
## Success Metrics

### Technical KPIs

| Metric | v1 Target | v2 Target | v3 Target | v4 Target |
|--------|-----------|-----------|-----------|-----------|
| **Accuracy** | 95% | 99% | 99.5% | 99.5% |
| **Speed (5min audio)** | <30s | <35s | <25s | <30s |
| **Batch capacity** | 10 files | 50 files | 100 files | 100 files |
| **Memory usage** | <2GB | <2GB | <3GB | <4GB |
| **Error rate** | <5% | <3% | <1% | <1% |

### Business KPIs

- **Adoption**: Active usage by Week 4
- **Reliability**: 99% success rate after v2
- **Performance**: 3x faster than YouTube Summarizer
- **Cost**: <$0.01 per transcript with caching
- **Scale**: Handle 1000+ files/day by v3

## Development Workflow

### Phase 1: Foundation (Weeks 1-2)

- PostgreSQL database operational
- Basic Whisper transcription working
- Batch processing for 10+ files
- JSON/TXT export functional
- CLI with basic commands

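"Batch processing for 10+ files" usually means bounded concurrency rather than an unbounded `gather`. A sketch with a hypothetical stub standing in for the real transcription service:

```python
import asyncio

async def transcribe_one(path: str) -> dict:
    # Stub in place of the real Whisper service call
    await asyncio.sleep(0)
    return {"path": path, "text": "..."}

async def transcribe_batch(paths: list[str], limit: int = 4) -> list[dict]:
    # Semaphore caps concurrent jobs so a large batch doesn't exhaust memory
    sem = asyncio.Semaphore(limit)

    async def bounded(path: str) -> dict:
        async with sem:
            return await transcribe_one(path)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(p) for p in paths))
```

The `limit=4` default is an assumption; tune it to the machine's memory and the model's footprint.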
### Phase 2: Enhancement (Week 3)

- DeepSeek integration complete
- Enhancement templates working
- Progress tracking implemented
- Quality validation checks

### Phase 3: Optimization (Weeks 4-5)

- Multi-pass implementation
- Confidence scoring system
- Performance metrics dashboard
- Batch optimization

### Phase 4: Advanced Features (Week 6+)

- Speaker diarization working
- Voice embedding database
- Caching layer operational

---

*Last Updated: 2024*
*Patterns Version: 1.0*