trax/docs/architecture/iterative-pipeline.md

423 lines
11 KiB
Markdown

# Iterative Pipeline Architecture
## Version Evolution: v1 → v2 → v3 → v4
### Overview
The Trax transcription pipeline is designed for clean iteration, where each version builds upon the previous without breaking changes. This allows for gradual feature addition and performance improvement while maintaining backward compatibility.
## Pipeline Versions
### Version 1: Basic Transcription
**Timeline**: Weeks 1-2
**Focus**: Core functionality
```python
async def transcribe_v1(audio_path: Path) -> dict:
"""Basic transcription with optimizations"""
# Step 1: Preprocess audio to 16kHz mono
processed_audio = await preprocess_audio(audio_path)
# Step 2: Run through Whisper
transcript = await whisper.transcribe(
processed_audio,
model="distil-large-v3",
compute_type="int8_float32"
)
# Step 3: Format as JSON
return format_transcript_json(transcript)
```
**Features**:
- Single-pass Whisper transcription
- M3 optimizations (distil-large-v3 model)
- Audio preprocessing (16kHz mono)
- JSON output format
- Basic error handling
**Performance Targets**:
- 5-minute audio: <30 seconds
- Accuracy: 95%
- Memory usage: <2GB
### Version 2: AI Enhancement
**Timeline**: Week 3
**Focus**: Quality improvement
```python
async def transcribe_v2(audio_path: Path) -> dict:
"""v1 + AI enhancement"""
# Get base transcript
transcript = await transcribe_v1(audio_path)
# Apply AI enhancement
enhanced = await enhance_with_ai(
transcript,
model="deepseek-chat",
prompt=ENHANCEMENT_PROMPT
)
# Store both versions
return {
"raw": transcript,
"enhanced": enhanced,
"version": "v2"
}
```
**New Features**:
- DeepSeek AI enhancement
- Punctuation and capitalization correction
- Technical term fixes
- Paragraph formatting
- Quality scoring
**Performance Targets**:
- 5-minute audio: <35 seconds
- Accuracy: 99%
- Enhancement time: <5 seconds
### Version 3: Multi-Pass Accuracy
**Timeline**: Weeks 4-5
**Focus**: Maximum accuracy
```python
async def transcribe_v3(audio_path: Path) -> dict:
"""v2 + multiple passes for accuracy"""
passes = []
# Run multiple passes with different parameters
for i in range(3):
transcript = await transcribe_v1_with_params(
audio_path,
temperature=0.0 + (0.2 * i),
beam_size=2 + i,
best_of=3 + i
)
passes.append(transcript)
# Merge best segments based on confidence
merged = await merge_transcripts(
passes,
strategy="confidence_weighted"
)
# Apply enhancement
enhanced = await enhance_with_ai(merged)
return {
"raw": merged,
"enhanced": enhanced,
"passes": passes,
"confidence_scores": calculate_confidence(passes),
"version": "v3"
}
```
**New Features**:
- Multiple transcription passes
- Confidence scoring per segment
- Best segment selection
- Parameter variation
- Consensus building
**Performance Targets**:
- 5-minute audio: <25 seconds (parallel passes)
- Accuracy: 99.5%
- Confidence reporting: Per segment
### Version 4: Speaker Diarization
**Timeline**: Week 6+
**Focus**: Speaker separation
```python
async def transcribe_v4(audio_path: Path) -> dict:
"""v3 + speaker diarization"""
# Get high-quality transcript
transcript = await transcribe_v3(audio_path)
# Perform speaker diarization
diarization = await diarize_audio(
audio_path,
max_speakers=10,
min_duration=1.0
)
# Assign speakers to transcript segments
labeled_transcript = await assign_speakers(
transcript,
diarization
)
# Create speaker profiles
profiles = await create_speaker_profiles(
audio_path,
diarization
)
return {
"transcript": labeled_transcript,
"speakers": profiles,
"diarization": diarization,
"version": "v4"
}
```
**New Features**:
- Speaker separation
- Voice embeddings
- Speaker time tracking
- Turn-taking analysis
- Speaker profiles
**Performance Targets**:
- 5-minute audio: <30 seconds
- Speaker accuracy: 90%
- Max speakers: 10
## Pipeline Orchestration
### Intelligent Version Selection
```python
class PipelineOrchestrator:
"""Manages pipeline version selection and execution"""
def __init__(self, config: Config):
self.config = config
self.metrics = MetricsCollector()
async def process(self, audio_path: Path) -> dict:
"""Process through appropriate pipeline version"""
start_time = time.time()
# Determine version based on config
version = self.config.PIPELINE_VERSION
# Route to appropriate pipeline
if version == "v1":
result = await transcribe_v1(audio_path)
elif version == "v2":
result = await transcribe_v2(audio_path)
elif version == "v3":
result = await transcribe_v3(audio_path)
elif version == "v4":
result = await transcribe_v4(audio_path)
else:
# Auto-select based on requirements
result = await self.auto_select(audio_path)
# Track metrics
self.metrics.track(
version=version,
file=audio_path,
duration=time.time() - start_time
)
return result
async def auto_select(self, audio_path: Path) -> dict:
"""Automatically select best pipeline version"""
file_size = audio_path.stat().st_size
# Use v1 for quick processing
if self.config.SPEED_PRIORITY:
return await transcribe_v1(audio_path)
# Use v3 for accuracy
if self.config.ACCURACY_PRIORITY:
return await transcribe_v3(audio_path)
# Use v4 for multi-speaker
if await self.detect_multiple_speakers(audio_path):
return await transcribe_v4(audio_path)
# Default to v2 for balance
return await transcribe_v2(audio_path)
```
## Version Compatibility
### Database Schema Evolution
```sql
-- v1: Basic schema
CREATE TABLE transcripts_v1 (
id UUID PRIMARY KEY,
media_file_id UUID,
raw_content JSONB,
created_at TIMESTAMP
);
-- v2: Add enhancement
ALTER TABLE transcripts_v1
ADD COLUMN enhanced_content JSONB;
-- v3: Add multi-pass data
ALTER TABLE transcripts_v1
ADD COLUMN multipass_content JSONB,
ADD COLUMN confidence_scores JSONB;
-- v4: Add diarization
ALTER TABLE transcripts_v1
ADD COLUMN diarized_content JSONB,
ADD COLUMN speaker_profiles JSONB;
-- Rename for clarity
ALTER TABLE transcripts_v1 RENAME TO transcripts;
```
### Version Migration
```python
class VersionMigrator:
"""Handles transcript version migrations"""
async def upgrade_transcript(
self,
transcript_id: str,
target_version: str
) -> dict:
"""Upgrade transcript to higher version"""
current = await self.get_transcript(transcript_id)
current_version = current.get("version", "v1")
if current_version >= target_version:
return current # Already at or above target
# Progressive upgrade
result = current
for version in self.get_upgrade_path(current_version, target_version):
result = await self.upgrade_to_version(result, version)
return result
def get_upgrade_path(self, from_version: str, to_version: str) -> List[str]:
"""Get ordered list of versions to upgrade through"""
versions = ["v1", "v2", "v3", "v4"]
start_idx = versions.index(from_version)
end_idx = versions.index(to_version)
return versions[start_idx + 1:end_idx + 1]
```
## Testing Strategy
### Version-Specific Tests
```python
# tests/test_pipeline_v1.py
@pytest.mark.asyncio
async def test_v1_basic_transcription():
audio = Path("tests/fixtures/audio/sample_5s.wav")
result = await transcribe_v1(audio)
assert "text" in result
assert len(result["text"]) > 0
assert result["duration"] == pytest.approx(5.0, rel=0.1)
# tests/test_pipeline_v2.py
@pytest.mark.asyncio
async def test_v2_enhancement():
audio = Path("tests/fixtures/audio/sample_30s.mp3")
result = await transcribe_v2(audio)
assert "enhanced" in result
assert result["enhanced"]["text"] != result["raw"]["text"]
assert has_proper_punctuation(result["enhanced"]["text"])
# tests/test_pipeline_v3.py
@pytest.mark.asyncio
async def test_v3_multipass():
audio = Path("tests/fixtures/audio/sample_2m.mp4")
result = await transcribe_v3(audio)
assert "confidence_scores" in result
assert len(result["passes"]) == 3
assert all(score > 0.8 for score in result["confidence_scores"])
# tests/test_pipeline_v4.py
@pytest.mark.asyncio
async def test_v4_diarization():
audio = Path("tests/fixtures/audio/sample_conversation.wav")
result = await transcribe_v4(audio)
assert "speakers" in result
assert len(result["speakers"]) >= 2
assert all("speaker_" in segment for segment in result["transcript"])
```
### Compatibility Tests
```python
@pytest.mark.asyncio
async def test_version_compatibility():
"""Ensure all versions can process same file"""
audio = Path("tests/fixtures/audio/sample_30s.mp3")
v1_result = await transcribe_v1(audio)
v2_result = await transcribe_v2(audio)
v3_result = await transcribe_v3(audio)
v4_result = await transcribe_v4(audio)
# All should produce valid transcripts
assert all(r.get("text") or r.get("raw", {}).get("text")
for r in [v1_result, v2_result, v3_result, v4_result])
# Higher versions should be more accurate
assert len(v2_result["enhanced"]["text"]) >= len(v1_result["text"])
assert v3_result["confidence_scores"][0] >= 0.8
```
## Performance Benchmarks
### Expected Performance by Version
| Metric | v1 | v2 | v3 | v4 |
|--------|----|----|----|
----|
| **5-min audio processing** | 30s | 35s | 25s | 30s |
| **Accuracy** | 95% | 99% | 99.5% | 99.5% |
| **Memory usage** | 2GB | 2GB | 3GB | 4GB |
| **Cost per transcript** | $0.001 | $0.005 | $0.008 | $0.010 |
| **Parallel capability** | 10 files | 10 files | 5 files | 3 files |
## Configuration
### Version Selection Config
```python
# .env configuration
PIPELINE_VERSION=v2 # Default version
ENABLE_ENHANCEMENT=true
ENABLE_MULTIPASS=false
ENABLE_DIARIZATION=false
# Feature flags for gradual rollout
MULTIPASS_BETA_USERS=["user1", "user2"]
DIARIZATION_BETA_USERS=["user3"]
# Performance tuning
MAX_PARALLEL_JOBS=4
CHUNK_SIZE_SECONDS=600
BEAM_SIZE=2
TEMPERATURE=0.0
```
## Summary
The iterative pipeline architecture provides:
1. **Clean progression** from basic to advanced features
2. **No breaking changes** between versions
3. **Flexible deployment** with feature flags
4. **Performance tracking** across versions
5. **Easy rollback** if issues arise
6. **Clear testing** strategy per version
Each version is production-ready and can be deployed independently, allowing for gradual feature rollout and risk mitigation.
---
*Last Updated: 2024*
*Architecture Version: 1.0*