# Iterative Pipeline Architecture

## Version Evolution: v1 → v2 → v3 → v4

### Overview

The Trax transcription pipeline is designed for clean iteration, where each version builds upon the previous without breaking changes. This allows for gradual feature addition and performance improvement while maintaining backward compatibility.

## Pipeline Versions

### Version 1: Basic Transcription
**Timeline**: Weeks 1-2
**Focus**: Core functionality

```python
from pathlib import Path


async def transcribe_v1(audio_path: Path) -> dict:
    """Basic transcription with optimizations"""
    # Step 1: Preprocess audio to 16kHz mono
    processed_audio = await preprocess_audio(audio_path)

    # Step 2: Run through Whisper
    transcript = await whisper.transcribe(
        processed_audio,
        model="distil-large-v3",
        compute_type="int8_float32"
    )

    # Step 3: Format as JSON
    return format_transcript_json(transcript)
```
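
The preprocessing step is not spelled out above. As a rough sketch, assuming `ffmpeg` is the tool doing the conversion (the source does not say which tool `preprocess_audio` uses), the 16kHz mono conversion could be expressed as a command line built like this. The helper name `build_ffmpeg_command` and the file names are hypothetical.

```python
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path, sample_rate: int = 16_000) -> list[str]:
    """Build an ffmpeg argument list that converts `src` to mono audio at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it already exists
        "-i", str(src),            # input file
        "-ac", "1",                # downmix to a single (mono) channel
        "-ar", str(sample_rate),   # resample to the target rate
        str(dst),
    ]


cmd = build_ffmpeg_command(Path("talk.mp3"), Path("talk_16k.wav"))
print(" ".join(cmd))  # ffmpeg -y -i talk.mp3 -ac 1 -ar 16000 talk_16k.wav
```

The list form can be passed to `subprocess.run(cmd, check=True)` without shell quoting concerns.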

**Features**:
- Single-pass Whisper transcription
- M3 optimizations (distil-large-v3 model)
- Audio preprocessing (16kHz mono)
- JSON output format
- Basic error handling

**Performance Targets**:
- 5-minute audio: <30 seconds
- Accuracy: 95%
- Memory usage: <2GB

### Version 2: AI Enhancement
**Timeline**: Week 3
**Focus**: Quality improvement

```python
async def transcribe_v2(audio_path: Path) -> dict:
    """v1 + AI enhancement"""
    # Get base transcript
    transcript = await transcribe_v1(audio_path)

    # Apply AI enhancement
    enhanced = await enhance_with_ai(
        transcript,
        model="deepseek-chat",
        prompt=ENHANCEMENT_PROMPT
    )

    # Store both versions
    return {
        "raw": transcript,
        "enhanced": enhanced,
        "version": "v2"
    }
```
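
Because v2 keeps the raw transcript alongside the enhanced one, a slow or failing enhancement call does not have to fail the whole request. The sketch below shows one possible way to enforce an enhancement time budget with `asyncio.wait_for` and fall back to the raw transcript. The function `enhance_or_fallback`, the `enhancement_status` field, and the stub enhancers are assumptions for illustration, not part of the pipeline described here.

```python
import asyncio


async def enhance_or_fallback(transcript: dict, enhancer, timeout_s: float = 5.0) -> dict:
    """Return raw and enhanced transcripts; reuse the raw one if enhancement is slow or fails."""
    try:
        enhanced = await asyncio.wait_for(enhancer(transcript), timeout=timeout_s)
        status = "ok"
    except asyncio.TimeoutError:
        enhanced, status = transcript, "timeout"
    except Exception:
        enhanced, status = transcript, "error"
    return {
        "raw": transcript,
        "enhanced": enhanced,
        "enhancement_status": status,  # hypothetical field, not in the original schema
        "version": "v2",
    }


async def fast_enhancer(transcript: dict) -> dict:
    # Stand-in for a quick model call.
    return {"text": transcript["text"].capitalize() + "."}


async def slow_enhancer(transcript: dict) -> dict:
    # Stand-in for a model call that exceeds the time budget.
    await asyncio.sleep(1.0)
    return {"text": transcript["text"].capitalize() + "."}


result_fast = asyncio.run(enhance_or_fallback({"text": "hello world"}, fast_enhancer))
result_slow = asyncio.run(
    enhance_or_fallback({"text": "hello world"}, slow_enhancer, timeout_s=0.05)
)
print(result_fast["enhancement_status"], result_slow["enhancement_status"])
```

Recording why the fallback happened makes it possible to tell an unenhanced transcript apart from one the model simply left unchanged.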

**New Features**:
- DeepSeek AI enhancement
- Punctuation and capitalization correction
- Technical term fixes
- Paragraph formatting
- Quality scoring

**Performance Targets**:
- 5-minute audio: <35 seconds
- Accuracy: 99%
- Enhancement time: <5 seconds

### Version 3: Multi-Pass Accuracy
**Timeline**: Weeks 4-5
**Focus**: Maximum accuracy

```python
async def transcribe_v3(audio_path: Path) -> dict:
    """v2 + multiple passes for accuracy"""
    passes = []

    # Run multiple passes with different parameters
    for i in range(3):
        transcript = await transcribe_v1_with_params(
            audio_path,
            temperature=0.0 + (0.2 * i),
            beam_size=2 + i,
            best_of=3 + i
        )
        passes.append(transcript)

    # Merge best segments based on confidence
    merged = await merge_transcripts(
        passes,
        strategy="confidence_weighted"
    )

    # Apply enhancement
    enhanced = await enhance_with_ai(merged)

    return {
        "raw": merged,
        "enhanced": enhanced,
        "passes": passes,
        "confidence_scores": calculate_confidence(passes),
        "version": "v3"
    }
```
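
How `merge_transcripts` combines the passes is not shown. A deliberately simple stand-in is best-segment selection: for each segment position, keep the candidate with the highest confidence. This is not a true weighted vote, and it assumes every pass yields the same segmentation, with each segment carrying `text` and `confidence` keys. Both the data shape and the function name are assumptions.

```python
def merge_by_confidence(passes: list[list[dict]]) -> list[dict]:
    """For each segment position, keep the candidate with the highest confidence."""
    if not passes:
        return []
    merged = []
    # zip(*passes) groups the i-th segment of every pass together
    # (and stops at the shortest pass if lengths differ).
    for candidates in zip(*passes):
        merged.append(max(candidates, key=lambda seg: seg["confidence"]))
    return merged


passes = [
    [{"text": "hello world", "confidence": 0.91}, {"text": "this is tracks", "confidence": 0.62}],
    [{"text": "hello word", "confidence": 0.74}, {"text": "this is Trax", "confidence": 0.88}],
    [{"text": "hello world", "confidence": 0.89}, {"text": "this is tracks", "confidence": 0.70}],
]
merged = merge_by_confidence(passes)
print([seg["text"] for seg in merged])  # ['hello world', 'this is Trax']
```

Real passes may segment the audio differently, in which case segments would first need aligning by timestamp.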

**New Features**:
- Multiple transcription passes
- Confidence scoring per segment
- Best segment selection
- Parameter variation
- Consensus building

**Performance Targets**:
- 5-minute audio: <25 seconds (parallel passes)
- Accuracy: 99.5%
- Confidence reporting: Per segment

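The timing target above assumes the passes run in parallel. One way to launch them concurrently is `asyncio.gather`, sketched below with a stub in place of the real transcription call. The names `run_passes` and `fake_transcribe` are hypothetical. The parameter schedule mirrors the v3 loop.

```python
import asyncio


async def run_passes(audio_path: str, transcribe, n_passes: int = 3) -> list[dict]:
    """Launch all passes concurrently and return their results in pass order."""
    tasks = [
        transcribe(
            audio_path,
            temperature=round(0.2 * i, 1),  # 0.0, 0.2, 0.4
            beam_size=2 + i,
            best_of=3 + i,
        )
        for i in range(n_passes)
    ]
    # gather preserves the order of the awaitables it was given.
    return await asyncio.gather(*tasks)


async def fake_transcribe(audio_path: str, *, temperature: float, beam_size: int, best_of: int) -> dict:
    # Stub standing in for the real transcription call.
    await asyncio.sleep(0.01)
    return {"temperature": temperature, "beam_size": beam_size, "best_of": best_of}


results = asyncio.run(run_passes("sample.wav", fake_transcribe))
print([r["beam_size"] for r in results])  # [2, 3, 4]
```

Concurrency only shortens wall-clock time if the transcription call yields to the event loop, for example by running the model in a worker thread or process. A compute-bound call made directly inside the coroutine would still run the passes one after another.
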
### Version 4: Speaker Diarization
**Timeline**: Week 6+
**Focus**: Speaker separation

```python
async def transcribe_v4(audio_path: Path) -> dict:
    """v3 + speaker diarization"""
    # Get high-quality transcript
    transcript = await transcribe_v3(audio_path)

    # Perform speaker diarization
    diarization = await diarize_audio(
        audio_path,
        max_speakers=10,
        min_duration=1.0
    )

    # Assign speakers to transcript segments
    labeled_transcript = await assign_speakers(
        transcript,
        diarization
    )

    # Create speaker profiles
    profiles = await create_speaker_profiles(
        audio_path,
        diarization
    )

    return {
        "transcript": labeled_transcript,
        "speakers": profiles,
        "diarization": diarization,
        "version": "v4"
    }
```
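
The speaker assignment step can be illustrated with a common heuristic: label each transcript segment with the speaker whose diarization turn overlaps it the most in time. The sketch below assumes segments and turns are dictionaries with `start` and `end` in seconds. That shape, and the function names, are assumptions rather than the actual `assign_speakers` contract.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers_by_overlap(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Label each segment with the speaker whose turn overlaps it the longest."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        # Segments with no overlapping turn keep speaker=None.
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


segments = [
    {"start": 0.0, "end": 4.0, "text": "Welcome to the show."},
    {"start": 4.0, "end": 9.0, "text": "Thanks for having me."},
]
turns = [
    {"start": 0.0, "end": 4.5, "speaker": "speaker_0"},
    {"start": 4.5, "end": 9.0, "speaker": "speaker_1"},
]
labeled = assign_speakers_by_overlap(segments, turns)
print([seg["speaker"] for seg in labeled])  # ['speaker_0', 'speaker_1']
```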

**New Features**:
- Speaker separation
- Voice embeddings
- Speaker time tracking
- Turn-taking analysis
- Speaker profiles

**Performance Targets**:
- 5-minute audio: <30 seconds
- Speaker accuracy: 90%
- Max speakers: 10

## Pipeline Orchestration

### Intelligent Version Selection

```python
import time
from pathlib import Path


class PipelineOrchestrator:
    """Manages pipeline version selection and execution"""

    def __init__(self, config: Config):
        self.config = config
        self.metrics = MetricsCollector()

    async def process(self, audio_path: Path) -> dict:
        """Process through appropriate pipeline version"""
        # perf_counter is monotonic, so it is safe for measuring durations
        start_time = time.perf_counter()

        # Determine version based on config
        version = self.config.PIPELINE_VERSION

        # Route to appropriate pipeline
        if version == "v1":
            result = await transcribe_v1(audio_path)
        elif version == "v2":
            result = await transcribe_v2(audio_path)
        elif version == "v3":
            result = await transcribe_v3(audio_path)
        elif version == "v4":
            result = await transcribe_v4(audio_path)
        else:
            # Auto-select based on requirements
            result = await self.auto_select(audio_path)

        # Track metrics
        self.metrics.track(
            version=version,
            file=audio_path,
            duration=time.perf_counter() - start_time
        )

        return result

    async def auto_select(self, audio_path: Path) -> dict:
        """Automatically select best pipeline version"""
        file_size = audio_path.stat().st_size

        # Use v1 for quick processing
        if self.config.SPEED_PRIORITY:
            return await transcribe_v1(audio_path)

        # Use v3 for accuracy
        if self.config.ACCURACY_PRIORITY:
            return await transcribe_v3(audio_path)

        # Use v4 for multi-speaker
        if await self.detect_multiple_speakers(audio_path):
            return await transcribe_v4(audio_path)

        # Default to v2 for balance
        return await transcribe_v2(audio_path)
```
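
A dispatch table is one alternative to the `if`/`elif` chain. In the sketch below, the pipeline is looked up by name and the version that actually ran is returned alongside the result, so metrics can record it even when the configured value was not a known version. The stubs, the `PIPELINES` registry, and the simple default fallback (in place of `auto_select`) are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

Pipeline = Callable[[str], Awaitable[dict]]


async def stub_v1(audio_path: str) -> dict:
    # Stand-in for transcribe_v1.
    return {"version": "v1"}


async def stub_v2(audio_path: str) -> dict:
    # Stand-in for transcribe_v2.
    return {"version": "v2"}


PIPELINES: dict[str, Pipeline] = {"v1": stub_v1, "v2": stub_v2}


async def run_pipeline(version: str, audio_path: str, default: str = "v2") -> tuple[str, dict]:
    """Look up the pipeline by name, falling back to `default` for unknown values."""
    used = version if version in PIPELINES else default
    result = await PIPELINES[used](audio_path)
    return used, result


used, result = asyncio.run(run_pipeline("auto", "sample.wav"))
print(used, result)  # v2 {'version': 'v2'}
```

With this shape, adding a new version means registering one more entry rather than adding another branch.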

## Version Compatibility

### Database Schema Evolution

```sql
-- v1: Basic schema
CREATE TABLE transcripts_v1 (
    id UUID PRIMARY KEY,
    media_file_id UUID,
    raw_content JSONB,
    created_at TIMESTAMP
);

-- v2: Add enhancement
ALTER TABLE transcripts_v1
    ADD COLUMN enhanced_content JSONB;

-- v3: Add multi-pass data
ALTER TABLE transcripts_v1
    ADD COLUMN multipass_content JSONB,
    ADD COLUMN confidence_scores JSONB;

-- v4: Add diarization
ALTER TABLE transcripts_v1
    ADD COLUMN diarized_content JSONB,
    ADD COLUMN speaker_profiles JSONB;

-- Rename for clarity
ALTER TABLE transcripts_v1 RENAME TO transcripts;
```
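
Every schema change above is additive: new nullable columns, with no rewrite of existing rows. The sketch below demonstrates that property with Python's built-in `sqlite3` as a stand-in for the real database, so `TEXT` replaces `UUID` and `JSONB` and the table is simplified. A row written under the v1 schema stays readable after the v2 column is added and reports `NULL` for the new field.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# v1-style schema (simplified; SQLite has no UUID or JSONB types).
conn.execute("CREATE TABLE transcripts (id TEXT PRIMARY KEY, raw_content TEXT)")
conn.execute(
    "INSERT INTO transcripts VALUES (?, ?)",
    ("t1", json.dumps({"text": "hello"})),
)

# v2-style additive change: a new nullable column, existing rows untouched.
conn.execute("ALTER TABLE transcripts ADD COLUMN enhanced_content TEXT")

row = conn.execute(
    "SELECT raw_content, enhanced_content FROM transcripts WHERE id = ?", ("t1",)
).fetchone()
print(row)  # ('{"text": "hello"}', None)
```

Readers of the table should therefore treat a missing `enhanced_content` as "not yet enhanced" rather than as an error.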

### Version Migration

```python
from typing import List

VERSIONS = ["v1", "v2", "v3", "v4"]


class VersionMigrator:
    """Handles transcript version migrations"""

    async def upgrade_transcript(
        self,
        transcript_id: str,
        target_version: str
    ) -> dict:
        """Upgrade transcript to higher version"""
        current = await self.get_transcript(transcript_id)
        current_version = current.get("version", "v1")

        # Compare by position in VERSIONS rather than as strings
        if VERSIONS.index(current_version) >= VERSIONS.index(target_version):
            return current  # Already at or above target

        # Progressive upgrade
        result = current
        for version in self.get_upgrade_path(current_version, target_version):
            result = await self.upgrade_to_version(result, version)

        return result

    def get_upgrade_path(self, from_version: str, to_version: str) -> List[str]:
        """Get ordered list of versions to upgrade through"""
        start_idx = VERSIONS.index(from_version)
        end_idx = VERSIONS.index(to_version)
        return VERSIONS[start_idx + 1:end_idx + 1]
```
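
To make the upgrade path concrete, here is a standalone version of the slicing logic with a few sample calls. It also shows why version labels are safer to compare by their position in the ordered list than as plain strings: string comparison is lexicographic, so a future `"v10"` would sort before `"v2"`. The module-level `upgrade_path` function is an illustrative rewrite, not a replacement for the class method.

```python
VERSIONS = ["v1", "v2", "v3", "v4"]


def upgrade_path(from_version: str, to_version: str) -> list[str]:
    """Versions to pass through, excluding the start and including the target."""
    start, end = VERSIONS.index(from_version), VERSIONS.index(to_version)
    return VERSIONS[start + 1 : end + 1]


print(upgrade_path("v1", "v3"))  # ['v2', 'v3']
print(upgrade_path("v3", "v3"))  # []
print(upgrade_path("v4", "v2"))  # [] (downgrades yield an empty path)

# Lexicographic comparison misorders multi-digit versions.
print("v10" >= "v2")  # False
```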

## Testing Strategy

### Version-Specific Tests

```python
from pathlib import Path

import pytest


# tests/test_pipeline_v1.py
@pytest.mark.asyncio
async def test_v1_basic_transcription():
    audio = Path("tests/fixtures/audio/sample_5s.wav")
    result = await transcribe_v1(audio)

    assert "text" in result
    assert len(result["text"]) > 0
    assert result["duration"] == pytest.approx(5.0, rel=0.1)


# tests/test_pipeline_v2.py
@pytest.mark.asyncio
async def test_v2_enhancement():
    audio = Path("tests/fixtures/audio/sample_30s.mp3")
    result = await transcribe_v2(audio)

    assert "enhanced" in result
    assert result["enhanced"]["text"] != result["raw"]["text"]
    assert has_proper_punctuation(result["enhanced"]["text"])


# tests/test_pipeline_v3.py
@pytest.mark.asyncio
async def test_v3_multipass():
    audio = Path("tests/fixtures/audio/sample_2m.mp4")
    result = await transcribe_v3(audio)

    assert "confidence_scores" in result
    assert len(result["passes"]) == 3
    assert all(score > 0.8 for score in result["confidence_scores"])


# tests/test_pipeline_v4.py
@pytest.mark.asyncio
async def test_v4_diarization():
    audio = Path("tests/fixtures/audio/sample_conversation.wav")
    result = await transcribe_v4(audio)

    assert "speakers" in result
    assert len(result["speakers"]) >= 2
    assert all("speaker_" in segment for segment in result["transcript"])
```
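
The v2 test relies on a `has_proper_punctuation` helper that is not defined in this document. One possible heuristic, offered only as a sketch of what such a check might look like: the text must end with sentence-final punctuation, and every sentence must start with a capital letter (or a non-letter such as a digit).

```python
import re


def has_proper_punctuation(text: str) -> bool:
    """Heuristic check: text ends with . ! or ? and each sentence starts capitalized."""
    stripped = text.strip()
    if not stripped or stripped[-1] not in ".!?":
        return False
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", stripped)
    return all(s[0].isupper() or not s[0].isalpha() for s in sentences)


print(has_proper_punctuation("Hello there. How are you?"))  # True
print(has_proper_punctuation("hello there how are you"))    # False
```

A heuristic like this will misjudge abbreviations such as "e.g. foo", so it suits smoke tests better than quality scoring.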

### Compatibility Tests

```python
@pytest.mark.asyncio
async def test_version_compatibility():
    """Ensure all versions can process same file"""
    audio = Path("tests/fixtures/audio/sample_30s.mp3")

    v1_result = await transcribe_v1(audio)
    v2_result = await transcribe_v2(audio)
    v3_result = await transcribe_v3(audio)
    v4_result = await transcribe_v4(audio)

    # All should produce valid transcripts
    # (v1 exposes "text", v2/v3 nest it under "raw", v4 returns "transcript")
    assert all(
        r.get("text") or r.get("raw", {}).get("text") or r.get("transcript")
        for r in [v1_result, v2_result, v3_result, v4_result]
    )

    # Higher versions should not lose content and should report confidence
    assert len(v2_result["enhanced"]["text"]) >= len(v1_result["text"])
    assert v3_result["confidence_scores"][0] >= 0.8
```

## Performance Benchmarks

### Expected Performance by Version

| Metric | v1 | v2 | v3 | v4 |
|--------|----|----|----|----|
| **5-min audio processing** | 30s | 35s | 25s | 30s |
| **Accuracy** | 95% | 99% | 99.5% | 99.5% |
| **Memory usage** | 2GB | 2GB | 3GB | 4GB |
| **Cost per transcript** | $0.001 | $0.005 | $0.008 | $0.010 |
| **Parallel capability** | 10 files | 10 files | 5 files | 3 files |

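The processing-time targets can also be read as a real-time factor (processing time divided by audio duration), which makes them easier to compare against recordings of other lengths. The short calculation below uses only the 5-minute targets from the table. Extrapolating to other durations assumes processing time scales linearly, which the source does not claim.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; lower is faster."""
    return processing_seconds / audio_seconds


AUDIO_SECONDS = 5 * 60
targets = {"v1": 30, "v2": 35, "v3": 25, "v4": 30}  # seconds, from the table above

rtf = {version: round(real_time_factor(s, AUDIO_SECONDS), 3) for version, s in targets.items()}
print(rtf)  # {'v1': 0.1, 'v2': 0.117, 'v3': 0.083, 'v4': 0.1}
```
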
## Configuration

### Version Selection Config

```bash
# .env configuration
PIPELINE_VERSION=v2  # Default version
ENABLE_ENHANCEMENT=true
ENABLE_MULTIPASS=false
ENABLE_DIARIZATION=false

# Feature flags for gradual rollout
MULTIPASS_BETA_USERS='["user1", "user2"]'
DIARIZATION_BETA_USERS='["user3"]'

# Performance tuning
MAX_PARALLEL_JOBS=4
CHUNK_SIZE_SECONDS=600
BEAM_SIZE=2
TEMPERATURE=0.0
```
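
Environment values always arrive as strings, so the application has to convert them. The sketch below shows one way to parse a subset of the settings above into a typed object, treating the beta-user lists as JSON arrays. The `PipelineConfig` dataclass and `load_config` function are hypothetical, and the defaults simply mirror the sample values.

```python
import json
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PipelineConfig:
    pipeline_version: str = "v2"
    enable_multipass: bool = False
    multipass_beta_users: tuple[str, ...] = ()
    max_parallel_jobs: int = 4
    temperature: float = 0.0


def load_config(env: Mapping[str, str]) -> PipelineConfig:
    """Build a typed config from string-valued environment settings."""
    return PipelineConfig(
        pipeline_version=env.get("PIPELINE_VERSION", "v2"),
        enable_multipass=_as_bool(env.get("ENABLE_MULTIPASS", "false")),
        multipass_beta_users=tuple(json.loads(env.get("MULTIPASS_BETA_USERS", "[]"))),
        max_parallel_jobs=int(env.get("MAX_PARALLEL_JOBS", "4")),
        temperature=float(env.get("TEMPERATURE", "0.0")),
    )


config = load_config({
    "PIPELINE_VERSION": "v3",
    "ENABLE_MULTIPASS": "true",
    "MULTIPASS_BETA_USERS": '["user1", "user2"]',
})
print(config)
```

In an application, `load_config(os.environ)` would read the live environment. Passing a plain mapping keeps the function easy to test.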

## Summary

The iterative pipeline architecture provides:

1. **Clean progression** from basic to advanced features
2. **No breaking changes** between versions
3. **Flexible deployment** with feature flags
4. **Performance tracking** across versions
5. **Easy rollback** if issues arise
6. **Clear testing** strategy per version

Each version is production-ready and can be deployed independently, allowing for gradual feature rollout and risk mitigation.

---

*Last Updated: 2024*
*Architecture Version: 1.0*