Iterative Pipeline Architecture

Version Evolution: v1 → v2 → v3 → v4

Overview

The Trax transcription pipeline is designed for clean iteration, where each version builds upon the previous without breaking changes. This allows for gradual feature addition and performance improvement while maintaining backward compatibility.

Pipeline Versions

Version 1: Basic Transcription

Timeline: Weeks 1-2
Focus: Core functionality

async def transcribe_v1(audio_path: Path) -> dict:
    """Basic transcription with optimizations"""
    # Step 1: Preprocess audio to 16kHz mono
    processed_audio = await preprocess_audio(audio_path)
    
    # Step 2: Run through Whisper
    transcript = await whisper.transcribe(
        processed_audio,
        model="distil-large-v3",
        compute_type="int8_float32"
    )
    
    # Step 3: Format as JSON
    return format_transcript_json(transcript)

Features:

  • Single-pass Whisper transcription
  • M3 optimizations (distil-large-v3 model)
  • Audio preprocessing (16kHz mono)
  • JSON output format
  • Basic error handling
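The 16kHz mono preprocessing step is commonly done with ffmpeg. The following is a minimal sketch of how the command could be built, assuming ffmpeg is installed on the host. The helper name `build_ffmpeg_command` and the file names are hypothetical; the source does not show how `preprocess_audio` is implemented.

```python
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg invocation that resamples to 16 kHz mono.

    Hypothetical helper; the real preprocess_audio may work differently.
    """
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", str(src),  # input file
        "-ar", "16000",  # resample to 16 kHz
        "-ac", "1",      # downmix to a single channel (mono)
        str(dst),
    ]


cmd = build_ffmpeg_command(Path("talk.mp3"), Path("talk_16k.wav"))
print(" ".join(cmd))
```

In an async pipeline the resulting list could be passed to `asyncio.create_subprocess_exec(*cmd)` so that preprocessing does not block the event loop.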

Performance Targets:

  • 5-minute audio: <30 seconds
  • Accuracy: 95%
  • Memory usage: <2GB

Version 2: AI Enhancement

Timeline: Week 3
Focus: Quality improvement

async def transcribe_v2(audio_path: Path) -> dict:
    """v1 + AI enhancement"""
    # Get base transcript
    transcript = await transcribe_v1(audio_path)
    
    # Apply AI enhancement
    enhanced = await enhance_with_ai(
        transcript,
        model="deepseek-chat",
        prompt=ENHANCEMENT_PROMPT
    )
    
    # Store both versions
    return {
        "raw": transcript,
        "enhanced": enhanced,
        "version": "v2"
    }

New Features:

  • DeepSeek AI enhancement
  • Punctuation and capitalization correction
  • Technical term fixes
  • Paragraph formatting
  • Quality scoring

Performance Targets:

  • 5-minute audio: <35 seconds
  • Accuracy: 99%
  • Enhancement time: <5 seconds
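One way to hold the enhancement step to its 5-second budget is to wrap the call in a timeout and fall back to the raw transcript. This is a sketch under assumptions: the source does not say what v2 does when enhancement is slow, and `enhance_or_fallback` and the two stand-in enhancers are hypothetical names.

```python
import asyncio


async def enhance_or_fallback(transcript: dict, enhancer, budget_s: float = 5.0) -> dict:
    """Run the enhancer, but keep only the raw transcript if it exceeds the budget."""
    try:
        enhanced = await asyncio.wait_for(enhancer(transcript), timeout=budget_s)
    except asyncio.TimeoutError:
        enhanced = None  # enhancement skipped; the caller still gets the raw text
    return {"raw": transcript, "enhanced": enhanced, "version": "v2"}


async def slow_enhancer(transcript: dict) -> dict:
    # Stand-in for a slow model call (hypothetical).
    await asyncio.sleep(1.0)
    return {"text": transcript["text"].capitalize() + "."}


async def fast_enhancer(transcript: dict) -> dict:
    # Stand-in for a model call that returns immediately (hypothetical).
    return {"text": transcript["text"].capitalize() + "."}


raw = {"text": "hello world"}
timed_out = asyncio.run(enhance_or_fallback(raw, slow_enhancer, budget_s=0.05))
ok = asyncio.run(enhance_or_fallback(raw, fast_enhancer, budget_s=0.05))
print(timed_out["enhanced"], ok["enhanced"])
```

Returning `None` for `enhanced` keeps the result shape stable, so callers can check one key instead of handling an exception.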

Version 3: Multi-Pass Accuracy

Timeline: Weeks 4-5
Focus: Maximum accuracy

async def transcribe_v3(audio_path: Path) -> dict:
    """v2 + multiple passes for accuracy"""
    # Run three passes concurrently, each with different decoding parameters,
    # to match the "parallel passes" target below (requires `import asyncio`)
    passes = await asyncio.gather(*(
        transcribe_v1_with_params(
            audio_path,
            temperature=0.2 * i,
            beam_size=2 + i,
            best_of=3 + i
        )
        for i in range(3)
    ))
    
    # Merge best segments based on confidence
    merged = await merge_transcripts(
        passes,
        strategy="confidence_weighted"
    )
    
    # Apply enhancement
    enhanced = await enhance_with_ai(merged)
    
    return {
        "raw": merged,
        "enhanced": enhanced,
        "passes": passes,
        "confidence_scores": calculate_confidence(passes),
        "version": "v3"
    }

New Features:

  • Multiple transcription passes
  • Confidence scoring per segment
  • Best segment selection
  • Parameter variation
  • Consensus building

Performance Targets:

  • 5-minute audio: <25 seconds (parallel passes)
  • Accuracy: 99.5%
  • Confidence reporting: Per segment
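Best segment selection can be illustrated with a simple rule: for each segment position, keep the candidate with the highest confidence. This is a sketch of plain highest-confidence selection, not necessarily the `confidence_weighted` strategy named above. It assumes every pass yields the same number of segments aligned by index, and that each segment carries a `confidence` field; both are assumptions, as is the name `merge_by_confidence`.

```python
def merge_by_confidence(passes: list[list[dict]]) -> list[dict]:
    """For each segment position, keep the candidate with the highest confidence.

    Assumes all passes have the same number of segments, aligned by index.
    """
    merged = []
    for candidates in zip(*passes):
        merged.append(max(candidates, key=lambda seg: seg["confidence"]))
    return merged


# Three passes over the same two segments (made-up data for illustration).
passes = [
    [{"text": "hello world", "confidence": 0.91}, {"text": "this is tracks", "confidence": 0.62}],
    [{"text": "hello word", "confidence": 0.74}, {"text": "this is Trax", "confidence": 0.88}],
    [{"text": "hello world", "confidence": 0.89}, {"text": "this is tracks", "confidence": 0.70}],
]
merged = merge_by_confidence(passes)
print([seg["text"] for seg in merged])  # ['hello world', 'this is Trax']
```

Real passes rarely split audio at identical boundaries, so a production merge would likely need to align segments by timestamp first.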

Version 4: Speaker Diarization

Timeline: Week 6+
Focus: Speaker separation

async def transcribe_v4(audio_path: Path) -> dict:
    """v3 + speaker diarization"""
    # Get high-quality transcript
    transcript = await transcribe_v3(audio_path)
    
    # Perform speaker diarization
    diarization = await diarize_audio(
        audio_path,
        max_speakers=10,
        min_duration=1.0
    )
    
    # Assign speakers to transcript segments
    labeled_transcript = await assign_speakers(
        transcript,
        diarization
    )
    
    # Create speaker profiles
    profiles = await create_speaker_profiles(
        audio_path,
        diarization
    )
    
    return {
        "transcript": labeled_transcript,
        "speakers": profiles,
        "diarization": diarization,
        "version": "v4"
    }

New Features:

  • Speaker separation
  • Voice embeddings
  • Speaker time tracking
  • Turn-taking analysis
  • Speaker profiles

Performance Targets:

  • 5-minute audio: <30 seconds
  • Speaker accuracy: 90%
  • Max speakers: 10
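Assigning speakers to transcript segments usually comes down to comparing time ranges. The sketch below labels each segment with the speaker whose turn overlaps it the most. The field names (`start`, `end`, `speaker`) and the function `assign_speakers_by_overlap` are assumptions for illustration; the actual shapes returned by `diarize_audio` and `assign_speakers` are not shown in the source.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers_by_overlap(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
            default=None,
        )
        has_overlap = best is not None and overlap(
            seg["start"], seg["end"], best["start"], best["end"]
        ) > 0
        # Segments that no turn covers are left unlabeled.
        labeled.append({**seg, "speaker": best["speaker"] if has_overlap else None})
    return labeled


segments = [
    {"start": 0.0, "end": 4.0, "text": "Welcome back."},
    {"start": 4.0, "end": 9.0, "text": "Thanks for having me."},
]
turns = [
    {"start": 0.0, "end": 4.5, "speaker": "speaker_0"},
    {"start": 4.5, "end": 10.0, "speaker": "speaker_1"},
]
labeled = assign_speakers_by_overlap(segments, turns)
print([seg["speaker"] for seg in labeled])  # ['speaker_0', 'speaker_1']
```

The second segment overlaps both turns (0.5 s with `speaker_0`, 4.5 s with `speaker_1`), which is why the maximum-overlap rule matters.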

Pipeline Orchestration

Intelligent Version Selection

import time
from pathlib import Path


class PipelineOrchestrator:
    """Manages pipeline version selection and execution"""
    
    def __init__(self, config: Config):
        self.config = config
        self.metrics = MetricsCollector()
    
    async def process(self, audio_path: Path) -> dict:
        """Process through appropriate pipeline version"""
        start_time = time.time()
        
        # Determine version based on config
        version = self.config.PIPELINE_VERSION
        
        # Route to appropriate pipeline
        if version == "v1":
            result = await transcribe_v1(audio_path)
        elif version == "v2":
            result = await transcribe_v2(audio_path)
        elif version == "v3":
            result = await transcribe_v3(audio_path)
        elif version == "v4":
            result = await transcribe_v4(audio_path)
        else:
            # Auto-select based on requirements
            result = await self.auto_select(audio_path)
        
        # Track metrics
        self.metrics.track(
            version=version,
            file=audio_path,
            duration=time.time() - start_time
        )
        
        return result
    
    async def auto_select(self, audio_path: Path) -> dict:
        """Automatically select the best pipeline version.

        Precedence: speed, then accuracy, then multi-speaker detection;
        v2 is the balanced default.
        """
        
        # Use v1 for quick processing
        if self.config.SPEED_PRIORITY:
            return await transcribe_v1(audio_path)
        
        # Use v3 for accuracy
        if self.config.ACCURACY_PRIORITY:
            return await transcribe_v3(audio_path)
        
        # Use v4 for multi-speaker
        if await self.detect_multiple_speakers(audio_path):
            return await transcribe_v4(audio_path)
        
        # Default to v2 for balance
        return await transcribe_v2(audio_path)
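The precedence in `auto_select` can be isolated as a pure function, which makes it easy to unit test and to log the version that was actually chosen. This is a sketch: `choose_version` is a hypothetical name, and its three boolean parameters stand in for `SPEED_PRIORITY`, `ACCURACY_PRIORITY`, and the result of `detect_multiple_speakers`.

```python
def choose_version(speed_priority: bool, accuracy_priority: bool, multiple_speakers: bool) -> str:
    """Mirror the precedence in auto_select: speed, then accuracy, then speakers."""
    if speed_priority:
        return "v1"
    if accuracy_priority:
        return "v3"
    if multiple_speakers:
        return "v4"
    return "v2"  # balanced default


print(choose_version(False, False, False))  # v2
print(choose_version(False, True, True))    # v3
```

Note the consequence of the ordering: with accuracy priority set, a multi-speaker file is routed to v3 and is not diarized.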

Version Compatibility

Database Schema Evolution

-- v1: Basic schema
CREATE TABLE transcripts_v1 (
    id UUID PRIMARY KEY,
    media_file_id UUID,
    raw_content JSONB,
    created_at TIMESTAMP
);

-- v2: Add enhancement
ALTER TABLE transcripts_v1 
ADD COLUMN enhanced_content JSONB;

-- v3: Add multi-pass data
ALTER TABLE transcripts_v1
ADD COLUMN multipass_content JSONB,
ADD COLUMN confidence_scores JSONB;

-- v4: Add diarization
ALTER TABLE transcripts_v1
ADD COLUMN diarized_content JSONB,
ADD COLUMN speaker_profiles JSONB;

-- Rename for clarity
ALTER TABLE transcripts_v1 RENAME TO transcripts;
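Because each version only adds nullable columns, rows written by an older pipeline stay valid after every migration. The sketch below shows one possible mapping from a pipeline result to those columns. The mapping itself is an assumption (the source does not say which result key lands in which column), and `result_to_columns` is a hypothetical helper.

```python
import json

# Assumed mapping from result keys to column names; not confirmed by the source.
KEY_TO_COLUMN = {
    "raw": "raw_content",
    "enhanced": "enhanced_content",
    "passes": "multipass_content",
    "confidence_scores": "confidence_scores",
    "transcript": "diarized_content",
    "speakers": "speaker_profiles",
}


def result_to_columns(result: dict) -> dict:
    """Serialize the parts of a pipeline result that have a matching column."""
    if "version" not in result:
        # v1 returns the formatted transcript itself, so store it whole.
        return {"raw_content": json.dumps(result)}
    return {
        column: json.dumps(result[key])
        for key, column in KEY_TO_COLUMN.items()
        if key in result
    }


v1_row = result_to_columns({"text": "hello world"})
v2_row = result_to_columns(
    {"raw": {"text": "hello world"}, "enhanced": {"text": "Hello, world."}, "version": "v2"}
)
print(sorted(v1_row), sorted(v2_row))
```

Columns that a result does not populate are simply omitted from the insert and remain NULL.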

Version Migration

class VersionMigrator:
    """Handles transcript version migrations"""
    
    async def upgrade_transcript(
        self,
        transcript_id: str,
        target_version: str
    ) -> dict:
        """Upgrade transcript to higher version"""
        current = await self.get_transcript(transcript_id)
        current_version = current.get("version", "v1")
        
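        # Compare by position in the version order rather than as strings
        # (lexicographically, "v10" sorts before "v2")
        if not self.get_upgrade_path(current_version, target_version):
            return current  # Already at or above target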
        
        # Progressive upgrade
        result = current
        for version in self.get_upgrade_path(current_version, target_version):
            result = await self.upgrade_to_version(result, version)
        
        return result
    
    def get_upgrade_path(self, from_version: str, to_version: str) -> list[str]:
        """Get ordered list of versions to upgrade through"""
        versions = ["v1", "v2", "v3", "v4"]
        start_idx = versions.index(from_version)
        end_idx = versions.index(to_version)
        return versions[start_idx + 1:end_idx + 1]
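A standalone copy of the path calculation makes the slicing behavior easy to check. The function below follows the same logic as `get_upgrade_path`; only the module-level name `upgrade_path` is introduced here for illustration.

```python
VERSIONS = ["v1", "v2", "v3", "v4"]


def upgrade_path(from_version: str, to_version: str) -> list[str]:
    """Versions to step through, excluding the start and including the target."""
    start = VERSIONS.index(from_version)
    end = VERSIONS.index(to_version)
    return VERSIONS[start + 1:end + 1]


print(upgrade_path("v1", "v3"))  # ['v2', 'v3']
print(upgrade_path("v2", "v2"))  # []
print(upgrade_path("v4", "v2"))  # []
```

An empty path means there is nothing to do, which covers both "already at target" and "target is older than current". An unknown version string raises `ValueError` from `list.index`.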

Testing Strategy

Version-Specific Tests

# tests/test_pipeline_v1.py
@pytest.mark.asyncio
async def test_v1_basic_transcription():
    audio = Path("tests/fixtures/audio/sample_5s.wav")
    result = await transcribe_v1(audio)
    
    assert "text" in result
    assert len(result["text"]) > 0
    assert result["duration"] == pytest.approx(5.0, rel=0.1)

# tests/test_pipeline_v2.py
@pytest.mark.asyncio
async def test_v2_enhancement():
    audio = Path("tests/fixtures/audio/sample_30s.mp3")
    result = await transcribe_v2(audio)
    
    assert "enhanced" in result
    assert result["enhanced"]["text"] != result["raw"]["text"]
    assert has_proper_punctuation(result["enhanced"]["text"])

# tests/test_pipeline_v3.py
@pytest.mark.asyncio
async def test_v3_multipass():
    audio = Path("tests/fixtures/audio/sample_2m.mp4")
    result = await transcribe_v3(audio)
    
    assert "confidence_scores" in result
    assert len(result["passes"]) == 3
    assert all(score > 0.8 for score in result["confidence_scores"])

# tests/test_pipeline_v4.py
@pytest.mark.asyncio
async def test_v4_diarization():
    audio = Path("tests/fixtures/audio/sample_conversation.wav")
    result = await transcribe_v4(audio)
    
    assert "speakers" in result
    assert len(result["speakers"]) >= 2
    assert all("speaker_" in segment for segment in result["transcript"])

Compatibility Tests

@pytest.mark.asyncio
async def test_version_compatibility():
    """Ensure all versions can process same file"""
    audio = Path("tests/fixtures/audio/sample_30s.mp3")
    
    v1_result = await transcribe_v1(audio)
    v2_result = await transcribe_v2(audio)
    v3_result = await transcribe_v3(audio)
    v4_result = await transcribe_v4(audio)
    
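    # All should produce a transcript. v1 returns it at the top level, v2 and v3
    # under "raw", and v4 under "transcript".
    assert all(r.get("text") or r.get("raw", {}).get("text") or r.get("transcript")
               for r in [v1_result, v2_result, v3_result, v4_result])
    
    # Sanity checks: enhancement should not drop content, and the first
    # v3 confidence score should clear the 0.8 threshold
    assert len(v2_result["enhanced"]["text"]) >= len(v1_result["text"])
    assert v3_result["confidence_scores"][0] >= 0.8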
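Since the result shape differs between versions, tests and callers may find it convenient to read the text through one accessor. This is a sketch covering the v1 to v3 shapes shown earlier; `extract_text` is a hypothetical helper, and v4 is left out because the structure of its labeled transcript is not specified in the source.

```python
def extract_text(result: dict) -> str:
    """Return the best available text from a v1, v2, or v3 result."""
    # Prefer enhanced text, then the raw transcript, then the v1 top-level text.
    for key in ("enhanced", "raw"):
        part = result.get(key)
        if isinstance(part, dict) and part.get("text"):
            return part["text"]
    return result.get("text", "")


v1_result = {"text": "hello world"}
v2_result = {"raw": {"text": "hello world"}, "enhanced": {"text": "Hello, world."}, "version": "v2"}

print(extract_text(v1_result))  # hello world
print(extract_text(v2_result))  # Hello, world.
```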

Performance Benchmarks

Expected Performance by Version

| Metric | v1 | v2 | v3 | v4 |
|--------|----|----|----|----|
| 5-min audio processing | 30s | 35s | 25s | 30s |
| Accuracy | 95% | 99% | 99.5% | 99.5% |
| Memory usage | 2GB | 2GB | 3GB | 4GB |
| Cost per transcript | $0.001 | $0.005 | $0.008 | $0.010 |
| Parallel capability | 10 files | 10 files | 5 files | 3 files |
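The processing targets are easier to compare as multiples of real time. The short calculation below derives that ratio and the cost per 1,000 transcripts from the table values; nothing here is measured data.

```python
AUDIO_SECONDS = 5 * 60  # the benchmark clip is 5 minutes long

# Values copied from the table above.
target_seconds = {"v1": 30, "v2": 35, "v3": 25, "v4": 30}
cost_usd = {"v1": 0.001, "v2": 0.005, "v3": 0.008, "v4": 0.010}

# How many seconds of audio are processed per second of wall-clock time.
speedups = {v: AUDIO_SECONDS / s for v, s in target_seconds.items()}
cost_per_thousand = {v: c * 1000 for v, c in cost_usd.items()}

for version in target_seconds:
    print(
        f"{version}: {speedups[version]:.1f}x real time, "
        f"${cost_per_thousand[version]:.2f} per 1,000 transcripts"
    )
```

For example, the v1 target of 30 seconds for a 300-second clip corresponds to 10x real time.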

Configuration

Version Selection Config

# .env configuration
PIPELINE_VERSION=v2  # Default version
ENABLE_ENHANCEMENT=true
ENABLE_MULTIPASS=false
ENABLE_DIARIZATION=false

# Feature flags for gradual rollout
MULTIPASS_BETA_USERS=["user1", "user2"]
DIARIZATION_BETA_USERS=["user3"]

# Performance tuning
MAX_PARALLEL_JOBS=4
CHUNK_SIZE_SECONDS=600
BEAM_SIZE=2
TEMPERATURE=0.0
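Environment values arrive as strings, so booleans, numbers, and the JSON-style user lists need explicit parsing. The following is a minimal sketch of a loader for a few of the settings above. `PipelineSettings`, `parse_bool`, and `load_settings` are hypothetical names; the source does not show how Trax reads its configuration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PipelineSettings:
    pipeline_version: str = "v2"
    enable_multipass: bool = False
    multipass_beta_users: list[str] = field(default_factory=list)
    max_parallel_jobs: int = 4
    temperature: float = 0.0


def parse_bool(value: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: dict[str, str]) -> PipelineSettings:
    """Build settings from an environment-like mapping (for example os.environ)."""
    return PipelineSettings(
        pipeline_version=env.get("PIPELINE_VERSION", "v2"),
        enable_multipass=parse_bool(env.get("ENABLE_MULTIPASS", "false")),
        multipass_beta_users=json.loads(env.get("MULTIPASS_BETA_USERS", "[]")),
        max_parallel_jobs=int(env.get("MAX_PARALLEL_JOBS", "4")),
        temperature=float(env.get("TEMPERATURE", "0.0")),
    )


settings = load_settings({
    "PIPELINE_VERSION": "v3",
    "ENABLE_MULTIPASS": "true",
    "MULTIPASS_BETA_USERS": '["user1", "user2"]',
})
print(settings)
```

Passing a plain dict instead of reading `os.environ` directly keeps the loader easy to test.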

Summary

The iterative pipeline architecture provides:

  1. Clean progression from basic to advanced features
  2. No breaking changes between versions
  3. Flexible deployment with feature flags
  4. Performance tracking across versions
  5. Easy rollback if issues arise
  6. Clear testing strategy per version

Each version is production-ready and can be deployed independently, allowing for gradual feature rollout and risk mitigation.


Last Updated: 2024
Architecture Version: 1.0