trax/docs/reports/05-technical-migration.md

Checkpoint 5: Technical Migration Report

uv Package Manager Migration & Development Setup

1. Migration from pip to uv

Current State

  • Trax already has pyproject.toml configured for uv
  • Basic [tool.uv] section present
  • Development dependencies defined
  • Virtual environment in .venv/

Migration Steps

Phase 1: Core Dependencies
[project]
name = "trax"
version = "0.1.0"
description = "Media transcription platform with iterative enhancement"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "python-dotenv>=1.0.0",
    "sqlalchemy>=2.0.0",
    "alembic>=1.13.0",
    "psycopg2-binary>=2.9.0",
    "pydantic>=2.0.0",
    "click>=8.1.0",
    "rich>=13.0.0",  # For CLI output
    # Note: asyncio ships with the Python 3.11+ standard library;
    # do not add the obsolete "asyncio" PyPI backport as a dependency.
]
Phase 2: Transcription Dependencies
# Appended to the [project] dependencies list above; the "+=" notation
# is illustrative only, since TOML has no append operator.
dependencies += [
    "faster-whisper>=1.0.0",
    "yt-dlp>=2024.0.0",
    "ffmpeg-python>=0.2.0",
    "pydub>=0.25.0",
    "librosa>=0.10.0",  # Audio analysis
    "numpy>=1.24.0",
    "scipy>=1.11.0",
]
Phase 3: AI Enhancement
dependencies += [
    "openai>=1.0.0",     # For DeepSeek API
    "aiohttp>=3.9.0",
    "tenacity>=8.2.0",   # Retry logic
    "jinja2>=3.1.0",     # Templates
]
Phase 4: Advanced Features
dependencies += [
    "pyannote.audio>=3.0.0",  # Speaker diarization
    "torch>=2.0.0",           # For ML models
    "torchaudio>=2.0.0",
]
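
The setup script in Section 4 installs with uv pip install -e ".[dev]", which requires a dev extra that the phased fragments above do not define. A hedged sketch of that extra, with packages and version floors inferred from the tooling configured in Section 3 (exact pins are assumptions):

```toml
[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",  # needed for asyncio_mode = "auto"
    "pytest-cov>=4.1.0",       # needed for the --cov addopts
    "black>=24.0.0",
    "ruff>=0.4.0",
    "mypy>=1.8.0",
    "factory-boy>=3.3.0",      # assumption: backs tests/factories/
]
```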

Migration Commands

# Initial setup
cd apps/trax
uv venv                               # Create venv
source .venv/bin/activate             # Activate
uv pip compile pyproject.toml -o requirements.txt  # Generate lock file
uv pip sync requirements.txt                       # Install exactly the locked set

# Adding new packages: add them to [project] dependencies in pyproject.toml, then
uv pip compile pyproject.toml -o requirements.txt
uv pip sync requirements.txt

# Development workflow
uv run pytest                        # Run tests
uv run python src/cli/main.py       # Run CLI
uv run black src/ tests/            # Format code
uv run ruff check src/ tests/       # Lint
uv run mypy src/                    # Type check

2. Documentation Consolidation

Current Documentation Status

  • CLAUDE.md: 97 lines (well under 600 limit)
  • AGENTS.md: 163 lines (well under 600 limit)
  • Total: 260 lines (340 lines of headroom against the combined 600-line target of the consolidation plan below)

Consolidation Strategy

Enhanced CLAUDE.md (~400 lines)
# CLAUDE.md

## Project Context (existing ~50 lines)
## Architecture Overview (NEW ~100 lines)
  - Service protocols
  - Pipeline versions (v1-v4)
  - Database schema
  - Batch processing design
## Essential Commands (existing ~30 lines)
## Development Workflow (NEW ~80 lines)
  - Iteration strategy
  - Testing approach
  - Batch processing
  - Version management
## API Reference (NEW ~80 lines)
  - CLI commands
  - Service interfaces
  - Protocol definitions
## Performance Targets (NEW ~40 lines)
  - Speed benchmarks
  - Accuracy goals
  - Resource limits
Enhanced AGENTS.md (~200 lines)
# AGENTS.md

## Development Rules (NEW ~50 lines)
  - Links to rule files
  - Quick reference
## Setup Commands (existing ~40 lines)
## Code Style (existing ~30 lines)
## Common Workflows (existing ~40 lines)
## Troubleshooting (NEW ~40 lines)

3. Code Quality Standards

Tool Configuration

# pyproject.toml additions

[tool.black]
line-length = 100
target-version = ['py311']
include = '\.pyi?$'
extend-exclude = '''
/(
  migrations
  | .venv
  | data
)/
'''

[tool.ruff]
line-length = 100
exclude = ["migrations", ".venv", "data"]
fix = true

# Recent ruff versions expect rule selection under [tool.ruff.lint]
[tool.ruff.lint]
select = ["E", "F", "I", "N", "W", "B", "C90", "D"]
ignore = ["E501", "D100", "D104"]
fixable = ["ALL"]

[tool.mypy]
python_version = "3.11"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
ignore_missing_imports = true
plugins = ["pydantic.mypy", "sqlalchemy.ext.mypy.plugin"]
exclude = ["migrations", "tests"]

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
addopts = """
-v 
--cov=src 
--cov-report=html 
--cov-report=term 
--tb=short
"""
asyncio_mode = "auto"
markers = [
    "unit: Unit tests",
    "integration: Integration tests",
    "slow: Slow tests (>5s)",
    "batch: Batch processing tests",
]

[tool.coverage.run]
omit = [
    "*/tests/*",
    "*/migrations/*",
    "*/__pycache__/*",
]

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "def __repr__",
    "raise AssertionError",
    "raise NotImplementedError",
    "if __name__ == .__main__.:",
]

4. Development Environment Setup

Setup Script (scripts/setup_dev.sh)

#!/bin/bash
set -e

echo "🚀 Setting up Trax development environment..."

# Color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Check Python version
python_version=$(python3 --version | cut -d' ' -f2 | cut -d'.' -f1,2)
required_version="3.11"
if [ "$(printf '%s\n' "$required_version" "$python_version" | sort -V | head -n1)" != "$required_version" ]; then
    echo -e "${RED}❌ Python 3.11+ required (found $python_version)${NC}"
    exit 1
fi
echo -e "${GREEN}✅ Python $python_version${NC}"

# Install uv if needed
if ! command -v uv &> /dev/null; then
    echo -e "${YELLOW}📦 Installing uv...${NC}"
    curl -LsSf https://astral.sh/uv/install.sh | sh
    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"  # install dir varies by uv version
fi
echo -e "${GREEN}✅ uv installed${NC}"

# Setup virtual environment
echo -e "${YELLOW}🔧 Creating virtual environment...${NC}"
uv venv
source .venv/bin/activate

# Install dependencies
echo -e "${YELLOW}📚 Installing dependencies...${NC}"
uv pip install -e ".[dev]"

# Setup pre-commit hooks
echo -e "${YELLOW}🪝 Setting up pre-commit hooks...${NC}"
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/bash
source .venv/bin/activate
echo "Running pre-commit checks..."
uv run black --check src/ tests/
uv run ruff check src/ tests/
uv run mypy src/
EOF
chmod +x .git/hooks/pre-commit

# Create directories
echo -e "${YELLOW}📁 Creating project directories...${NC}"
mkdir -p data/{media,exports,cache}
mkdir -p tests/{unit,integration,fixtures/audio,fixtures/transcripts}
mkdir -p src/agents/rules
mkdir -p docs/{reports,team,architecture}

# Check PostgreSQL
if command -v psql &> /dev/null; then
    echo -e "${GREEN}✅ PostgreSQL installed${NC}"
else
    echo -e "${YELLOW}⚠️  PostgreSQL not found - please install${NC}"
fi

# Check FFmpeg
if command -v ffmpeg &> /dev/null; then
    echo -e "${GREEN}✅ FFmpeg installed${NC}"
else
    echo -e "${YELLOW}⚠️  FFmpeg not found - please install${NC}"
fi

# Setup test data
echo -e "${YELLOW}🎵 Setting up test fixtures...${NC}"
cat > tests/fixtures/README.md << 'EOF'
# Test Fixtures

Place test audio files here:
- sample_5s.wav (5-second test)
- sample_30s.mp3 (30-second test)
- sample_2m.mp4 (2-minute test)

These should be real audio files for testing.
EOF

echo -e "${GREEN}✅ Development environment ready!${NC}"
echo ""
echo "📝 Next steps:"
echo "  1. source .venv/bin/activate"
echo "  2. Set up PostgreSQL database"
echo "  3. Add test audio files to tests/fixtures/audio/"
echo "  4. uv run pytest  # Run tests"
echo "  5. uv run python src/cli/main.py --help  # Run CLI"

5. Database Migration Strategy

Alembic Setup

# alembic.ini
[alembic]
script_location = migrations
prepend_sys_path = .
version_path_separator = os
sqlalchemy.url = postgresql://localhost/trax

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
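
The sqlalchemy.url above is hardcoded; to keep the SQLite-for-development escape hatch described under Risk Mitigation, migrations/env.py can override it from the environment. A minimal sketch (the helper name is an assumption, not part of Alembic's API):

```python
import os

def get_database_url(default: str = "postgresql://localhost/trax") -> str:
    """Prefer DATABASE_URL from the environment, falling back to alembic.ini's value."""
    return os.environ.get("DATABASE_URL", default)

# In migrations/env.py, before running migrations:
# config.set_main_option("sqlalchemy.url", get_database_url())
```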

Migration Sequence

# Phase 1: Core tables
alembic revision -m "create_media_and_transcripts"
# Creates: media_files, transcripts, exports

# Phase 2: Batch processing
alembic revision -m "add_batch_processing"
# Creates: batch_jobs, batch_items

# Phase 3: Audio metadata
alembic revision -m "add_audio_metadata"
# Creates: audio_processing_metadata

# Phase 4: Enhancement tracking
alembic revision -m "add_enhancement_fields"
# Adds: enhanced_content column

# Phase 5: Multi-pass support
alembic revision -m "add_multipass_tables"
# Creates: multipass_runs

# Phase 6: Diarization
alembic revision -m "add_speaker_diarization"
# Creates: speaker_profiles

# Commands
alembic upgrade head    # Apply all migrations
alembic current        # Show current version
alembic history        # Show migration history
alembic downgrade -1   # Rollback one migration

6. Testing Infrastructure

Test File Structure

tests/
├── conftest.py
├── factories/
│   ├── __init__.py
│   ├── media_factory.py
│   ├── transcript_factory.py
│   └── batch_factory.py
├── fixtures/
│   ├── audio/
│   │   ├── sample_5s.wav
│   │   ├── sample_30s.mp3
│   │   └── sample_2m.mp4
│   └── transcripts/
│       └── expected_outputs.json
├── unit/
│   ├── test_protocols.py
│   ├── test_models.py
│   └── services/
│       ├── test_batch.py
│       └── test_whisper.py
└── integration/
    ├── test_pipeline_v1.py
    ├── test_batch_processing.py
    └── test_cli.py

Test Configuration (tests/conftest.py)

import pytest
from pathlib import Path
import asyncio
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Test database
TEST_DATABASE_URL = "postgresql://localhost/trax_test"

@pytest.fixture(scope="session")
def event_loop():
    """Create event loop for async tests"""
    loop = asyncio.get_event_loop_policy().new_event_loop()
    yield loop
    loop.close()

@pytest.fixture
def sample_audio_5s():
    """Real 5-second audio file"""
    return Path("tests/fixtures/audio/sample_5s.wav")

@pytest.fixture
def sample_video_2m():
    """Real 2-minute video file"""
    return Path("tests/fixtures/audio/sample_2m.mp4")

@pytest.fixture
def db_session():
    """Test database session"""
    engine = create_engine(TEST_DATABASE_URL)
    Session = sessionmaker(bind=engine)
    session = Session()
    yield session
    session.rollback()
    session.close()

# NO MOCKS - Use real files and services

7. CLI Development

Click-based CLI (src/cli/main.py)

import click
from pathlib import Path
from rich.console import Console
from rich.progress import Progress

console = Console()

@click.group()
@click.version_option(version="0.1.0")
def cli():
    """Trax media processing CLI"""
    pass

@cli.command()
@click.argument('input_path', type=click.Path(exists=True))
@click.option('--batch', is_flag=True, help='Process directory as batch')
@click.option('--version', default='v1', type=click.Choice(['v1', 'v2', 'v3', 'v4']))
@click.option('--output', '-o', type=click.Path(), help='Output directory')
def transcribe(input_path, batch, version, output):
    """Transcribe media file(s)"""
    with Progress() as progress:
        task = progress.add_task("[cyan]Processing...", total=100)
        # Implementation here
        progress.update(task, advance=50)
    
    console.print("[green]✓[/green] Transcription complete!")

@cli.command()
@click.argument('transcript_id')
@click.option('--format', '-f', default='json', type=click.Choice(['json', 'txt']))
@click.option('--output', '-o', type=click.Path())
def export(transcript_id, format, output):
    """Export transcript to file"""
    console.print(f"Exporting {transcript_id} as {format}...")
    # Implementation here

@cli.command()
def status():
    """Show batch processing status"""
    # Implementation here
    console.print("[bold]Active Jobs:[/bold]")

# Usage examples:
# trax transcribe video.mp4
# trax transcribe folder/ --batch
# trax export abc-123 --format txt
# trax status

Enhanced CLI Implementation (Completed - Task 4)

Status: COMPLETED

The enhanced CLI (src/cli/enhanced_cli.py) has been successfully implemented with comprehensive features:

Key Features Implemented:

  • Real-time Progress Reporting: Rich progress bars with time estimates
  • Performance Monitoring: Live CPU, memory, and temperature tracking
  • Intelligent Batch Processing: Concurrent execution with size-based queuing
  • Enhanced Error Handling: User-friendly error messages with actionable guidance
  • Multiple Export Formats: JSON, TXT, SRT, VTT support
  • Advanced Features: Optional speaker diarization and domain adaptation
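
The size-based queuing above can be sketched roughly as follows; function and parameter names are assumptions for illustration, not the actual enhanced_cli API:

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, Iterable, List

async def process_batch(
    paths: Iterable[Path],
    worker: Callable[[Path], Awaitable[str]],
    concurrency: int = 4,
) -> List[str]:
    """Run `worker` over files concurrently, smallest file first,
    with at most `concurrency` tasks in flight."""
    sem = asyncio.Semaphore(concurrency)
    # Size-based queuing: small files are dispatched first so early
    # results arrive quickly and progress feedback starts immediately.
    ordered = sorted(paths, key=lambda p: p.stat().st_size)

    async def run(path: Path) -> str:
        async with sem:
            return await worker(path)

    # gather preserves the size-sorted submission order in its results
    return await asyncio.gather(*(run(p) for p in ordered))
```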

Implementation Details:

# Enhanced CLI structure
class EnhancedCLI:
    """Main CLI with error handling and performance monitoring"""
    
class EnhancedTranscribeCommand:
    """Single file transcription with progress reporting"""
    
class EnhancedBatchCommand:
    """Batch processing with intelligent queuing"""

Usage Examples:

# Enhanced single file transcription
uv run python -m src.cli.enhanced_cli transcribe input.wav -m large -f srt

# Enhanced batch processing with 8 workers
uv run python -m src.cli.enhanced_cli batch ~/Podcasts -c 8 --diarize

# Academic processing with domain adaptation
uv run python -m src.cli.enhanced_cli transcribe lecture.mp3 --domain academic

Test Coverage: 19 comprehensive test cases with 100% pass rate
Code Quality: 483 lines with proper error handling and type hints
Integration: Seamless integration with existing transcription services


8. Performance Monitoring

Metrics Collection

# src/core/metrics.py
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict, List
import json

@dataclass
class PerformanceMetric:
    version: str
    file_name: str
    file_size_mb: float
    duration_seconds: float
    processing_time: float
    accuracy_score: float
    timestamp: datetime

class MetricsCollector:
    """Track performance across versions"""
    
    def __init__(self):
        self.metrics: List[PerformanceMetric] = []
    
    def track_transcription(
        self, 
        version: str, 
        file_path: Path,
        processing_time: float,
        accuracy: float | None = None
    ):
        """Record transcription metrics"""
        file_size = file_path.stat().st_size / (1024 * 1024)
        metric = PerformanceMetric(
            version=version,
            file_name=file_path.name,
            file_size_mb=file_size,
            duration_seconds=self.get_audio_duration(file_path),
            processing_time=processing_time,
            accuracy_score=accuracy or 0.0,
            timestamp=datetime.now()
        )
        self.metrics.append(metric)

    def get_audio_duration(self, file_path: Path) -> float:
        """Audio duration in seconds. Placeholder: a real implementation
        could shell out to ffprobe or use librosa.get_duration."""
        return 0.0

    def compare_versions(self) -> Dict[str, Dict]:
        """Compare performance across versions"""
        comparison = {}
        for version in ['v1', 'v2', 'v3', 'v4']:
            version_metrics = [m for m in self.metrics if m.version == version]
            if version_metrics:
                # "speed" here is mean wall-clock processing time (lower is better)
                avg_speed = sum(m.processing_time for m in version_metrics) / len(version_metrics)
                avg_accuracy = sum(m.accuracy_score for m in version_metrics) / len(version_metrics)
                comparison[version] = {
                    'avg_speed': avg_speed,
                    'avg_accuracy': avg_accuracy,
                    'sample_count': len(version_metrics)
                }
        return comparison
    
    def export_metrics(self, path: Path):
        """Export metrics to JSON"""
        data = [
            {
                'version': m.version,
                'file': m.file_name,
                'size_mb': m.file_size_mb,
                'duration': m.duration_seconds,
                'processing_time': m.processing_time,
                'accuracy': m.accuracy_score,
                'timestamp': m.timestamp.isoformat()
            }
            for m in self.metrics
        ]
        path.write_text(json.dumps(data, indent=2))

9. Migration Timeline

Week 1: Foundation

  • Day 1-2: uv setup, dependencies, project structure
  • Day 3-4: PostgreSQL setup, Alembic, initial schema
  • Day 5: Test infrastructure with real files

Week 2: Core Implementation

  • Day 1-2: Basic transcription service (v1)
  • Day 3-4: Batch processing system
  • Day 5: CLI implementation and testing

Week 3: Enhancement

  • Day 1-2: AI enhancement integration (v2)
  • Day 3-4: Documentation consolidation
  • Day 5: Performance benchmarking

Week 4+: Advanced Features

  • Multi-pass implementation (v3)
  • Speaker diarization (v4)
  • Optimization and refactoring

10. Risk Mitigation

Technical Risks

  1. uv compatibility issues

    • Mitigation: Keep pip requirements.txt as backup
    • Command: uv pip compile pyproject.toml -o requirements.txt
  2. PostgreSQL complexity

    • Mitigation: Start with SQLite option for development
    • Easy switch via DATABASE_URL
  3. Real test file size

    • Mitigation: Keep test files small (<5MB)
    • Use Git LFS if needed
  4. Whisper memory usage

    • Mitigation: Implement chunking early
    • Monitor memory during tests
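
For the chunking mitigation, the boundary arithmetic can be sketched independently of Whisper itself; the function name and the 30s/1s defaults are assumptions:

```python
from typing import Iterator, Tuple

def chunk_spans(
    duration_s: float, chunk_s: float = 30.0, overlap_s: float = 1.0
) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) second-offsets covering the audio, with a small
    overlap so words falling on a boundary appear in both chunks."""
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        yield (start, end)
        if end >= duration_s:
            break
        start = end - overlap_s  # back up so boundary words are not cut
```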

Process Risks

  1. Documentation drift

    • Mitigation: Update docs with each PR
    • Pre-commit hooks check doc size
  2. Version conflicts

    • Mitigation: Strict protocol compliance
    • Version tests for compatibility
  3. Performance regression

    • Mitigation: Benchmark each version
    • Metrics tracking from day 1

Summary

The technical migration plan provides:

  1. Clear uv migration path with phased dependencies
  2. Comprehensive development setup script
  3. Database migration strategy with Alembic
  4. Real file testing infrastructure
  5. CLI-first development with rich output
  6. Performance monitoring built-in
  7. Risk mitigation strategies

All technical decisions align with the goals of iterability, batch processing, and clean architecture.


Generated: 2024
Status: COMPLETE
Next: Product Vision Report