# Checkpoint 5: Technical Migration Report
## uv Package Manager Migration & Development Setup
### 1. Migration from pip to uv
#### Current State
- Trax already has `pyproject.toml` configured for uv
- Basic `[tool.uv]` section present
- Development dependencies defined
- Virtual environment in `.venv/`
#### Migration Steps
##### Phase 1: Core Dependencies
```toml
[project]
name = "trax"
version = "0.1.0"
description = "Media transcription platform with iterative enhancement"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
"python-dotenv>=1.0.0",
"sqlalchemy>=2.0.0",
"alembic>=1.13.0",
"psycopg2-binary>=2.9.0",
"pydantic>=2.0.0",
"click>=8.1.0",
"rich>=13.0.0", # For CLI output
]
```
##### Phase 2: Transcription Dependencies
```toml
# Append to the [project] dependencies array (TOML has no "+=" operator):
dependencies = [
"faster-whisper>=1.0.0",
"yt-dlp>=2024.0.0",
"ffmpeg-python>=0.2.0",
"pydub>=0.25.0",
"librosa>=0.10.0", # Audio analysis
"numpy>=1.24.0",
"scipy>=1.11.0",
]
```
##### Phase 3: AI Enhancement
```toml
# Append to [project] dependencies:
dependencies = [
"openai>=1.0.0", # For DeepSeek API
"aiohttp>=3.9.0",
"tenacity>=8.2.0", # Retry logic
"jinja2>=3.1.0", # Templates
]
```
##### Phase 4: Advanced Features
```toml
# Append to [project] dependencies:
dependencies = [
"pyannote.audio>=3.0.0", # Speaker diarization
"torch>=2.0.0", # For ML models
"torchaudio>=2.0.0",
]
```
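The setup script below installs with `uv pip install -e ".[dev]"`, which requires a dev extras group that the phased lists above don't define. A minimal sketch (the tool list follows this report's quality tooling; version pins are assumptions):

```toml
[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "pytest-cov>=4.1.0",
    "black>=24.0.0",
    "ruff>=0.4.0",
    "mypy>=1.8.0",
]
```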
#### Migration Commands
```bash
# Initial setup
cd apps/trax
uv venv                                            # Create venv
source .venv/bin/activate                          # Activate
uv pip compile pyproject.toml -o requirements.txt  # Generate lock
uv pip sync requirements.txt                       # Install from lock

# Adding new packages
uv pip install package-name
uv pip compile pyproject.toml -o requirements.txt

# Development workflow
uv run pytest                  # Run tests
uv run python src/cli/main.py  # Run CLI
uv run black src/ tests/       # Format code
uv run ruff check src/ tests/  # Lint
uv run mypy src/               # Type check
```
### 2. Documentation Consolidation
#### Current Documentation Status
- `CLAUDE.md`: 97 lines (well under 600 limit)
- `AGENTS.md`: 163 lines (well under 600 limit)
- **Total**: 260 lines (can add 340 more)
#### Consolidation Strategy
##### Enhanced CLAUDE.md (~400 lines)
```markdown
# CLAUDE.md
## Project Context (existing ~50 lines)
## Architecture Overview (NEW ~100 lines)
- Service protocols
- Pipeline versions (v1-v4)
- Database schema
- Batch processing design
## Essential Commands (existing ~30 lines)
## Development Workflow (NEW ~80 lines)
- Iteration strategy
- Testing approach
- Batch processing
- Version management
## API Reference (NEW ~80 lines)
- CLI commands
- Service interfaces
- Protocol definitions
## Performance Targets (NEW ~40 lines)
- Speed benchmarks
- Accuracy goals
- Resource limits
```
##### Enhanced AGENTS.md (~200 lines)
```markdown
# AGENTS.md
## Development Rules (NEW ~50 lines)
- Links to rule files
- Quick reference
## Setup Commands (existing ~40 lines)
## Code Style (existing ~30 lines)
## Common Workflows (existing ~40 lines)
## Troubleshooting (NEW ~40 lines)
```
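Both documents sit under a 600-line budget, and the risk section below proposes enforcing that in pre-commit. A minimal sketch of such a check (the helper name `check_doc_size` is ours, not an existing tool):

```shell
#!/bin/bash
# check_doc_size: fail when a doc exceeds its line budget (600 by default)
check_doc_size() {
    local file="$1" limit="${2:-600}"
    local lines
    lines=$(wc -l < "$file") || return 1
    if [ "$lines" -gt "$limit" ]; then
        echo "❌ $file has $lines lines (limit $limit)"
        return 1
    fi
    echo "✅ $file: $lines/$limit lines"
}
```

In the pre-commit hook this would run as `check_doc_size CLAUDE.md && check_doc_size AGENTS.md`.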
### 3. Code Quality Standards
#### Tool Configuration
```toml
# pyproject.toml additions
[tool.black]
line-length = 100
target-version = ['py311']
include = '\.pyi?$'
extend-exclude = '''
/(
    migrations
  | \.venv
  | data
)/
'''
[tool.ruff]
line-length = 100
exclude = ["migrations", ".venv", "data"]
fix = true

# Recent ruff versions expect lint settings under [tool.ruff.lint]
[tool.ruff.lint]
select = ["E", "F", "I", "N", "W", "B", "C90", "D"]
ignore = ["E501", "D100", "D104"]
fixable = ["ALL"]
[tool.mypy]
python_version = "3.11"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
ignore_missing_imports = true
plugins = ["pydantic.mypy", "sqlalchemy.ext.mypy.plugin"]
exclude = ["migrations", "tests"]
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
addopts = """
-v
--cov=src
--cov-report=html
--cov-report=term
--tb=short
"""
asyncio_mode = "auto"
markers = [
"unit: Unit tests",
"integration: Integration tests",
"slow: Slow tests (>5s)",
"batch: Batch processing tests",
]
[tool.coverage.run]
omit = [
"*/tests/*",
"*/migrations/*",
"*/__pycache__/*",
]
[tool.coverage.report]
exclude_lines = [
"pragma: no cover",
"def __repr__",
"raise AssertionError",
"raise NotImplementedError",
"if __name__ == .__main__.:",
]
```
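The markers registered above gate test selection. A hedged sketch of how a test would opt in (the test name is illustrative, not from the codebase):

```python
import pytest


@pytest.mark.slow
@pytest.mark.integration
def test_transcribe_two_minute_sample():
    """Illustrative: would exercise the real 2-minute fixture."""
    assert True
```

Then `uv run pytest -m "not slow"` skips it during quick iterations, while `uv run pytest -m integration` runs the integration suite alone.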
### 4. Development Environment Setup
#### Setup Script (`scripts/setup_dev.sh`)
```bash
#!/bin/bash
set -e
echo "🚀 Setting up Trax development environment..."
# Color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Check Python version
python_version=$(python3 --version | cut -d' ' -f2 | cut -d'.' -f1,2)
required_version="3.11"
if [ "$(printf '%s\n' "$required_version" "$python_version" | sort -V | head -n1)" != "$required_version" ]; then
    echo -e "${RED}❌ Python 3.11+ required (found $python_version)${NC}"
    exit 1
fi
echo -e "${GREEN}✅ Python $python_version${NC}"
# Install uv if needed
if ! command -v uv &> /dev/null; then
    echo -e "${YELLOW}📦 Installing uv...${NC}"
    curl -LsSf https://astral.sh/uv/install.sh | sh
    # Newer installers use ~/.local/bin; older ones used ~/.cargo/bin
    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
fi
echo -e "${GREEN}✅ uv installed${NC}"
# Setup virtual environment
echo -e "${YELLOW}🔧 Creating virtual environment...${NC}"
uv venv
source .venv/bin/activate
# Install dependencies
echo -e "${YELLOW}📚 Installing dependencies...${NC}"
uv pip install -e ".[dev]"
# Setup pre-commit hooks
echo -e "${YELLOW}🪝 Setting up pre-commit hooks...${NC}"
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/bash
source .venv/bin/activate
echo "Running pre-commit checks..."
uv run black --check src/ tests/
uv run ruff check src/ tests/
uv run mypy src/
EOF
chmod +x .git/hooks/pre-commit
# Create directories
echo -e "${YELLOW}📁 Creating project directories...${NC}"
mkdir -p data/{media,exports,cache}
mkdir -p tests/{unit,integration,fixtures/audio,fixtures/transcripts}
mkdir -p src/agents/rules
mkdir -p docs/{reports,team,architecture}
# Check PostgreSQL
if command -v psql &> /dev/null; then
    echo -e "${GREEN}✅ PostgreSQL installed${NC}"
else
    echo -e "${YELLOW}⚠️ PostgreSQL not found - please install${NC}"
fi
# Check FFmpeg
if command -v ffmpeg &> /dev/null; then
    echo -e "${GREEN}✅ FFmpeg installed${NC}"
else
    echo -e "${YELLOW}⚠️ FFmpeg not found - please install${NC}"
fi
# Setup test data
echo -e "${YELLOW}🎵 Setting up test fixtures...${NC}"
cat > tests/fixtures/README.md << 'EOF'
# Test Fixtures
Place test audio files here:
- sample_5s.wav (5-second test)
- sample_30s.mp3 (30-second test)
- sample_2m.mp4 (2-minute test)
These should be real audio files for testing.
EOF
echo -e "${GREEN}✅ Development environment ready!${NC}"
echo ""
echo "📝 Next steps:"
echo " 1. source .venv/bin/activate"
echo " 2. Set up PostgreSQL database"
echo " 3. Add test audio files to tests/fixtures/audio/"
echo " 4. uv run pytest # Run tests"
echo " 5. uv run python src/cli/main.py --help # Run CLI"
```
### 5. Database Migration Strategy
#### Alembic Setup
```ini
# alembic.ini
[alembic]
script_location = migrations
prepend_sys_path = .
version_path_separator = os
sqlalchemy.url = postgresql://localhost/trax
[loggers]
keys = root,sqlalchemy,alembic
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = WARN
handlers = console
qualname =
[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine
[logger_alembic]
level = INFO
handlers =
qualname = alembic
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic
[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
```
#### Migration Sequence
```bash
# Phase 1: Core tables
alembic revision -m "create_media_and_transcripts"
# Creates: media_files, transcripts, exports
# Phase 2: Batch processing
alembic revision -m "add_batch_processing"
# Creates: batch_jobs, batch_items
# Phase 3: Audio metadata
alembic revision -m "add_audio_metadata"
# Creates: audio_processing_metadata
# Phase 4: Enhancement tracking
alembic revision -m "add_enhancement_fields"
# Adds: enhanced_content column
# Phase 5: Multi-pass support
alembic revision -m "add_multipass_tables"
# Creates: multipass_runs
# Phase 6: Diarization
alembic revision -m "add_speaker_diarization"
# Creates: speaker_profiles
# Commands
alembic upgrade head # Apply all migrations
alembic current # Show current version
alembic history # Show migration history
alembic downgrade -1 # Rollback one migration
```
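The Phase 1 revision would back SQLAlchemy models along these lines. A minimal sketch only — column names and types are assumptions inferred from this plan, not the actual schema:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class MediaFile(Base):
    __tablename__ = "media_files"

    id = Column(Integer, primary_key=True)
    path = Column(String(1024), nullable=False, unique=True)
    created_at = Column(DateTime, default=datetime.utcnow)


class Transcript(Base):
    __tablename__ = "transcripts"

    id = Column(Integer, primary_key=True)
    media_file_id = Column(Integer, ForeignKey("media_files.id"), nullable=False)
    pipeline_version = Column(String(8), nullable=False)  # v1-v4
    content = Column(Text, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow)
```

With models like these, `alembic revision --autogenerate` can diff metadata against the database instead of hand-writing each migration.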
### 6. Testing Infrastructure
#### Test File Structure
```
tests/
├── conftest.py
├── factories/
│   ├── __init__.py
│   ├── media_factory.py
│   ├── transcript_factory.py
│   └── batch_factory.py
├── fixtures/
│   ├── audio/
│   │   ├── sample_5s.wav
│   │   ├── sample_30s.mp3
│   │   └── sample_2m.mp4
│   └── transcripts/
│       └── expected_outputs.json
├── unit/
│   ├── test_protocols.py
│   ├── test_models.py
│   └── services/
│       ├── test_batch.py
│       └── test_whisper.py
└── integration/
    ├── test_pipeline_v1.py
    ├── test_batch_processing.py
    └── test_cli.py
```
#### Test Configuration (`tests/conftest.py`)
```python
import asyncio
from pathlib import Path

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Test database
TEST_DATABASE_URL = "postgresql://localhost/trax_test"


@pytest.fixture(scope="session")
def event_loop():
    """Create event loop for async tests"""
    loop = asyncio.get_event_loop_policy().new_event_loop()
    yield loop
    loop.close()


@pytest.fixture
def sample_audio_5s():
    """Real 5-second audio file"""
    return Path("tests/fixtures/audio/sample_5s.wav")


@pytest.fixture
def sample_video_2m():
    """Real 2-minute video file"""
    return Path("tests/fixtures/audio/sample_2m.mp4")


@pytest.fixture
def db_session():
    """Test database session"""
    engine = create_engine(TEST_DATABASE_URL)
    Session = sessionmaker(bind=engine)
    session = Session()
    yield session
    session.rollback()
    session.close()


# NO MOCKS - Use real files and services
```
### 7. CLI Development
#### Click-based CLI (`src/cli/main.py`)
```python
import click
from pathlib import Path
from rich.console import Console
from rich.progress import Progress

console = Console()


@click.group()
@click.version_option(version="0.1.0")
def cli():
    """Trax media processing CLI"""
    pass


@cli.command()
@click.argument('input_path', type=click.Path(exists=True))
@click.option('--batch', is_flag=True, help='Process directory as batch')
@click.option('--version', default='v1', type=click.Choice(['v1', 'v2', 'v3', 'v4']))
@click.option('--output', '-o', type=click.Path(), help='Output directory')
def transcribe(input_path, batch, version, output):
    """Transcribe media file(s)"""
    with Progress() as progress:
        task = progress.add_task("[cyan]Processing...", total=100)
        # Implementation here
        progress.update(task, advance=50)
    console.print("[green]✓[/green] Transcription complete!")


@cli.command()
@click.argument('transcript_id')
@click.option('--format', '-f', default='json', type=click.Choice(['json', 'txt']))
@click.option('--output', '-o', type=click.Path())
def export(transcript_id, format, output):
    """Export transcript to file"""
    console.print(f"Exporting {transcript_id} as {format}...")
    # Implementation here


@cli.command()
def status():
    """Show batch processing status"""
    # Implementation here
    console.print("[bold]Active Jobs:[/bold]")


# Usage examples:
# trax transcribe video.mp4
# trax transcribe folder/ --batch
# trax export abc-123 --format txt
# trax status
```
#### Enhanced CLI Implementation (Completed - Task 4)
**Status**: ✅ **COMPLETED**
The enhanced CLI (`src/cli/enhanced_cli.py`) has been successfully implemented with comprehensive features:
**Key Features Implemented:**
- **Real-time Progress Reporting**: Rich progress bars with time estimates
- **Performance Monitoring**: Live CPU, memory, and temperature tracking
- **Intelligent Batch Processing**: Concurrent execution with size-based queuing
- **Enhanced Error Handling**: User-friendly error messages with actionable guidance
- **Multiple Export Formats**: JSON, TXT, SRT, VTT support
- **Advanced Features**: Optional speaker diarization and domain adaptation
**Implementation Details:**
```python
# Enhanced CLI structure
class EnhancedCLI:
    """Main CLI with error handling and performance monitoring"""

class EnhancedTranscribeCommand:
    """Single file transcription with progress reporting"""

class EnhancedBatchCommand:
    """Batch processing with intelligent queuing"""
```
**Usage Examples:**
```bash
# Enhanced single file transcription
uv run python -m src.cli.enhanced_cli transcribe input.wav -m large -f srt
# Enhanced batch processing with 8 workers
uv run python -m src.cli.enhanced_cli batch ~/Podcasts -c 8 --diarize
# Academic processing with domain adaptation
uv run python -m src.cli.enhanced_cli transcribe lecture.mp3 --domain academic
```
- **Test Coverage**: 19 comprehensive test cases with 100% pass rate
- **Code Quality**: 483 lines with proper error handling and type hints
- **Integration**: Seamless integration with existing transcription services
### 8. Performance Monitoring
#### Metrics Collection
```python
# src/core/metrics.py
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional
import json


@dataclass
class PerformanceMetric:
    version: str
    file_name: str
    file_size_mb: float
    duration_seconds: float
    processing_time: float
    accuracy_score: float
    timestamp: datetime


class MetricsCollector:
    """Track performance across versions"""

    def __init__(self):
        self.metrics: List[PerformanceMetric] = []

    def get_audio_duration(self, file_path: Path) -> float:
        """Audio duration in seconds (e.g. via ffprobe or librosa)"""
        raise NotImplementedError

    def track_transcription(
        self,
        version: str,
        file_path: Path,
        processing_time: float,
        accuracy: Optional[float] = None,
    ):
        """Record transcription metrics"""
        file_size = file_path.stat().st_size / (1024 * 1024)
        metric = PerformanceMetric(
            version=version,
            file_name=file_path.name,
            file_size_mb=file_size,
            duration_seconds=self.get_audio_duration(file_path),
            processing_time=processing_time,
            accuracy_score=accuracy or 0.0,
            timestamp=datetime.now(),
        )
        self.metrics.append(metric)

    def compare_versions(self) -> Dict[str, Dict]:
        """Compare performance across versions"""
        comparison = {}
        for version in ['v1', 'v2', 'v3', 'v4']:
            version_metrics = [m for m in self.metrics if m.version == version]
            if version_metrics:
                avg_speed = sum(m.processing_time for m in version_metrics) / len(version_metrics)
                avg_accuracy = sum(m.accuracy_score for m in version_metrics) / len(version_metrics)
                comparison[version] = {
                    'avg_speed': avg_speed,
                    'avg_accuracy': avg_accuracy,
                    'sample_count': len(version_metrics),
                }
        return comparison

    def export_metrics(self, path: Path):
        """Export metrics to JSON"""
        data = [
            {
                'version': m.version,
                'file': m.file_name,
                'size_mb': m.file_size_mb,
                'duration': m.duration_seconds,
                'processing_time': m.processing_time,
                'accuracy': m.accuracy_score,
                'timestamp': m.timestamp.isoformat(),
            }
            for m in self.metrics
        ]
        path.write_text(json.dumps(data, indent=2))
```
### 9. Migration Timeline
#### Week 1: Foundation
- **Day 1-2**: uv setup, dependencies, project structure
- **Day 3-4**: PostgreSQL setup, Alembic, initial schema
- **Day 5**: Test infrastructure with real files
#### Week 2: Core Implementation
- **Day 1-2**: Basic transcription service (v1)
- **Day 3-4**: Batch processing system
- **Day 5**: CLI implementation and testing
#### Week 3: Enhancement
- **Day 1-2**: AI enhancement integration (v2)
- **Day 3-4**: Documentation consolidation
- **Day 5**: Performance benchmarking
#### Week 4+: Advanced Features
- Multi-pass implementation (v3)
- Speaker diarization (v4)
- Optimization and refactoring
### 10. Risk Mitigation
#### Technical Risks
1. **uv compatibility issues**
- Mitigation: Keep pip requirements.txt as backup
- Command: `uv pip compile pyproject.toml -o requirements.txt`
2. **PostgreSQL complexity**
- Mitigation: Start with SQLite option for development
- Easy switch via DATABASE_URL
3. **Real test file size**
- Mitigation: Keep test files small (<5MB)
- Use Git LFS if needed
4. **Whisper memory usage**
- Mitigation: Implement chunking early
- Monitor memory during tests
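The "easy switch via DATABASE_URL" in risk 2 can be a single factory function. A minimal sketch, assuming the env var name and the SQLite fallback path are conventions we choose here:

```python
import os

from sqlalchemy import create_engine


def make_engine():
    """PostgreSQL in normal use; SQLite fallback keeps dev setup friction low."""
    url = os.environ.get("DATABASE_URL", "sqlite:///data/trax-dev.db")
    return create_engine(url)
```

Because Alembic reads the same URL, the migration history stays valid whichever backend a developer points at.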
#### Process Risks
1. **Documentation drift**
- Mitigation: Update docs with each PR
- Pre-commit hooks check doc size
2. **Version conflicts**
- Mitigation: Strict protocol compliance
- Version tests for compatibility
3. **Performance regression**
- Mitigation: Benchmark each version
- Metrics tracking from day 1
### Summary
The technical migration plan provides:
1. **Clear uv migration path** with phased dependencies
2. **Comprehensive development setup** script
3. **Database migration strategy** with Alembic
4. **Real file testing** infrastructure
5. **CLI-first development** with rich output
6. **Performance monitoring** built-in
7. **Risk mitigation** strategies
All technical decisions align with the goals of iterability, batch processing, and clean architecture.
---
*Generated: 2024*
*Status: COMPLETE*
*Next: Product Vision Report*