# Error Handling and Logging System
## Overview
The Trax platform implements a comprehensive error handling and logging system designed for production reliability, observability, and maintainability. This system provides structured logging, error classification, retry mechanisms, recovery strategies, and performance monitoring.
## Architecture
The error handling and logging system is organized into several key modules:
```
src/
├── logging/
│   ├── __init__.py        # Main logging interface
│   ├── config.py          # Logging configuration
│   ├── utils.py           # Logging utilities
│   └── metrics.py         # Performance metrics
├── errors/
│   ├── __init__.py        # Error system interface
│   ├── base.py            # Base error classes
│   ├── codes.py           # Error codes and categories
│   └── classification.py  # Error classification utilities
├── retry/
│   ├── __init__.py        # Retry system interface
│   ├── base.py            # Retry configuration and strategies
│   └── decorators.py      # Retry decorators
└── recovery/
    ├── __init__.py        # Recovery system interface
    ├── strategies/        # Recovery strategies
    ├── fallbacks/         # Fallback mechanisms
    └── state/             # State recovery
```
## Core Components
### 1. Structured Logging System
The logging system provides structured, contextual logging with file rotation and multiple output formats.
#### Key Features:
- **Structured JSON Logging**: All logs include contextual information (timestamp, module, correlation ID)
- **File Rotation**: Automatic log rotation based on size and time
- **Multiple Output Formats**: JSON for machine processing, human-readable for console
- **Performance Integration**: Built-in performance metrics collection
- **Debug Mode**: Verbose logging for development and troubleshooting
#### Usage:
```python
from src.logging import get_logger, initialize_logging
# Initialize logging system
initialize_logging()
# Get logger
logger = get_logger(__name__)
# Structured logging
logger.info("Processing started", extra={
"operation": "transcription",
"file_size": "15.2MB",
"correlation_id": "req-123"
})
```
#### Configuration:
```python
from src.logging import LoggingConfig
config = LoggingConfig(
    level="INFO",
    log_to_console=True,
    log_to_file=True,
    log_to_json=True,
    log_dir="logs",
    max_file_size=10 * 1024 * 1024,  # 10MB
    backup_count=5
)
```
### 2. Error Classification System
A hierarchical error classification system that provides standardized error handling across the application.
#### Error Hierarchy:
```
TraxError (base)
├── NetworkError
│   ├── ConnectionError
│   ├── TimeoutError
│   └── DNSResolutionError
├── APIError
│   ├── AuthenticationError
│   ├── RateLimitError
│   ├── QuotaExceededError
│   └── ServiceUnavailableError
├── FileSystemError
│   ├── FileNotFoundError
│   ├── PermissionError
│   ├── DiskSpaceError
│   └── CorruptedFileError
├── ValidationError
│   ├── InvalidInputError
│   ├── MissingRequiredFieldError
│   └── FormatError
├── ProcessingError
│   ├── TranscriptionError
│   ├── EnhancementError
│   └── MediaProcessingError
└── ConfigurationError
    ├── MissingConfigError
    ├── InvalidConfigError
    └── EnvironmentError
```
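Because the hierarchy is class-based, callers can handle an entire category with a single `except` clause instead of enumerating leaf types. A brief sketch (the handler bodies and the `api_client` object are illustrative):
```python
from src.errors import NetworkError, APIError
from src.logging import get_logger

logger = get_logger(__name__)

try:
    response = await api_client.call()
except NetworkError as e:
    # Also matches ConnectionError, TimeoutError, and DNSResolutionError subclasses
    logger.warning("Transient network failure", extra={"error_code": e.error_code})
    raise
except APIError as e:
    # Also matches AuthenticationError, RateLimitError, QuotaExceededError, ...
    logger.error("API rejected the request", extra={"error_code": e.error_code})
    raise
```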
#### Error Codes:
Each error includes a standardized error code for easy identification and handling:
- `TRAX-001`: Network connection failed
- `TRAX-002`: API authentication failed
- `TRAX-003`: File not found
- `TRAX-004`: Invalid input format
- `TRAX-005`: Processing timeout
- `TRAX-006`: Configuration error
- `TRAX-007`: Recovery failed
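These codes are exposed on the exceptions via an `error_code` attribute (asserted in the unit tests later in this document). A minimal sketch of how a category class might carry its code; the internals here are illustrative, not the actual `src/errors/base.py`:
```python
from typing import Optional

class TraxError(Exception):
    """Illustrative base: subclasses override error_code with their TRAX-XXX value."""
    error_code = "TRAX-000"

    def __init__(self, message: str, original_error: Optional[Exception] = None):
        super().__init__(message)
        self.original_error = original_error

class NetworkError(TraxError):
    error_code = "TRAX-001"  # Network connection failed
```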
#### Usage:
```python
from src.errors import (
    NetworkError, APIError, ValidationError,
    create_network_error, create_api_error
)

# Create specific errors
try:
    response = await api_client.call()
except ConnectionError as e:
    raise create_network_error("API connection failed", original_error=e)

# Error classification
from src.errors import classify_error, is_retryable_error

error = classify_error(exception)
if is_retryable_error(error):
    # Implement retry logic
    pass
```
### 3. Retry System
A robust retry system with exponential backoff, jitter, and circuit breaker patterns.
#### Features:
- **Multiple Strategies**: Exponential, linear, constant, and Fibonacci backoff
- **Jitter Support**: Prevents thundering herd problems
- **Circuit Breaker**: Prevents repeated calls to failing services
- **Async Support**: Full async/await compatibility
- **Error Classification**: Automatic retry based on error type
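The delay schedule these strategies produce can be illustrated with a short, standalone sketch (not the library's internals): exponential backoff doubles the delay on each attempt, caps it at a maximum, and adds a small random jitter to spread retries out.
```python
import random

def backoff_delay(attempt: int, initial_delay: float = 1.0,
                  max_delay: float = 30.0, jitter: float = 0.1) -> float:
    """Illustrative exponential backoff: 1s, 2s, 4s, ... capped at max_delay,
    plus up to `jitter` fraction of random noise to avoid thundering herds."""
    delay = min(initial_delay * (2 ** attempt), max_delay)
    return delay + random.uniform(0, delay * jitter)

# attempt 0 -> ~1s, attempt 1 -> ~2s, attempt 2 -> ~4s, attempt 5 -> capped near 30s
```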
#### Usage:
```python
from src.retry import retry, async_retry, RetryConfig
# Basic retry decorator
@retry(max_retries=3, initial_delay=1.0)
def api_call():
    return external_api.request()

# Async retry with custom config
@async_retry(RetryConfig(
    max_retries=5,
    initial_delay=0.5,
    max_delay=30.0,
    jitter=0.1
))
async def async_api_call():
    return await external_api.async_request()

# Context manager
from src.retry import RetryContext

async with RetryContext(max_retries=3) as retry_ctx:
    result = await retry_ctx.execute(api_function)
```
#### Circuit Breaker:
```python
from src.retry import CircuitBreaker
circuit_breaker = CircuitBreaker(
    failure_threshold=5,
    timeout=60.0,
    expected_exception=NetworkError
)
# Circuit breaker will open after 5 failures
# and allow one test request after 60 seconds
```
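The open/half-open/closed behavior described in the comments above can be expressed as a small state machine. The sketch below is a simplified illustration of the pattern, not the project's `CircuitBreaker` implementation:
```python
import time
from typing import Optional

class SimpleCircuitBreaker:
    """Illustrative only: opens after N consecutive failures, allows one probe after `timeout`."""

    def __init__(self, failure_threshold: int = 5, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.timeout:
                raise RuntimeError("Circuit open: refusing call")
            self.opened_at = None  # half-open: allow a single test request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```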
### 4. Recovery Strategies
A recovery system that provides fallback mechanisms and state recovery for different error scenarios.
#### Recovery Strategies:
- **Fallback Mechanisms**: Alternative service providers, cached responses
- **Graceful Degradation**: Reduce functionality when services are unavailable
- **State Recovery**: Resume interrupted operations from saved state
- **Transaction Rollback**: Automatic rollback of database operations
- **Resource Cleanup**: Automatic cleanup of temporary resources
- **Health Checks**: Proactive monitoring and recovery
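Resource cleanup, one of the strategies listed above, is often easiest to express as a context manager so temporary resources are released even when the operation fails. A minimal sketch; the `temporary_workspace` helper and the processing step are hypothetical:
```python
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def temporary_workspace():
    """Illustrative cleanup strategy: temp files are removed even if the operation fails."""
    workdir = tempfile.mkdtemp(prefix="trax-")
    try:
        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)

# with temporary_workspace() as workdir:
#     extract_audio(media_file, workdir)  # hypothetical processing step
```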
#### Usage:
```python
from src.recovery import (
    RecoveryManager, FallbackStrategy, StateRecoveryStrategy,
    create_fallback_strategy
)

# Create recovery manager
recovery_manager = RecoveryManager()

# Add fallback strategy
fallback_strategy = await create_fallback_strategy(
    primary_operation=whisper_transcribe,
    fallback_operations=[basic_transcribe, cached_transcribe]
)
recovery_manager.add_strategy(fallback_strategy)

# Attempt recovery
result = await recovery_manager.attempt_recovery(context)
```
#### Fallback Managers:
```python
from src.recovery import TranscriptionFallbackManager
# Specialized fallback manager for transcription
transcription_fallback = TranscriptionFallbackManager()
await transcription_fallback.add_whisper_fallback(whisper_service)
await transcription_fallback.add_cached_transcription_fallback(cache_store, cache_retrieve)
# Execute with fallbacks
result = await transcription_fallback.execute_with_fallbacks(transcribe_function, audio_file)
```
#### State Recovery:
```python
from src.recovery import StateRecoveryManager, operation_state_context
# Create state recovery manager
state_manager = StateRecoveryManager(storage)
# Track operation state
async with operation_state_context(
    state_manager, "transcription_123", "corr_456", "transcription"
) as state:
    # Operation is automatically tracked
    result = await transcribe_audio(audio_file)
    # State is automatically saved on completion

# Recover interrupted operations
interrupted_ops = await state_manager.list_interrupted_operations()
for op in interrupted_ops:
    recovered_state = await state_manager.recover_operation(op.operation_id)
```
### 5. Performance Metrics
A performance monitoring system that tracks operation timing, resource usage, and system health.
#### Features:
- **Operation Timing**: Measure execution time of operations
- **Resource Monitoring**: Track memory and CPU usage
- **System Health**: Periodic monitoring of system metrics
- **Threshold Alerts**: Configurable alerts for performance issues
- **Metrics Export**: JSON export for monitoring systems
#### Usage:
```python
from src.logging import (
    timing_context, async_timing_context,
    timing_decorator, async_timing_decorator,
    start_health_monitoring, export_all_metrics
)

# Context manager for timing
with timing_context("transcription_operation") as timer:
    result = transcribe_audio(audio_file)

# Async timing
async with async_timing_context("api_call"):
    response = await api_client.call()

# Decorator for automatic timing
@timing_decorator("file_processing")
def process_file(file_path):
    return process_large_file(file_path)

# Health monitoring
await start_health_monitoring(interval_seconds=60)

# Export metrics
metrics_json = export_all_metrics()
```
#### Performance Metrics Collected:
- Operation duration (milliseconds)
- Memory usage (MB)
- CPU usage (percentage)
- Success/failure rates
- Operation counters
- System health metrics (CPU, memory, disk usage)
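The exact export schema is defined by the metrics module; the record below is a hypothetical illustration of how the fields listed above might appear in the output of `export_all_metrics()`:
```python
# Hypothetical shape of a single exported metric record (field names illustrative).
metric_record = {
    "operation": "transcription_operation",
    "duration_ms": 1532.4,        # operation duration
    "memory_mb": 412.7,           # process memory at completion
    "cpu_percent": 63.0,          # CPU usage during the operation
    "success": True,              # feeds success/failure rates
    "count": 128,                 # operation counter
    "system": {"cpu_percent": 41.2, "memory_percent": 68.5, "disk_percent": 73.9},
}
```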
## Integration Patterns
### 1. Service Layer Integration
```python
import os

from src.logging import get_logger, timing_context
from src.errors import create_api_error, classify_error
from src.retry import async_retry
from src.recovery import fallback_context

logger = get_logger(__name__)

class TranscriptionService:
    @async_retry(max_retries=3)
    async def transcribe_audio(self, audio_file: str) -> str:
        with timing_context("transcription_operation") as timer:
            try:
                async with fallback_context(self.fallback_manager):
                    result = await self.whisper_client.transcribe(audio_file)
                    logger.info("Transcription completed", extra={
                        "duration_ms": timer.duration_ms,
                        "file_size": os.path.getsize(audio_file)
                    })
                    return result
            except Exception as e:
                error = classify_error(e)
                logger.error("Transcription failed", extra={
                    "error_code": error.error_code,
                    "error_type": type(e).__name__
                })
                raise create_api_error("Transcription service failed", original_error=e)
```
### 2. CLI Integration
```python
from src.logging import get_logger, initialize_logging
from src.errors import error_handler
logger = get_logger(__name__)
@error_handler
def main():
    initialize_logging()
    try:
        # CLI logic here
        logger.info("CLI operation started")
        # ... processing ...
        logger.info("CLI operation completed")
    except Exception as e:
        logger.error("CLI operation failed", exc_info=True)
        raise
```
### 3. Batch Processing Integration
```python
from typing import List

from src.logging import get_logger, timing_context
from src.recovery import StateRecoveryManager, operation_state_context
from src.errors import create_processing_error

logger = get_logger(__name__)

class BatchProcessor:
    def __init__(self):
        # storage: the state-persistence backend, provided by the application
        self.state_manager = StateRecoveryManager(storage)

    async def process_batch(self, files: List[str]):
        for file in files:
            async with operation_state_context(
                self.state_manager, f"batch_{file}", "batch_corr", "batch_processing"
            ) as state:
                try:
                    with timing_context("batch_file_processing"):
                        result = await self.process_file(file)
                    logger.info("File processed successfully", extra={
                        "file": file,
                        "result_size": len(result)
                    })
                except Exception as e:
                    logger.error("File processing failed", extra={
                        "file": file,
                        "error": str(e)
                    })
                    raise create_processing_error(f"Failed to process {file}", original_error=e)
```
## Configuration
### Environment Variables
```bash
# Logging configuration
TRAX_LOG_LEVEL=INFO
TRAX_LOG_TO_CONSOLE=true
TRAX_LOG_TO_FILE=true
TRAX_LOG_DIR=logs
TRAX_MAX_LOG_SIZE=10485760 # 10MB
TRAX_LOG_BACKUP_COUNT=5
# Error handling
TRAX_MAX_RETRIES=3
TRAX_INITIAL_RETRY_DELAY=1.0
TRAX_MAX_RETRY_DELAY=30.0
TRAX_RETRY_JITTER=0.1
# Performance monitoring
TRAX_HEALTH_MONITOR_INTERVAL=60
TRAX_METRICS_EXPORT_ENABLED=true
```
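How these variables are consumed is up to the configuration layer; the sketch below (the `logging_config_from_env` helper is hypothetical) maps the logging variables onto the `LoggingConfig` shown earlier:
```python
import os

from src.logging import LoggingConfig

def logging_config_from_env() -> LoggingConfig:
    """Illustrative mapping of TRAX_* environment variables onto LoggingConfig."""
    return LoggingConfig(
        level=os.getenv("TRAX_LOG_LEVEL", "INFO"),
        log_to_console=os.getenv("TRAX_LOG_TO_CONSOLE", "true").lower() == "true",
        log_to_file=os.getenv("TRAX_LOG_TO_FILE", "true").lower() == "true",
        log_dir=os.getenv("TRAX_LOG_DIR", "logs"),
        max_file_size=int(os.getenv("TRAX_MAX_LOG_SIZE", str(10 * 1024 * 1024))),
        backup_count=int(os.getenv("TRAX_LOG_BACKUP_COUNT", "5")),
    )
```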
### Configuration Files
```json
{
  "logging": {
    "level": "INFO",
    "log_to_console": true,
    "log_to_file": true,
    "log_to_json": true,
    "log_dir": "logs",
    "max_file_size": 10485760,
    "backup_count": 5
  },
  "retry": {
    "max_retries": 3,
    "initial_delay": 1.0,
    "max_delay": 30.0,
    "jitter": 0.1
  },
  "recovery": {
    "enabled": true,
    "max_fallbacks": 3,
    "timeout": 30.0
  },
  "metrics": {
    "enabled": true,
    "health_monitor_interval": 60,
    "export_enabled": true
  }
}
```
## Best Practices
### 1. Error Handling
- Always use specific error types from the error hierarchy
- Include contextual information in error messages
- Use error codes for consistent error identification
- Implement proper error recovery strategies
### 2. Logging
- Use structured logging with contextual information
- Include correlation IDs for request tracing
- Log at appropriate levels (DEBUG, INFO, WARNING, ERROR)
- Use performance metrics for slow operations
### 3. Retry Logic
- Only retry transient errors (network, temporary API failures)
- Use exponential backoff with jitter
- Implement circuit breakers for failing services
- Set appropriate retry limits
### 4. Recovery Strategies
- Implement fallback mechanisms for critical operations
- Use graceful degradation when possible
- Save operation state for recovery
- Clean up resources on failure
### 5. Performance Monitoring
- Monitor all critical operations
- Set appropriate thresholds for alerts
- Export metrics for external monitoring systems
- Use health checks for proactive monitoring
## Testing
### Unit Tests
```python
import pytest

from src.logging import get_logger, timing_context
from src.errors import NetworkError, create_network_error
from src.retry import retry

def test_error_classification():
    error = create_network_error("Connection failed")
    assert isinstance(error, NetworkError)
    assert error.error_code == "TRAX-001"

def test_retry_logic():
    @retry(max_retries=2)
    def failing_function():
        raise NetworkError("Test error")

    with pytest.raises(NetworkError):
        failing_function()

def test_logging():
    logger = get_logger("test")
    with timing_context("test_operation"):
        # Test operation
        pass
```
### Integration Tests
```python
from src.recovery import RecoveryManager
from src.logging import start_health_monitoring, stop_health_monitoring

async def test_recovery_strategies():
    recovery_manager = RecoveryManager()
    # Add test strategies
    # Test recovery scenarios

async def test_performance_monitoring():
    await start_health_monitoring(interval_seconds=1)
    # Perform operations
    # Verify metrics collection
    await stop_health_monitoring()
```
## Monitoring and Alerting
### Metrics Dashboard
- Operation success rates
- Response times
- Error rates by type
- Resource usage (CPU, memory, disk)
- System health status
### Alerts
- High error rates (>5%)
- Slow response times (>30s)
- High resource usage (>90%)
- Service unavailability
- Circuit breaker activations
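If these thresholds are enforced in application code rather than in an external monitoring system, a simple lookup is enough. The constant and helper below are hypothetical; the values mirror the list above:
```python
# Hypothetical alert thresholds mirroring the list above.
ALERT_THRESHOLDS = {
    "error_rate": 0.05,        # >5% errors
    "response_time_s": 30.0,   # >30s responses
    "resource_usage": 0.90,    # >90% CPU/memory/disk
}

def should_alert(metric: str, value: float) -> bool:
    """Return True when a metric exceeds its configured threshold."""
    threshold = ALERT_THRESHOLDS.get(metric)
    return threshold is not None and value > threshold
```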
### Log Analysis
- Error pattern analysis
- Performance bottleneck identification
- Security incident detection
- Usage pattern analysis
## Troubleshooting
### Common Issues
1. **High Memory Usage**
   - Check for memory leaks in long-running operations
   - Monitor memory usage in performance metrics
   - Implement proper resource cleanup
2. **Slow Response Times**
   - Use timing contexts to identify slow operations
   - Check for blocking operations
   - Implement caching where appropriate
3. **High Error Rates**
   - Check error logs for patterns
   - Verify external service availability
   - Review retry and recovery configurations
4. **Log File Issues**
   - Check disk space
   - Verify log rotation configuration
   - Review log level settings
### Debug Mode
```python
from src.logging import enable_debug
# Enable debug mode for detailed logging
enable_debug()
# Debug mode provides:
# - Detailed error stack traces
# - Performance timing for all operations
# - Verbose retry and recovery logs
# - Memory usage tracking
```
## Future Enhancements
1. **Distributed Tracing**: Integration with OpenTelemetry for distributed request tracing
2. **Advanced Metrics**: Custom business metrics and KPIs
3. **Machine Learning**: Anomaly detection for performance issues
4. **Security Logging**: Enhanced security event logging and monitoring
5. **Compliance**: GDPR and other compliance-related logging features