# Error Handling and Logging System
## Overview
The Trax platform implements a comprehensive error handling and logging system designed for production reliability, observability, and maintainability. This system provides structured logging, error classification, retry mechanisms, recovery strategies, and performance monitoring.
## Architecture
The error handling and logging system is organized into several key modules:
```
src/
├── logging/
│   ├── __init__.py        # Main logging interface
│   ├── config.py          # Logging configuration
│   ├── utils.py           # Logging utilities
│   └── metrics.py         # Performance metrics
├── errors/
│   ├── __init__.py        # Error system interface
│   ├── base.py            # Base error classes
│   ├── codes.py           # Error codes and categories
│   └── classification.py  # Error classification utilities
├── retry/
│   ├── __init__.py        # Retry system interface
│   ├── base.py            # Retry configuration and strategies
│   └── decorators.py      # Retry decorators
└── recovery/
    ├── __init__.py        # Recovery system interface
    ├── strategies/        # Recovery strategies
    ├── fallbacks/         # Fallback mechanisms
    └── state/             # State recovery
```
## Core Components
### 1. Structured Logging System
The logging system provides structured, contextual logging with file rotation and multiple output formats.
#### Key Features:
- **Structured JSON Logging**: All logs include contextual information (timestamp, module, correlation ID)
- **File Rotation**: Automatic log rotation based on size and time
- **Multiple Output Formats**: JSON for machine processing, human-readable for console
- **Performance Integration**: Built-in performance metrics collection
- **Debug Mode**: Verbose logging for development and troubleshooting
#### Usage:
```python
from src.logging import get_logger, initialize_logging
# Initialize logging system
initialize_logging()
# Get logger
logger = get_logger(__name__)
# Structured logging
logger.info("Processing started", extra={
"operation": "transcription",
"file_size": "15.2MB",
"correlation_id": "req-123"
})
```
#### Configuration:
```python
from src.logging import LoggingConfig
config = LoggingConfig(
    level="INFO",
    log_to_console=True,
    log_to_file=True,
    log_to_json=True,
    log_dir="logs",
    max_file_size=10 * 1024 * 1024,  # 10MB
    backup_count=5
)
```
### 2. Error Classification System
A hierarchical error classification system that provides standardized error handling across the application.
#### Error Hierarchy:
```
TraxError (base)
├── NetworkError
│   ├── ConnectionError
│   ├── TimeoutError
│   └── DNSResolutionError
├── APIError
│   ├── AuthenticationError
│   ├── RateLimitError
│   ├── QuotaExceededError
│   └── ServiceUnavailableError
├── FileSystemError
│   ├── FileNotFoundError
│   ├── PermissionError
│   ├── DiskSpaceError
│   └── CorruptedFileError
├── ValidationError
│   ├── InvalidInputError
│   ├── MissingRequiredFieldError
│   └── FormatError
├── ProcessingError
│   ├── TranscriptionError
│   ├── EnhancementError
│   └── MediaProcessingError
└── ConfigurationError
    ├── MissingConfigError
    ├── InvalidConfigError
    └── EnvironmentError
```
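Because the hierarchy is class-based, callers can handle an entire category with a single `except` clause instead of enumerating leaf types. A brief sketch (the handler bodies and the `api_client` object are illustrative):
```python
from src.errors import NetworkError, APIError
from src.logging import get_logger

logger = get_logger(__name__)

try:
    response = await api_client.call()
except NetworkError as e:
    # Also matches ConnectionError, TimeoutError, and DNSResolutionError subclasses
    logger.warning("Transient network failure", extra={"error_code": e.error_code})
    raise
except APIError as e:
    # Also matches AuthenticationError, RateLimitError, QuotaExceededError, ...
    logger.error("API rejected the request", extra={"error_code": e.error_code})
    raise
```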
#### Error Codes:
Each error includes a standardized error code for easy identification and handling:
- `TRAX-001`: Network connection failed
- `TRAX-002`: API authentication failed
- `TRAX-003`: File not found
- `TRAX-004`: Invalid input format
- `TRAX-005`: Processing timeout
- `TRAX-006`: Configuration error
- `TRAX-007`: Recovery failed
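These codes are exposed on the exceptions via an `error_code` attribute (asserted in the unit tests later in this document). A minimal sketch of how a category class might carry its code; the internals here are illustrative, not the actual `src/errors/base.py`:
```python
from typing import Optional

class TraxError(Exception):
    """Illustrative base: subclasses override error_code with their TRAX-XXX value."""
    error_code = "TRAX-000"

    def __init__(self, message: str, original_error: Optional[Exception] = None):
        super().__init__(message)
        self.original_error = original_error

class NetworkError(TraxError):
    error_code = "TRAX-001"  # Network connection failed
```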
#### Usage:
```python
from src.errors import (
    NetworkError, APIError, ValidationError,
    create_network_error, create_api_error
)

# Create specific errors
try:
    response = await api_client.call()
except ConnectionError as e:
    raise create_network_error("API connection failed", original_error=e)

# Error classification
from src.errors import classify_error, is_retryable_error

error = classify_error(exception)
if is_retryable_error(error):
    # Implement retry logic
    pass
```
### 3. Retry System
A robust retry system with exponential backoff, jitter, and circuit breaker patterns.
#### Features:
- **Multiple Strategies**: Exponential, linear, constant, and Fibonacci backoff
- **Jitter Support**: Prevents thundering herd problems
- **Circuit Breaker**: Prevents repeated calls to failing services
- **Async Support**: Full async/await compatibility
- **Error Classification**: Automatic retry based on error type
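The delay schedule these strategies produce can be illustrated with a short, standalone sketch (not the library's internals): exponential backoff doubles the delay on each attempt, caps it at a maximum, and adds a small random jitter to spread retries out.
```python
import random

def backoff_delay(attempt: int, initial_delay: float = 1.0,
                  max_delay: float = 30.0, jitter: float = 0.1) -> float:
    """Illustrative exponential backoff: 1s, 2s, 4s, ... capped at max_delay,
    plus up to `jitter` fraction of random noise to avoid thundering herds."""
    delay = min(initial_delay * (2 ** attempt), max_delay)
    return delay + random.uniform(0, delay * jitter)

# attempt 0 -> ~1s, attempt 1 -> ~2s, attempt 2 -> ~4s, attempt 5 -> capped near 30s
```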
#### Usage:
```python
from src.retry import retry, async_retry, RetryConfig
# Basic retry decorator
@retry(max_retries=3, initial_delay=1.0)
def api_call():
    return external_api.request()

# Async retry with custom config
@async_retry(RetryConfig(
    max_retries=5,
    initial_delay=0.5,
    max_delay=30.0,
    jitter=0.1
))
async def async_api_call():
    return await external_api.async_request()

# Context manager
from src.retry import RetryContext

async with RetryContext(max_retries=3) as retry_ctx:
    result = await retry_ctx.execute(api_function)
```
#### Circuit Breaker:
```python
from src.retry import CircuitBreaker
circuit_breaker = CircuitBreaker(
    failure_threshold=5,
    timeout=60.0,
    expected_exception=NetworkError
)
# Circuit breaker will open after 5 failures
# and allow one test request after 60 seconds
```
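The open/half-open/closed behavior described in the comments above can be expressed as a small state machine. The sketch below is a simplified illustration of the pattern, not the project's `CircuitBreaker` implementation:
```python
import time
from typing import Optional

class SimpleCircuitBreaker:
    """Illustrative only: opens after N consecutive failures, allows one probe after `timeout`."""

    def __init__(self, failure_threshold: int = 5, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.timeout:
                raise RuntimeError("Circuit open: refusing call")
            self.opened_at = None  # half-open: allow a single test request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```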
### 4. Recovery Strategies
A recovery system that provides fallback mechanisms and state recovery for different error scenarios.
#### Recovery Strategies:
- **Fallback Mechanisms**: Alternative service providers, cached responses
- **Graceful Degradation**: Reduce functionality when services are unavailable
- **State Recovery**: Resume interrupted operations from saved state
- **Transaction Rollback**: Automatic rollback of database operations
- **Resource Cleanup**: Automatic cleanup of temporary resources
- **Health Checks**: Proactive monitoring and recovery
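Resource cleanup, one of the strategies listed above, is often easiest to express as a context manager so temporary resources are released even when the operation fails. A minimal sketch; the `temporary_workspace` helper and the processing step are hypothetical:
```python
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def temporary_workspace():
    """Illustrative cleanup strategy: temp files are removed even if the operation fails."""
    workdir = tempfile.mkdtemp(prefix="trax-")
    try:
        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)

# with temporary_workspace() as workdir:
#     extract_audio(media_file, workdir)  # hypothetical processing step
```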
#### Usage:
```python
from src.recovery import (
    RecoveryManager, FallbackStrategy, StateRecoveryStrategy,
    create_fallback_strategy
)

# Create recovery manager
recovery_manager = RecoveryManager()

# Add fallback strategy
fallback_strategy = await create_fallback_strategy(
    primary_operation=whisper_transcribe,
    fallback_operations=[basic_transcribe, cached_transcribe]
)
recovery_manager.add_strategy(fallback_strategy)

# Attempt recovery
result = await recovery_manager.attempt_recovery(context)
```
#### Fallback Managers:
```python
from src.recovery import TranscriptionFallbackManager
# Specialized fallback manager for transcription
transcription_fallback = TranscriptionFallbackManager()
await transcription_fallback.add_whisper_fallback(whisper_service)
await transcription_fallback.add_cached_transcription_fallback(cache_store, cache_retrieve)
# Execute with fallbacks
result = await transcription_fallback.execute_with_fallbacks(transcribe_function, audio_file)
```
#### State Recovery:
```python
from src.recovery import StateRecoveryManager, operation_state_context
# Create state recovery manager
state_manager = StateRecoveryManager(storage)
# Track operation state
async with operation_state_context(
    state_manager, "transcription_123", "corr_456", "transcription"
) as state:
    # Operation is automatically tracked
    result = await transcribe_audio(audio_file)
    # State is automatically saved on completion

# Recover interrupted operations
interrupted_ops = await state_manager.list_interrupted_operations()
for op in interrupted_ops:
    recovered_state = await state_manager.recover_operation(op.operation_id)
```
### 5. Performance Metrics
A performance monitoring system that tracks operation timing, resource usage, and system health.
#### Features:
- **Operation Timing**: Measure execution time of operations
- **Resource Monitoring**: Track memory and CPU usage
- **System Health**: Periodic monitoring of system metrics
- **Threshold Alerts**: Configurable alerts for performance issues
- **Metrics Export**: JSON export for monitoring systems
#### Usage:
```python
from src.logging import (
    timing_context, async_timing_context,
    timing_decorator, async_timing_decorator,
    start_health_monitoring, export_all_metrics
)

# Context manager for timing
with timing_context("transcription_operation") as timer:
    result = transcribe_audio(audio_file)

# Async timing
async with async_timing_context("api_call"):
    response = await api_client.call()

# Decorator for automatic timing
@timing_decorator("file_processing")
def process_file(file_path):
    return process_large_file(file_path)

# Health monitoring
await start_health_monitoring(interval_seconds=60)

# Export metrics
metrics_json = export_all_metrics()
```
#### Performance Metrics Collected:
- Operation duration (milliseconds)
- Memory usage (MB)
- CPU usage (percentage)
- Success/failure rates
- Operation counters
- System health metrics (CPU, memory, disk usage)
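The exact export schema is defined by the metrics module; the record below is a hypothetical illustration of how the fields listed above might appear in the output of `export_all_metrics()`:
```python
# Hypothetical shape of a single exported metric record (field names illustrative).
metric_record = {
    "operation": "transcription_operation",
    "duration_ms": 1532.4,        # operation duration
    "memory_mb": 412.7,           # process memory at completion
    "cpu_percent": 63.0,          # CPU usage during the operation
    "success": True,              # feeds success/failure rates
    "count": 128,                 # operation counter
    "system": {"cpu_percent": 41.2, "memory_percent": 68.5, "disk_percent": 73.9},
}
```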
## Integration Patterns
### 1. Service Layer Integration
```python
import os

from src.logging import get_logger, timing_context
from src.errors import create_api_error, classify_error
from src.retry import async_retry
from src.recovery import fallback_context

logger = get_logger(__name__)

class TranscriptionService:
    @async_retry(max_retries=3)
    async def transcribe_audio(self, audio_file: str) -> str:
        with timing_context("transcription_operation") as timer:
            try:
                async with fallback_context(self.fallback_manager):
                    result = await self.whisper_client.transcribe(audio_file)
                    logger.info("Transcription completed", extra={
                        "duration_ms": timer.duration_ms,
                        "file_size": os.path.getsize(audio_file)
                    })
                    return result
            except Exception as e:
                error = classify_error(e)
                logger.error("Transcription failed", extra={
                    "error_code": error.error_code,
                    "error_type": type(e).__name__
                })
                raise create_api_error("Transcription service failed", original_error=e)
```
### 2. CLI Integration
```python
from src.logging import get_logger, initialize_logging
from src.errors import error_handler
logger = get_logger(__name__)
@error_handler
def main():
    initialize_logging()
    try:
        # CLI logic here
        logger.info("CLI operation started")
        # ... processing ...
        logger.info("CLI operation completed")
    except Exception as e:
        logger.error("CLI operation failed", exc_info=True)
        raise
```
### 3. Batch Processing Integration
```python
from typing import List

from src.logging import get_logger, timing_context
from src.recovery import StateRecoveryManager, operation_state_context
from src.errors import create_processing_error

logger = get_logger(__name__)

class BatchProcessor:
    def __init__(self):
        # storage: the state-persistence backend, provided by the application
        self.state_manager = StateRecoveryManager(storage)

    async def process_batch(self, files: List[str]):
        for file in files:
            async with operation_state_context(
                self.state_manager, f"batch_{file}", "batch_corr", "batch_processing"
            ) as state:
                try:
                    with timing_context("batch_file_processing"):
                        result = await self.process_file(file)
                    logger.info("File processed successfully", extra={
                        "file": file,
                        "result_size": len(result)
                    })
                except Exception as e:
                    logger.error("File processing failed", extra={
                        "file": file,
                        "error": str(e)
                    })
                    raise create_processing_error(f"Failed to process {file}", original_error=e)
```
## Configuration
### Environment Variables
```bash
# Logging configuration
TRAX_LOG_LEVEL=INFO
TRAX_LOG_TO_CONSOLE=true
TRAX_LOG_TO_FILE=true
TRAX_LOG_DIR=logs
TRAX_MAX_LOG_SIZE=10485760 # 10MB
TRAX_LOG_BACKUP_COUNT=5
# Error handling
TRAX_MAX_RETRIES=3
TRAX_INITIAL_RETRY_DELAY=1.0
TRAX_MAX_RETRY_DELAY=30.0
TRAX_RETRY_JITTER=0.1
# Performance monitoring
TRAX_HEALTH_MONITOR_INTERVAL=60
TRAX_METRICS_EXPORT_ENABLED=true
```
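How these variables are consumed is up to the configuration layer; the sketch below (the `logging_config_from_env` helper is hypothetical) maps the logging variables onto the `LoggingConfig` shown earlier:
```python
import os

from src.logging import LoggingConfig

def logging_config_from_env() -> LoggingConfig:
    """Illustrative mapping of TRAX_* environment variables onto LoggingConfig."""
    return LoggingConfig(
        level=os.getenv("TRAX_LOG_LEVEL", "INFO"),
        log_to_console=os.getenv("TRAX_LOG_TO_CONSOLE", "true").lower() == "true",
        log_to_file=os.getenv("TRAX_LOG_TO_FILE", "true").lower() == "true",
        log_dir=os.getenv("TRAX_LOG_DIR", "logs"),
        max_file_size=int(os.getenv("TRAX_MAX_LOG_SIZE", str(10 * 1024 * 1024))),
        backup_count=int(os.getenv("TRAX_LOG_BACKUP_COUNT", "5")),
    )
```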
### Configuration Files
```json
{
  "logging": {
    "level": "INFO",
    "log_to_console": true,
    "log_to_file": true,
    "log_to_json": true,
    "log_dir": "logs",
    "max_file_size": 10485760,
    "backup_count": 5
  },
  "retry": {
    "max_retries": 3,
    "initial_delay": 1.0,
    "max_delay": 30.0,
    "jitter": 0.1
  },
  "recovery": {
    "enabled": true,
    "max_fallbacks": 3,
    "timeout": 30.0
  },
  "metrics": {
    "enabled": true,
    "health_monitor_interval": 60,
    "export_enabled": true
  }
}
```
## Best Practices
### 1. Error Handling
- Always use specific error types from the error hierarchy
- Include contextual information in error messages
- Use error codes for consistent error identification
- Implement proper error recovery strategies
### 2. Logging
- Use structured logging with contextual information
- Include correlation IDs for request tracing
- Log at appropriate levels (DEBUG, INFO, WARNING, ERROR)
- Use performance metrics for slow operations
### 3. Retry Logic
- Only retry transient errors (network, temporary API failures)
- Use exponential backoff with jitter
- Implement circuit breakers for failing services
- Set appropriate retry limits
### 4. Recovery Strategies
- Implement fallback mechanisms for critical operations
- Use graceful degradation when possible
- Save operation state for recovery
- Clean up resources on failure
### 5. Performance Monitoring
- Monitor all critical operations
- Set appropriate thresholds for alerts
- Export metrics for external monitoring systems
- Use health checks for proactive monitoring
## Testing
### Unit Tests
```python
import pytest

from src.logging import get_logger, timing_context
from src.errors import NetworkError, create_network_error
from src.retry import retry

def test_error_classification():
    error = create_network_error("Connection failed")
    assert isinstance(error, NetworkError)
    assert error.error_code == "TRAX-001"

def test_retry_logic():
    @retry(max_retries=2)
    def failing_function():
        raise NetworkError("Test error")

    with pytest.raises(NetworkError):
        failing_function()

def test_logging():
    logger = get_logger("test")
    with timing_context("test_operation"):
        # Test operation
        pass
```
### Integration Tests
```python
from src.recovery import RecoveryManager
from src.logging import start_health_monitoring, stop_health_monitoring

async def test_recovery_strategies():
    recovery_manager = RecoveryManager()
    # Add test strategies
    # Test recovery scenarios

async def test_performance_monitoring():
    await start_health_monitoring(interval_seconds=1)
    # Perform operations
    # Verify metrics collection
    await stop_health_monitoring()
```
## Monitoring and Alerting
### Metrics Dashboard
- Operation success rates
- Response times
- Error rates by type
- Resource usage (CPU, memory, disk)
- System health status
### Alerts
- High error rates (>5%)
- Slow response times (>30s)
- High resource usage (>90%)
- Service unavailability
- Circuit breaker activations
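If these thresholds are enforced in application code rather than in an external monitoring system, a simple lookup is enough. The constant and helper below are hypothetical; the values mirror the list above:
```python
# Hypothetical alert thresholds mirroring the list above.
ALERT_THRESHOLDS = {
    "error_rate": 0.05,        # >5% errors
    "response_time_s": 30.0,   # >30s responses
    "resource_usage": 0.90,    # >90% CPU/memory/disk
}

def should_alert(metric: str, value: float) -> bool:
    """Return True when a metric exceeds its configured threshold."""
    threshold = ALERT_THRESHOLDS.get(metric)
    return threshold is not None and value > threshold
```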
### Log Analysis
- Error pattern analysis
- Performance bottleneck identification
- Security incident detection
- Usage pattern analysis
## Troubleshooting
### Common Issues
1. **High Memory Usage**
   - Check for memory leaks in long-running operations
   - Monitor memory usage in performance metrics
   - Implement proper resource cleanup
2. **Slow Response Times**
   - Use timing contexts to identify slow operations
   - Check for blocking operations
   - Implement caching where appropriate
3. **High Error Rates**
   - Check error logs for patterns
   - Verify external service availability
   - Review retry and recovery configurations
4. **Log File Issues**
   - Check disk space
   - Verify log rotation configuration
   - Review log level settings
### Debug Mode
```python
from src.logging import enable_debug
# Enable debug mode for detailed logging
enable_debug()
# Debug mode provides:
# - Detailed error stack traces
# - Performance timing for all operations
# - Verbose retry and recovery logs
# - Memory usage tracking
```
## Future Enhancements
1. **Distributed Tracing**: Integration with OpenTelemetry for distributed request tracing
2. **Advanced Metrics**: Custom business metrics and KPIs
3. **Machine Learning**: Anomaly detection for performance issues
4. **Security Logging**: Enhanced security event logging and monitoring
5. **Compliance**: GDPR and other compliance-related logging features