# Error Handling and Logging System

## Overview

The Trax platform implements a comprehensive error handling and logging system designed for production reliability, observability, and maintainability. This system provides structured logging, error classification, retry mechanisms, recovery strategies, and performance monitoring.

## Architecture

The error handling and logging system is organized into several key modules:

```
src/
├── logging/
│   ├── __init__.py          # Main logging interface
│   ├── config.py            # Logging configuration
│   ├── utils.py             # Logging utilities
│   └── metrics.py           # Performance metrics
├── errors/
│   ├── __init__.py          # Error system interface
│   ├── base.py              # Base error classes
│   ├── codes.py             # Error codes and categories
│   └── classification.py    # Error classification utilities
├── retry/
│   ├── __init__.py          # Retry system interface
│   ├── base.py              # Retry configuration and strategies
│   └── decorators.py        # Retry decorators
└── recovery/
    ├── __init__.py          # Recovery system interface
    ├── strategies/          # Recovery strategies
    ├── fallbacks/           # Fallback mechanisms
    └── state/               # State recovery
```

## Core Components

### 1. Structured Logging System

The logging system provides structured, contextual logging with file rotation and multiple output formats.

#### Key Features:
- **Structured JSON Logging**: All logs include contextual information (timestamp, module, correlation ID)
- **File Rotation**: Automatic log rotation based on size and time
- **Multiple Output Formats**: JSON for machine processing, human-readable for console
- **Performance Integration**: Built-in performance metrics collection
- **Debug Mode**: Verbose logging for development and troubleshooting

#### Usage:
```python
from src.logging import get_logger, initialize_logging

# Initialize logging system
initialize_logging()

# Get logger
logger = get_logger(__name__)

# Structured logging
logger.info("Processing started", extra={
    "operation": "transcription",
    "file_size": "15.2MB",
    "correlation_id": "req-123"
})
```

#### Configuration:
```python
from src.logging import LoggingConfig

config = LoggingConfig(
    level="INFO",
    log_to_console=True,
    log_to_file=True,
    log_to_json=True,
    log_dir="logs",
    max_file_size=10 * 1024 * 1024,  # 10MB
    backup_count=5
)
```
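
For reference, size-based rotation of this kind corresponds to what Python's standard `logging.handlers.RotatingFileHandler` provides. The sketch below shows how `max_file_size` and `backup_count` could map onto a stdlib handler; the helper name, log path, and format string are illustrative assumptions, not the actual Trax implementation:

```python
import logging
from logging.handlers import RotatingFileHandler

def build_rotating_handler(log_path: str = "logs/trax.log",
                           max_bytes: int = 10 * 1024 * 1024,
                           backups: int = 5) -> RotatingFileHandler:
    """Rotate the log file once it reaches max_bytes, keeping `backups` old files."""
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backups)
    # Human-readable console-style format; the JSON output path is separate.
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"
    ))
    return handler
```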

### 2. Error Classification System

A hierarchical error classification system that provides standardized error handling across the application.

#### Error Hierarchy:
```
TraxError (base)
├── NetworkError
│   ├── ConnectionError
│   ├── TimeoutError
│   └── DNSResolutionError
├── APIError
│   ├── AuthenticationError
│   ├── RateLimitError
│   ├── QuotaExceededError
│   └── ServiceUnavailableError
├── FileSystemError
│   ├── FileNotFoundError
│   ├── PermissionError
│   ├── DiskSpaceError
│   └── CorruptedFileError
├── ValidationError
│   ├── InvalidInputError
│   ├── MissingRequiredFieldError
│   └── FormatError
├── ProcessingError
│   ├── TranscriptionError
│   ├── EnhancementError
│   └── MediaProcessingError
└── ConfigurationError
    ├── MissingConfigError
    ├── InvalidConfigError
    └── EnvironmentError
```
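
For orientation, a minimal base class along these lines would carry the `error_code` and `original_error` fields used throughout this document. This is a sketch, not the actual `base.py` definition; the default code is a placeholder:

```python
from typing import Optional

class TraxError(Exception):
    """Sketch of the base error: a stable code plus the wrapped original exception."""

    def __init__(self, message: str, error_code: str = "TRAX-000",
                 original_error: Optional[Exception] = None):
        super().__init__(message)
        self.error_code = error_code          # e.g. "TRAX-001" (see codes below)
        self.original_error = original_error  # preserved for diagnostics

class NetworkError(TraxError):
    """Transient network failure; typically retryable."""
```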

#### Error Codes:
Each error includes a standardized error code for easy identification and handling:

- `TRAX-001`: Network connection failed
- `TRAX-002`: API authentication failed
- `TRAX-003`: File not found
- `TRAX-004`: Invalid input format
- `TRAX-005`: Processing timeout
- `TRAX-006`: Configuration error
- `TRAX-007`: Recovery failed
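
A `codes.py` along these lines could pin each code to a symbolic name in one place. The enum below is an illustrative sketch built from the list above; the real module may organize it differently:

```python
from enum import Enum

class ErrorCode(str, Enum):
    """Sketch of a central code registry; values match the list above."""
    NETWORK_CONNECTION_FAILED = "TRAX-001"
    API_AUTHENTICATION_FAILED = "TRAX-002"
    FILE_NOT_FOUND = "TRAX-003"
    INVALID_INPUT_FORMAT = "TRAX-004"
    PROCESSING_TIMEOUT = "TRAX-005"
    CONFIGURATION_ERROR = "TRAX-006"
    RECOVERY_FAILED = "TRAX-007"
```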

#### Usage:
```python
from src.errors import (
    NetworkError, APIError, ValidationError,
    create_network_error, create_api_error
)

# Create specific errors
try:
    response = await api_client.call()
except ConnectionError as e:
    raise create_network_error("API connection failed", original_error=e)

# Error classification
from src.errors import classify_error, is_retryable_error

error = classify_error(exception)
if is_retryable_error(error):
    # Implement retry logic
    pass
```

### 3. Retry System

A robust retry system with exponential backoff, jitter, and circuit breaker patterns.

#### Features:
- **Multiple Strategies**: Exponential, linear, constant, and Fibonacci backoff
- **Jitter Support**: Prevents thundering herd problems (see the delay sketch below)
- **Circuit Breaker**: Prevents repeated calls to failing services
- **Async Support**: Full async/await compatibility
- **Error Classification**: Automatic retry based on error type
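
For intuition, exponential backoff with jitter computes a delay roughly as follows. This is an illustrative sketch using the `RetryConfig` field names from the usage below, not the library's exact formula:

```python
import random

def backoff_delay(attempt: int, initial_delay: float = 1.0,
                  max_delay: float = 30.0, jitter: float = 0.1) -> float:
    """Exponential backoff: initial_delay * 2**attempt, capped, plus random jitter."""
    delay = min(initial_delay * (2 ** attempt), max_delay)
    # Jitter spreads retries out so many clients don't all retry at the same instant.
    return delay + random.uniform(0, jitter * delay)

# Attempts 0..4 with the defaults give roughly 1s, 2s, 4s, 8s, 16s (plus jitter).
```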

#### Usage:
```python
from src.retry import retry, async_retry, RetryConfig

# Basic retry decorator
@retry(max_retries=3, initial_delay=1.0)
def api_call():
    return external_api.request()

# Async retry with custom config
@async_retry(RetryConfig(
    max_retries=5,
    initial_delay=0.5,
    max_delay=30.0,
    jitter=0.1
))
async def async_api_call():
    return await external_api.async_request()

# Context manager
from src.retry import RetryContext

async with RetryContext(max_retries=3) as retry_ctx:
    result = await retry_ctx.execute(api_function)
```

#### Circuit Breaker:
```python
from src.retry import CircuitBreaker

circuit_breaker = CircuitBreaker(
    failure_threshold=5,
    timeout=60.0,
    expected_exception=NetworkError
)

# The circuit breaker will open after 5 failures
# and allow one test request after 60 seconds.
```

### 4. Recovery Strategies

A comprehensive recovery system that provides fallback mechanisms and state recovery for different error scenarios.

#### Recovery Strategies:
- **Fallback Mechanisms**: Alternative service providers, cached responses
- **Graceful Degradation**: Reduce functionality when services are unavailable
- **State Recovery**: Resume interrupted operations from saved state
- **Transaction Rollback**: Automatic rollback of database operations
- **Resource Cleanup**: Automatic cleanup of temporary resources
- **Health Checks**: Proactive monitoring and recovery

#### Usage:
```python
from src.recovery import (
    RecoveryManager, FallbackStrategy, StateRecoveryStrategy,
    create_fallback_strategy
)

# Create recovery manager
recovery_manager = RecoveryManager()

# Add fallback strategy
fallback_strategy = await create_fallback_strategy(
    primary_operation=whisper_transcribe,
    fallback_operations=[basic_transcribe, cached_transcribe]
)
recovery_manager.add_strategy(fallback_strategy)

# Attempt recovery
result = await recovery_manager.attempt_recovery(context)
```

#### Fallback Managers:
```python
from src.recovery import TranscriptionFallbackManager

# Specialized fallback manager for transcription
transcription_fallback = TranscriptionFallbackManager()
await transcription_fallback.add_whisper_fallback(whisper_service)
await transcription_fallback.add_cached_transcription_fallback(cache_store, cache_retrieve)

# Execute with fallbacks
result = await transcription_fallback.execute_with_fallbacks(transcribe_function, audio_file)
```

#### State Recovery:
```python
from src.recovery import StateRecoveryManager, operation_state_context

# Create state recovery manager
state_manager = StateRecoveryManager(storage)

# Track operation state
async with operation_state_context(
    state_manager, "transcription_123", "corr_456", "transcription"
) as state:
    # Operation is automatically tracked
    result = await transcribe_audio(audio_file)
    # State is automatically saved on completion

# Recover interrupted operations
interrupted_ops = await state_manager.list_interrupted_operations()
for op in interrupted_ops:
    recovered_state = await state_manager.recover_operation(op.operation_id)
```

### 5. Performance Metrics

A comprehensive performance monitoring system that tracks operation timing, resource usage, and system health.

#### Features:
- **Operation Timing**: Measure execution time of operations
- **Resource Monitoring**: Track memory and CPU usage
- **System Health**: Periodic monitoring of system metrics
- **Threshold Alerts**: Configurable alerts for performance issues
- **Metrics Export**: JSON export for monitoring systems (see the snapshot sketch below)

#### Usage:
```python
from src.logging import (
    timing_context, async_timing_context,
    timing_decorator, async_timing_decorator,
    start_health_monitoring, export_all_metrics
)

# Context manager for timing
with timing_context("transcription_operation") as timer:
    result = transcribe_audio(audio_file)

# Async timing
async with async_timing_context("api_call"):
    response = await api_client.call()

# Decorator for automatic timing
@timing_decorator("file_processing")
def process_file(file_path):
    return process_large_file(file_path)

# Health monitoring
await start_health_monitoring(interval_seconds=60)

# Export metrics
metrics_json = export_all_metrics()
```

#### Performance Metrics Collected:
- Operation duration (milliseconds)
- Memory usage (MB)
- CPU usage (percentage)
- Success/failure rates
- Operation counters
- System health metrics (CPU, memory, disk usage)
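
One way to hand these metrics to an external system is to persist a snapshot. The sketch below assumes `export_all_metrics()` returns a JSON string, as the usage above suggests; the file path and scraping workflow are hypothetical:

```python
import json
from pathlib import Path

from src.logging import export_all_metrics

# Persist a snapshot for an external monitoring agent to pick up.
snapshot = export_all_metrics()
Path("logs/metrics_snapshot.json").write_text(snapshot)

# Or inspect individual entries in-process.
parsed = json.loads(snapshot)
for name, value in parsed.items():
    print(f"{name}: {value}")
```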

## Integration Patterns

### 1. Service Layer Integration

```python
import os

from src.logging import get_logger, timing_context
from src.errors import create_api_error, classify_error
from src.retry import async_retry
from src.recovery import fallback_context

logger = get_logger(__name__)

class TranscriptionService:
    @async_retry(max_retries=3)
    async def transcribe_audio(self, audio_file: str) -> str:
        with timing_context("transcription_operation") as timer:
            try:
                async with fallback_context(self.fallback_manager):
                    result = await self.whisper_client.transcribe(audio_file)
                    logger.info("Transcription completed", extra={
                        "duration_ms": timer.duration_ms,
                        "file_size": os.path.getsize(audio_file)
                    })
                    return result
            except Exception as e:
                error = classify_error(e)
                logger.error("Transcription failed", extra={
                    "error_code": error.error_code,
                    "error_type": type(e).__name__
                })
                raise create_api_error("Transcription service failed", original_error=e)
```

### 2. CLI Integration

```python
from src.logging import get_logger, initialize_logging
from src.errors import error_handler

logger = get_logger(__name__)

@error_handler
def main():
    initialize_logging()

    try:
        # CLI logic here
        logger.info("CLI operation started")
        # ... processing ...
        logger.info("CLI operation completed")
    except Exception as e:
        logger.error("CLI operation failed", exc_info=True)
        raise
```

### 3. Batch Processing Integration

```python
from typing import List

from src.logging import get_logger, timing_context
from src.recovery import StateRecoveryManager, operation_state_context
from src.errors import create_processing_error

logger = get_logger(__name__)

class BatchProcessor:
    def __init__(self, storage):
        self.state_manager = StateRecoveryManager(storage)

    async def process_batch(self, files: List[str]):
        for file in files:
            async with operation_state_context(
                self.state_manager, f"batch_{file}", "batch_corr", "batch_processing"
            ) as state:
                try:
                    with timing_context("batch_file_processing"):
                        result = await self.process_file(file)
                    logger.info("File processed successfully", extra={
                        "file": file,
                        "result_size": len(result)
                    })
                except Exception as e:
                    logger.error("File processing failed", extra={
                        "file": file,
                        "error": str(e)
                    })
                    raise create_processing_error(f"Failed to process {file}", original_error=e)
```

## Configuration

### Environment Variables

```bash
# Logging configuration
TRAX_LOG_LEVEL=INFO
TRAX_LOG_TO_CONSOLE=true
TRAX_LOG_TO_FILE=true
TRAX_LOG_DIR=logs
TRAX_MAX_LOG_SIZE=10485760  # 10MB
TRAX_LOG_BACKUP_COUNT=5

# Error handling
TRAX_MAX_RETRIES=3
TRAX_INITIAL_RETRY_DELAY=1.0
TRAX_MAX_RETRY_DELAY=30.0
TRAX_RETRY_JITTER=0.1

# Performance monitoring
TRAX_HEALTH_MONITOR_INTERVAL=60
TRAX_METRICS_EXPORT_ENABLED=true
```
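
As a sketch of how these variables could feed the `LoggingConfig` shown earlier (the loader function is hypothetical; only variables listed above are read):

```python
import os

from src.logging import LoggingConfig

def config_from_env() -> LoggingConfig:
    """Illustrative loader: build a LoggingConfig from TRAX_* environment variables."""
    return LoggingConfig(
        level=os.getenv("TRAX_LOG_LEVEL", "INFO"),
        log_to_console=os.getenv("TRAX_LOG_TO_CONSOLE", "true") == "true",
        log_to_file=os.getenv("TRAX_LOG_TO_FILE", "true") == "true",
        log_dir=os.getenv("TRAX_LOG_DIR", "logs"),
        max_file_size=int(os.getenv("TRAX_MAX_LOG_SIZE", str(10 * 1024 * 1024))),
        backup_count=int(os.getenv("TRAX_LOG_BACKUP_COUNT", "5")),
    )
```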

### Configuration Files

```json
{
  "logging": {
    "level": "INFO",
    "log_to_console": true,
    "log_to_file": true,
    "log_to_json": true,
    "log_dir": "logs",
    "max_file_size": 10485760,
    "backup_count": 5
  },
  "retry": {
    "max_retries": 3,
    "initial_delay": 1.0,
    "max_delay": 30.0,
    "jitter": 0.1
  },
  "recovery": {
    "enabled": true,
    "max_fallbacks": 3,
    "timeout": 30.0
  },
  "metrics": {
    "enabled": true,
    "health_monitor_interval": 60,
    "export_enabled": true
  }
}
```

## Best Practices

### 1. Error Handling
- Always use specific error types from the error hierarchy
- Include contextual information in error messages
- Use error codes for consistent error identification
- Implement proper error recovery strategies

### 2. Logging
- Use structured logging with contextual information
- Include correlation IDs for request tracing (see the sketch below)
- Log at appropriate levels (DEBUG, INFO, WARNING, ERROR)
- Use performance metrics for slow operations
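
As an illustration of correlation IDs, each request can mint one ID and attach it to every record it logs, so all lines for a request can be joined later. This is a hypothetical pattern; `handle_request` is not part of the Trax API:

```python
import uuid

from src.logging import get_logger

logger = get_logger(__name__)

def handle_request(payload: dict) -> None:
    # One ID per request, attached to every log record for that request.
    correlation_id = f"req-{uuid.uuid4().hex[:8]}"
    logger.info("Request received", extra={"correlation_id": correlation_id})
    # ... processing ...
    logger.info("Request finished", extra={"correlation_id": correlation_id})
```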

### 3. Retry Logic
- Only retry transient errors (network failures, temporary API errors)
- Use exponential backoff with jitter
- Implement circuit breakers for failing services
- Set appropriate retry limits

### 4. Recovery Strategies
- Implement fallback mechanisms for critical operations
- Use graceful degradation when possible
- Save operation state for recovery
- Clean up resources on failure

### 5. Performance Monitoring
- Monitor all critical operations
- Set appropriate thresholds for alerts
- Export metrics for external monitoring systems
- Use health checks for proactive monitoring

## Testing

### Unit Tests
```python
import pytest

from src.logging import get_logger, timing_context
from src.errors import NetworkError, create_network_error
from src.retry import retry

def test_error_classification():
    error = create_network_error("Connection failed")
    assert isinstance(error, NetworkError)
    assert error.error_code == "TRAX-001"

def test_retry_logic():
    @retry(max_retries=2)
    def failing_function():
        raise NetworkError("Test error")

    with pytest.raises(NetworkError):
        failing_function()

def test_logging():
    logger = get_logger("test")
    with timing_context("test_operation"):
        # Test operation
        pass
```

### Integration Tests
```python
# stop_health_monitoring is assumed to live alongside start_health_monitoring
from src.logging import start_health_monitoring, stop_health_monitoring
from src.recovery import RecoveryManager

async def test_recovery_strategies():
    recovery_manager = RecoveryManager()
    # Add test strategies
    # Test recovery scenarios
    ...

async def test_performance_monitoring():
    await start_health_monitoring(interval_seconds=1)
    # Perform operations
    # Verify metrics collection
    await stop_health_monitoring()
```

## Monitoring and Alerting

### Metrics Dashboard
- Operation success rates
- Response times
- Error rates by type
- Resource usage (CPU, memory, disk)
- System health status

### Alerts
- High error rates (>5%)
- Slow response times (>30s)
- High resource usage (>90%)
- Service unavailability
- Circuit breaker activations

### Log Analysis
- Error pattern analysis
- Performance bottleneck identification
- Security incident detection
- Usage pattern analysis

## Troubleshooting

### Common Issues

1. **High Memory Usage**
   - Check for memory leaks in long-running operations
   - Monitor memory usage in performance metrics
   - Implement proper resource cleanup

2. **Slow Response Times**
   - Use timing contexts to identify slow operations
   - Check for blocking operations
   - Implement caching where appropriate

3. **High Error Rates**
   - Check error logs for patterns
   - Verify external service availability
   - Review retry and recovery configurations

4. **Log File Issues**
   - Check disk space
   - Verify log rotation configuration
   - Review log level settings

### Debug Mode
```python
from src.logging import enable_debug

# Enable debug mode for detailed logging
enable_debug()

# Debug mode provides:
# - Detailed error stack traces
# - Performance timing for all operations
# - Verbose retry and recovery logs
# - Memory usage tracking
```

## Future Enhancements

1. **Distributed Tracing**: Integration with OpenTelemetry for distributed request tracing
2. **Advanced Metrics**: Custom business metrics and KPIs
3. **Machine Learning**: Anomaly detection for performance issues
4. **Security Logging**: Enhanced security event logging and monitoring
5. **Compliance**: GDPR and other compliance-related logging features