# Batch Processing System
The Trax batch processing system provides high-performance parallel processing for multiple media files with comprehensive error handling, progress tracking, and resource monitoring.
## Overview
The batch processing system is designed to handle large volumes of audio/video files efficiently while providing real-time feedback and robust error recovery. It's optimized for M3 MacBook performance with configurable worker pools and intelligent resource management.
## Key Features
### Core Capabilities
- **Parallel Processing**: Configurable worker pool (default: 8 workers for M3 MacBook)
- **Priority Queue**: Task prioritization with automatic retry mechanism
- **Real-time Progress**: 5-second interval progress reporting with resource monitoring
- **Error Recovery**: Automatic retry with exponential backoff
- **Pause/Resume**: User control over processing operations
- **Resource Monitoring**: Memory and CPU usage tracking with configurable limits
- **Quality Metrics**: Comprehensive reporting with accuracy and quality warnings
### Supported Task Types
- **Transcription**: Audio/video to text using Whisper API
- **Enhancement**: AI-powered transcript improvement using DeepSeek
- **YouTube**: Metadata extraction from YouTube URLs
- **Download**: Media file downloading and preprocessing
- **Preprocessing**: Audio format conversion and optimization
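The task types above correspond to the `TaskType` enum used throughout the examples in this document; a minimal sketch (only the member names appear in the real code samples below, the string values are illustrative):

```python
from enum import Enum

class TaskType(Enum):
    """One member per supported batch task type."""
    TRANSCRIBE = "transcribe"   # Audio/video to text (Whisper API)
    ENHANCE = "enhance"         # Transcript improvement (DeepSeek)
    YOUTUBE = "youtube"         # Metadata extraction from YouTube URLs
    DOWNLOAD = "download"       # Media file downloading
    PREPROCESS = "preprocess"   # Audio conversion and optimization
```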
## Architecture
### Components
#### BatchProcessor
The main orchestrator that manages the entire batch processing workflow.
```python
from src.services.batch_processor import create_batch_processor

# Create processor with custom settings
processor = create_batch_processor(
    max_workers=8,            # Number of parallel workers
    queue_size=1000,          # Maximum queue size
    progress_interval=5.0,    # Progress reporting interval (seconds)
    memory_limit_mb=2048,     # Memory limit in MB
    cpu_limit_percent=90,     # CPU usage limit (percent)
)
```
#### BatchTask
Represents individual tasks in the processing queue.
```python
from src.services.batch_processor import BatchTask, TaskType

task = BatchTask(
    id="task_1_transcribe",
    task_type=TaskType.TRANSCRIBE,
    data={"file_path": "/path/to/audio.mp3"},
    priority=0,       # Lower = higher priority
    max_retries=3,    # Maximum retry attempts
)
```
#### BatchProgress
Tracks real-time processing progress and resource usage.
```python
from src.services.batch_processor import BatchProgress

progress = BatchProgress(total_tasks=100)
print(f"Success Rate: {progress.success_rate:.1f}%")
print(f"Memory Usage: {progress.memory_usage_mb:.1f}MB")
print(f"CPU Usage: {progress.cpu_usage_percent:.1f}%")
```
#### BatchResult
Comprehensive results summary with quality metrics.
```python
from src.services.batch_processor import BatchResult

result = BatchResult(
    success_count=95,
    failure_count=5,
    total_count=100,
    processing_time=120.5,
    memory_peak_mb=512.0,
    cpu_peak_percent=75.0,
    quality_metrics={"avg_accuracy": 95.5},
)
```
## Usage
### Basic Batch Processing
```bash
# Process a folder of audio files
trax batch /path/to/audio/files
# Process with custom settings
trax batch /path/to/files --workers 4 --memory-limit 1024
# Process with enhancement
trax batch /path/to/files --enhance --progress-interval 2
```
### Programmatic Usage
```python
import asyncio
from src.services.batch_processor import create_batch_processor, TaskType

async def process_files(audio_files):
    # Create batch processor
    processor = create_batch_processor(max_workers=4)

    # Add transcription tasks
    for file_path in audio_files:
        await processor.add_task(
            TaskType.TRANSCRIBE,
            {"file_path": str(file_path)},
            priority=0,
        )

    # Progress callback
    def progress_callback(progress):
        print(f"Progress: {progress.completed_tasks}/{progress.total_tasks}")
        print(f"Memory: {progress.memory_usage_mb:.1f}MB")

    # Start processing
    result = await processor.start(progress_callback=progress_callback)

    # Display results
    print(f"Success: {result.success_count}/{result.total_count}")
    print(f"Processing time: {result.processing_time:.1f}s")

# Run the batch processing (audio_files: iterable of media file paths)
asyncio.run(process_files(audio_files))
```
### CLI Options
```bash
trax batch <folder> [OPTIONS]

Options:
  --workers INTEGER           Number of worker processes (default: 8)
  --progress-interval FLOAT   Progress update interval in seconds (default: 5.0)
  --memory-limit INTEGER      Memory limit in MB (default: 2048)
  --cpu-limit INTEGER         CPU usage limit percentage (default: 90)
  --model TEXT                Whisper model to use (default: whisper-1)
  --language TEXT             Language code (auto-detect if not specified)
  --chunk-size INTEGER        Chunk size in seconds for long files (default: 600)
  --enhance                   Also enhance transcripts after transcription
```
## Error Handling
### Automatic Retry
- Failed tasks are automatically retried up to 3 times by default
- Retry attempts use exponential backoff with priority degradation
- Permanent failures are tracked separately for reporting
### Error Recovery Process
1. Task fails during processing
2. Error is captured and logged
3. Retry count is incremented
4. If retries remain, the task is re-queued with lower priority
5. If the retry limit is exceeded, the task is marked as permanently failed
6. Failed tasks are tracked separately for reporting
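The recovery loop above can be sketched as follows (`run_with_retry` and its parameters are hypothetical; the real processor also lowers the task's priority on each re-queue rather than sleeping in place):

```python
import asyncio
import random

async def run_with_retry(task_fn, *, max_retries=3, base_delay=1.0):
    """Run task_fn, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await task_fn()
        except Exception:
            if attempt == max_retries:
                raise  # permanently failed; caller records it for reporting
            # Exponential backoff with a little jitter: ~1s, ~2s, ~4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```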
### Error Types Handled
- **Network Errors**: API timeouts, connection failures
- **File Errors**: Missing files, permission issues
- **Processing Errors**: Audio format issues, API rate limits
- **Resource Errors**: Memory exhaustion, CPU overload
## Performance Optimization
### M3 MacBook Optimization
- Default 8 workers optimized for M3 architecture
- Memory and CPU monitoring with configurable limits
- Async processing throughout for non-blocking operations
- Intelligent caching for expensive operations
### Resource Management
- Real-time memory and CPU usage monitoring
- Configurable resource limits to prevent system overload
- Automatic worker scaling based on resource availability
- Graceful degradation under high load
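The scaling policy can be sketched as a pure function over measured usage; the thresholds and the one-step adjustment below are assumptions for illustration, not the shipped policy:

```python
def allowed_workers(active_workers: int, memory_mb: float, cpu_percent: float,
                    memory_limit_mb: float = 2048,
                    cpu_limit_percent: float = 90,
                    max_workers: int = 8) -> int:
    """Throttle the pool near the limits, recover when comfortably under.

    - Over either limit: shed one worker (graceful degradation).
    - Under 80% of both limits: add one worker, capped at max_workers.
    - Otherwise: hold steady.
    """
    if memory_mb >= memory_limit_mb or cpu_percent >= cpu_limit_percent:
        return max(1, active_workers - 1)
    if memory_mb < 0.8 * memory_limit_mb and cpu_percent < 0.8 * cpu_limit_percent:
        return min(max_workers, active_workers + 1)
    return active_workers
```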
### Performance Benchmarks
- **Transcription**: 95%+ accuracy, <30s for 5-minute audio
- **Enhancement**: 99%+ accuracy, <35s processing time
- **Batch Processing**: Parallel processing with configurable workers
- **Resource Usage**: <2GB memory, optimized for M3 architecture
## Progress Tracking
### Real-time Monitoring
- Progress updates every 5 seconds (configurable)
- Resource usage tracking (memory, CPU)
- Active worker count monitoring
- Estimated completion time calculation
### Progress Metrics
- Total tasks, completed tasks, failed tasks
- Success rate and failure rate percentages
- Processing time and resource usage peaks
- Quality metrics by task type
### CLI Progress Display
```
Progress: 45/100 (45.0% success) | Active: 8 | Failed: 2 | Memory: 512.3MB | CPU: 75.2%
```
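A sketch of how such a status line might be assembled (`format_progress_line` is hypothetical; the field order matches the display above):

```python
def format_progress_line(completed: int, total: int, success_rate: float,
                         active: int, failed: int,
                         memory_mb: float, cpu_percent: float) -> str:
    """Render one progress line in the CLI display format."""
    return (f"Progress: {completed}/{total} ({success_rate:.1f}% success) | "
            f"Active: {active} | Failed: {failed} | "
            f"Memory: {memory_mb:.1f}MB | CPU: {cpu_percent:.1f}%")
```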
## Quality Metrics
### Transcription Quality
- Average accuracy across all transcription tasks
- Quality warnings for low-confidence segments
- Processing time and efficiency metrics
- Error rate and recovery statistics
### Enhancement Quality
- Average accuracy improvement across enhancement tasks
- Content preservation validation
- Processing time and efficiency metrics
- Quality validation results
### Overall Metrics
- Success rate and failure rate percentages
- Processing time and resource usage peaks
- Quality warnings aggregation and deduplication
- Detailed failure information
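Warning aggregation and deduplication can be sketched as an order-preserving merge of the per-task warning lists (helper name hypothetical):

```python
def aggregate_warnings(per_task_warnings):
    """Merge per-task warning lists, dropping duplicates while keeping
    first-seen order (dicts preserve insertion order in Python 3.7+)."""
    seen = {}
    for warnings in per_task_warnings:
        for warning in warnings:
            seen.setdefault(warning, None)
    return list(seen)
```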
## Configuration
### Worker Pool Settings
```python
# Default settings for M3 MacBook
DEFAULT_WORKERS = 8
DEFAULT_QUEUE_SIZE = 1000
DEFAULT_PROGRESS_INTERVAL = 5.0
DEFAULT_MEMORY_LIMIT_MB = 2048
DEFAULT_CPU_LIMIT_PERCENT = 90
```
### Task Priority Levels
- **0**: Highest priority (immediate processing)
- **1-5**: High priority (processed quickly)
- **6-10**: Normal priority (standard processing)
- **11+**: Low priority (processed when resources available)
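Priority ordering of this kind is commonly implemented with a min-heap; a sketch using Python's `heapq` (not the actual queue implementation):

```python
import heapq
import itertools

# Ties on priority fall back to insertion order via a monotonically
# increasing counter, so equal-priority tasks stay FIFO.
_counter = itertools.count()

def push_task(queue: list, priority: int, task_id: str) -> None:
    """Queue a task; a lower priority number is processed sooner."""
    heapq.heappush(queue, (priority, next(_counter), task_id))

def pop_task(queue: list) -> str:
    """Pop the highest-priority (lowest-numbered) task."""
    return heapq.heappop(queue)[2]
```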
### Retry Configuration
- **Default Max Retries**: 3
- **Retry Backoff**: Exponential with priority degradation
- **Retry Conditions**: Network errors, temporary failures
- **Permanent Failures**: File not found, permission denied
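The transient/permanent split above can be sketched as a classifier over exception types (the exact mapping is an assumption; unknown errors are treated as permanent here):

```python
# Transient errors are worth retrying; permanent ones fail immediately.
RETRYABLE = (TimeoutError, ConnectionError)
PERMANENT = (FileNotFoundError, PermissionError)

def should_retry(exc: Exception, attempts: int, max_retries: int = 3) -> bool:
    """Retry only transient errors that still have attempts remaining."""
    if isinstance(exc, PERMANENT):
        return False
    return isinstance(exc, RETRYABLE) and attempts < max_retries
```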
## Testing
### Unit Tests
Comprehensive test coverage for all batch processing components:
```bash
# Run batch processor tests
uv run pytest tests/test_batch_processor.py
# Run with coverage
uv run pytest tests/test_batch_processor.py --cov=src.services.batch_processor
```
### Test Coverage
- Worker pool initialization and configuration
- Task processing and error handling
- Progress tracking and resource monitoring
- Pause/resume functionality
- Quality metrics calculation
- Integration tests for multiple task types
## Troubleshooting
### Common Issues
#### High Memory Usage
```bash
# Reduce worker count and memory limit
trax batch /path/to/files --workers 4 --memory-limit 1024
```
#### Slow Processing
```bash
# Increase worker count (if resources available)
trax batch /path/to/files --workers 12 --cpu-limit 95
```
#### Frequent Failures
- Check file permissions and accessibility
- Verify API keys and rate limits
- Monitor network connectivity
- Review error logs for specific issues
### Debug Mode
```python
import logging
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
# Process with detailed logging
processor = create_batch_processor()
# ... processing code
```
## Future Enhancements
### Planned Features
- **Distributed Processing**: Multi-machine batch processing
- **Advanced Scheduling**: Time-based and dependency-based scheduling
- **Resource Prediction**: ML-based resource usage prediction
- **Dynamic Scaling**: Automatic worker scaling based on load
- **Advanced Analytics**: Detailed performance analytics and reporting
### Performance Improvements
- **GPU Acceleration**: GPU-accelerated processing for supported tasks
- **Streaming Processing**: Real-time streaming for live content
- **Advanced Caching**: Intelligent caching with predictive loading
- **Load Balancing**: Advanced load balancing across workers
## API Reference
### BatchProcessor Methods
#### `add_task(task_type, data, priority=0)`
Add a task to the processing queue.
#### `start(progress_callback=None)`
Start batch processing with optional progress callback.
#### `pause()`
Pause batch processing (workers will complete current tasks).
#### `resume()`
Resume batch processing.
#### `stop()`
Stop batch processing immediately.
#### `get_progress()`
Get current progress information.
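The pause/resume semantics — workers finish their current task, then block before taking the next one — can be sketched with an `asyncio.Event` gate (`PauseGate` is hypothetical, not the real API):

```python
import asyncio

class PauseGate:
    """Workers await the gate between tasks, so pause() lets in-flight
    tasks complete while blocking new ones from starting."""

    def __init__(self):
        self._running = asyncio.Event()
        self._running.set()  # start unpaused

    def pause(self):
        self._running.clear()

    def resume(self):
        self._running.set()

    async def wait_if_paused(self):
        # Returns immediately while running; blocks while paused.
        await self._running.wait()
```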
### Task Types
#### `TaskType.TRANSCRIBE`
Transcribe audio/video files using Whisper API.
#### `TaskType.ENHANCE`
Enhance transcripts using DeepSeek API.
#### `TaskType.YOUTUBE`
Extract metadata from YouTube URLs.
#### `TaskType.DOWNLOAD`
Download media files from URLs.
#### `TaskType.PREPROCESS`
Preprocess audio files for transcription.
---
*Last Updated: 2024-12-30*
*Version: 0.2.0*
*Status: Production Ready*