---
description: Trax project structure patterns and conventions for consistent development
globs: src/**/*.py, tests/**/*.py, docs/**/*.md, scripts/**/*.sh, *.toml, *.md
alwaysApply: false
---

# Trax Project Structure Rules

**Trax** is a production-ready media transcription platform with a protocol-based architecture. It is optimized for M3 MacBook performance and uses the `uv` package manager for dependency management.

> **Note**: For project overview, quick start, and navigation, see [agents.mdc](mdc:.cursor/rules/agents.mdc). This rule focuses on technical implementation patterns and directory structure standards.

## Architecture Principles

- **Protocol-Based Design**: Use `typing.Protocol` for all service interfaces
- **Database Registry Pattern**: Prevent SQLAlchemy "multiple classes" errors
- **Download-First Architecture**: Always download media before processing
- **Real Files Testing**: Use actual audio files, never mocks
- **Backend-First Development**: Get data layer right before UI
- **Single Responsibility**: Keep files under 300 LOC (350 max if justified)
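
The download-first principle can be illustrated with a minimal sketch. The helper name `ensure_local_copy` and the `downloads_dir` argument are hypothetical and not part of the Trax codebase. The sketch only handles sources that are already on disk; a real implementation would presumably fetch remote media in the same step.

```python
import shutil
from pathlib import Path


def ensure_local_copy(source: Path, downloads_dir: Path) -> Path:
    """Copy `source` into `downloads_dir` and return the local path.

    Hypothetical helper: processing code only ever receives the returned
    path, so it never reads from the original location.
    """
    downloads_dir.mkdir(parents=True, exist_ok=True)
    target = downloads_dir / source.name
    if not target.exists():
        # copy2 preserves file metadata such as modification time
        shutil.copy2(source, target)
    return target
```

Calling the helper twice with the same source returns the same path without copying again, which keeps repeated processing runs cheap.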

## Directory Structure Standards

### Root Directory Organization
```
trax/
├── AGENTS.md  # Development rules for AI agents
├── CLAUDE.md  # Project context for Claude Code
├── EXECUTIVE-SUMMARY.md  # High-level project overview
├── PROJECT-DIRECTORY.md  # Directory structure documentation
├── README.md  # Project introduction and quick start
├── pyproject.toml  # Project configuration and dependencies
├── requirements.txt  # Locked dependencies (uv generated)
├── alembic.ini  # Database migration configuration
└── scratchpad.md  # Temporary notes and ideas
```

### Source Code Organization (`src/`)
```
src/
├── config.py  # Centralized configuration system
├── cli/  # Command-line interface
│   ├── main.py  # Click-based CLI implementation
│   ├── enhanced_cli.py  # Enhanced CLI with progress
│   ├── research.py  # Research agent CLI
│   └── commands/  # Command modules
├── services/  # Business logic services
│   ├── protocols.py  # Service interfaces (REQUIRED)
│   ├── transcription_service.py
│   ├── media_service.py
│   ├── enhancement/  # AI enhancement services
│   ├── research/  # Research agent services
│   └── mocks/  # Mock implementations for testing
├── repositories/  # Data access layer
│   ├── media_repository.py
│   ├── transcription_repository.py
│   └── youtube_repository.py
├── database/  # Database layer
│   ├── __init__.py  # Registry pattern implementation
│   ├── models.py  # All models in single file
│   ├── connection.py  # Connection management
│   └── utils.py  # Database utilities
├── base/  # Base classes and shared functionality
│   ├── services.py  # Base service implementations
│   ├── repositories.py  # Repository base classes
│   └── processors.py  # Processing base classes
├── errors/  # Error handling system
│   ├── base.py  # Base error classes
│   ├── codes.py  # Error code definitions
│   └── classification.py  # Error classification
├── logging/  # Logging configuration
│   ├── config.py  # Logging setup
│   ├── metrics.py  # Performance metrics
│   └── utils.py  # Logging utilities
├── security/  # Security components
│   ├── encrypted_storage.py  # Secure storage
│   ├── input_sanitization.py  # Input validation
│   └── user_permissions.py  # Access control
└── agents/  # AI agent components
    ├── rules/  # Agent rule files
    │   ├── TRANSCRIPTION_RULES.md
    │   ├── BATCH_PROCESSING_RULES.md
    │   ├── DATABASE_RULES.md
    │   ├── CACHING_RULES.md
    │   └── EXPORT_RULES.md
    └── tools/  # Agent tools
```

### Testing Structure (`tests/`)
```
tests/
├── conftest.py  # Pytest configuration and fixtures
├── fixtures/  # Test fixtures and data
│   ├── audio/  # REAL audio files (no mocks)
│   │   ├── sample_5s.wav  # 5-second test file
│   │   ├── sample_30s.mp3  # 30-second test file
│   │   ├── sample_2m.mp4  # 2-minute test file
│   │   ├── sample_noisy.wav  # Noisy audio test
│   │   ├── sample_multi.wav  # Multi-speaker test
│   │   └── sample_tech.mp3  # Technical content test
│   └── README.md  # Test fixtures documentation
├── test_*.py  # Individual test modules
└── testing_suite.py  # Comprehensive test suite
```

### Documentation Structure (`docs/`)
```
docs/
├── architecture/  # Architecture documentation
│   ├── development-patterns.md  # Historical learnings
│   ├── audio-processing.md  # Audio pipeline details
│   ├── error-handling-and-logging.md  # Error system
│   └── iterative-pipeline.md  # Version progression
├── reports/  # Analysis reports
│   ├── 01-repository-inventory.md
│   ├── 02-historical-context.md
│   ├── 03-architecture-design.md
│   ├── 04-team-structure.md
│   ├── 05-technical-migration.md
│   └── 06-product-vision.md
├── templates/  # Documentation templates
│   ├── ai-friendly-prd-template.md
│   ├── adaptive-prd-template.md
│   └── ecosystem-prd-template.md
├── CLI.md  # Command reference
├── API.md  # API documentation
├── DATABASE.md  # Database schema
├── RESEARCH_AGENT.md  # Research agent docs
└── TROUBLESHOOTING.md  # Common issues
```

### Data Organization (`data/`)
```
data/
├── media/  # Media file storage
│   ├── downloads/  # Downloaded media files
│   └── processed/  # Processed audio files
├── exports/  # Export output files
│   ├── json/  # JSON export files
│   └── txt/  # Text export files
└── cache/  # Cache storage (if used)
```

### Scripts Directory (`scripts/`)
```
scripts/
├── setup_dev.sh  # Development environment setup
├── setup_postgresql.sh  # Database initialization
├── tm_master.sh  # Taskmaster master interface
├── tm_status.sh  # Status checking
├── tm_search.sh  # Task searching
├── tm_workflow.sh  # Workflow management
├── tm_analyze.sh  # Analysis tools
├── tm_quick.sh  # Quick operations
└── README_taskmaster_helpers.md  # Helper scripts documentation
```

## Coding Patterns and Conventions

### Service Layer Patterns

### Protocol-Based Interfaces
```python
# ✅ DO: Use Protocol-Based Interfaces
# src/services/protocols.py
from pathlib import Path
from typing import Optional, Protocol, runtime_checkable

# MediaFile, TranscriptionConfig, and TranscriptionResult are project types
# defined elsewhere; import them before using this module.

@runtime_checkable
class TranscriptionServiceProtocol(Protocol):
    """Protocol for transcription services."""

    async def transcribe_file(
        self,
        media_file: MediaFile,
        config: Optional[TranscriptionConfig] = None
    ) -> TranscriptionResult:
        """Transcribe a media file."""
        ...

# Implementation: no inheritance needed, only a matching method signature
class WhisperService:
    """Implements TranscriptionServiceProtocol."""

    async def transcribe_file(self, media_file, config=None):
        # Implementation here
        pass
```
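
Because the protocol is decorated with `@runtime_checkable`, conformance can be checked structurally with `isinstance`, without inheritance. The sketch below uses a simplified, hypothetical `Transcriber` protocol and two stand-in classes rather than the real Trax types. Note that `isinstance` only verifies that the method exists; it does not check the signature or whether the method is async.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Transcriber(Protocol):
    """Simplified stand-in for TranscriptionServiceProtocol (hypothetical)."""

    async def transcribe_file(self, media_file: str) -> str:
        ...


class LocalWhisper:
    """Matches the protocol structurally; it does not inherit from it."""

    async def transcribe_file(self, media_file: str) -> str:
        return f"transcript of {media_file}"


class NotAService:
    """Lacks transcribe_file, so it does not satisfy the protocol."""

    def run(self) -> None:
        pass


def is_transcriber(obj: object) -> bool:
    # runtime_checkable protocols check for method presence only.
    return isinstance(obj, Transcriber)
```

A check like `is_transcriber` is mostly useful at boundaries such as factories or plugin loading; static type checkers already enforce the protocol everywhere else.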

### Factory Functions
```python
# ✅ DO: Use Factory Functions
# src/services/factories.py
from typing import Any, Dict

# TranscriptionServiceProtocol, WhisperService, and MockTranscriptionService
# are project classes defined elsewhere; import them here.

def create_transcription_service(config: Dict[str, Any]) -> TranscriptionServiceProtocol:
    """Create transcription service instance."""
    service_type = config.get("type", "whisper")

    if service_type == "whisper":
        return WhisperService(config)
    elif service_type == "mock":
        return MockTranscriptionService(config)
    else:
        raise ValueError(f"Unknown service type: {service_type}")
```

### Database Layer Patterns

### Registry Pattern for Models
```python
# ✅ DO: Use Registry Pattern for Models
# src/database/__init__.py
from typing import Dict, Type
from sqlalchemy.orm import declarative_base  # import path for SQLAlchemy 1.4+

Base = declarative_base()

# Model registry to prevent SQLAlchemy conflicts
_model_registry: Dict[str, Type[Base]] = {}

def register_model(model_class: Type[Base]) -> Type[Base]:
    """Register a model in the central registry."""
    name = model_class.__name__
    if name in _model_registry:
        return _model_registry[name]  # Return existing
    _model_registry[name] = model_class
    return model_class

# Usage in models
@register_model
class MediaFile(Base):
    __tablename__ = "media_files"
    # Model definition here (a mapped class needs at least a primary key column)
```

### JSONB for Flexible Data
```python
# ✅ DO: Use JSONB for Flexible Data
# src/database/models.py
from uuid import uuid4

from sqlalchemy import Column
from sqlalchemy.dialects.postgresql import JSONB, UUID as PGUUID

from . import Base  # declarative base from src/database/__init__.py

class TranscriptionResult(Base):
    __tablename__ = "transcription_results"

    id = Column(PGUUID(as_uuid=True), primary_key=True, default=uuid4)
    content = Column(JSONB, nullable=False)  # Flexible transcript data
    segments = Column(JSONB)  # Timestamped segments
    confidence_scores = Column(JSONB)  # Confidence metrics
    processing_metadata = Column(JSONB)  # Additional metadata
```

### Repository Layer Patterns

### Repository Protocols
```python
# ✅ DO: Define Repository Protocols
# src/repositories/protocols.py
from typing import Any, Dict, Optional, Protocol, runtime_checkable
from uuid import UUID

# MediaFile is the project's model class, defined elsewhere; import it here.

@runtime_checkable
class MediaRepositoryProtocol(Protocol):
    """Protocol for media file repository operations."""

    async def create(self, media_data: Dict[str, Any]) -> MediaFile:
        """Create a new media file record."""
        ...

    async def get_by_id(self, media_id: UUID) -> Optional[MediaFile]:
        """Get media file by ID."""
        ...
```

### Error Handling Patterns

### Hierarchical Error System
```python
# ✅ DO: Use Hierarchical Error System
# src/errors/base.py
from datetime import datetime, timezone
from typing import Any, Dict, Optional

from .codes import ErrorCode  # assumed location: src/errors/codes.py

class TraxError(Exception):
    """Base exception for all Trax platform errors."""

    def __init__(
        self,
        message: str,
        error_code: Optional[ErrorCode] = None,
        context: Optional[Dict[str, Any]] = None,
        original_error: Optional[Exception] = None
    ):
        super().__init__(message)
        self.message = message
        self.error_code = error_code
        self.context = context or {}
        self.original_error = original_error
        self.timestamp = datetime.now(timezone.utc)

# Specific error types
class TranscriptionError(TraxError):
    """Error raised when transcription processing fails."""
    pass

class MediaProcessingError(TraxError):
    """Error raised when media processing fails."""
    pass
```

### Configuration Patterns

### Centralized Configuration
```python
# ✅ DO: Use Centralized Configuration
# src/config.py
import os
from pathlib import Path
from typing import List, Optional

class Config:
    """Centralized configuration for the trax project."""

    # Project paths
    PROJECT_ROOT = Path(__file__).parent.parent
    DATA_DIR = PROJECT_ROOT / "data"

    # API Keys - AI Services (from root .env)
    ANTHROPIC_API_KEY: Optional[str] = os.getenv("ANTHROPIC_API_KEY")
    DEEPSEEK_API_KEY: Optional[str] = os.getenv("DEEPSEEK_API_KEY")
    OPENAI_API_KEY: Optional[str] = os.getenv("OPENAI_API_KEY")

    @classmethod
    def validate_required_keys(cls, required_keys: List[str]) -> bool:
        """Validate that required API keys are present."""
        missing_keys = []
        for key in required_keys:
            if not getattr(cls, key, None):
                missing_keys.append(key)

        if missing_keys:
            print(f"❌ Missing required API keys: {', '.join(missing_keys)}")
            return False

        return True

# Create convenience instance
config = Config()
```

### CLI Patterns

### Click with Rich for User Interface
```python
# ✅ DO: Use Click with Rich for User Interface
# src/cli/main.py
import click
from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TextColumn

console = Console()

@click.group()
@click.version_option(version="1.0.0")
def cli():
    """Trax: Personal Research Transcription Tool"""
    pass

@cli.command()
@click.argument("input_file", type=click.Path(exists=True))
@click.option("--output", "-o", help="Output directory")
# Bind --format to `output_format` so the builtin format() is not shadowed
@click.option("--format", "-f", "output_format", type=click.Choice(["json", "txt", "srt"]))
def transcribe(input_file, output, output_format):
    """Transcribe a media file."""
    with Progress(
        SpinnerColumn(),
        TextColumn("[progress.description]{task.description}"),
        console=console,
    ) as progress:
        task = progress.add_task("Transcribing...", total=None)
        # Processing logic here
        progress.update(task, description="Complete!")
```

### Testing Patterns

### Real Audio Files for Testing
```python
# ✅ DO: Use Real Audio Files for Testing
# tests/conftest.py
from pathlib import Path

import pytest

@pytest.fixture
def sample_audio_files():
    """Provide real audio files for testing."""
    return {
        "short": Path("tests/fixtures/audio/sample_5s.wav"),
        "medium": Path("tests/fixtures/audio/sample_30s.mp3"),
        "long": Path("tests/fixtures/audio/sample_2m.mp4"),
        "noisy": Path("tests/fixtures/audio/sample_noisy.wav"),
        "multi_speaker": Path("tests/fixtures/audio/sample_multi.wav"),
        "technical": Path("tests/fixtures/audio/sample_tech.mp3"),
    }

# Test implementation
@pytest.mark.asyncio  # assumes the pytest-asyncio plugin is installed
async def test_transcription_accuracy(sample_audio_files, transcription_service):
    """Test transcription with real audio files."""
    result = await transcription_service.transcribe_file(
        sample_audio_files["short"]
    )

    assert result.accuracy >= 0.95  # 95% accuracy requirement
    assert len(result.segments) > 0
    assert result.processing_time < 30.0  # Performance requirement
```

### Anti-Patterns
```python
# ❌ DON'T: Use Mocks for Core Functionality
from unittest.mock import patch

# BAD: Mocking audio processing
@patch("whisper.load_model")
def test_transcription_mock(mock_whisper):
    # This won't catch real audio processing issues
    pass

# GOOD: Use real files with small samples
def test_transcription_real(sample_audio_files):
    # Tests actual audio processing pipeline
    pass
```

## File Size and Organization Guidelines

### Keep Files Focused and Manageable
- **Maximum 300 LOC** per file (350 if well-justified)
- **Single responsibility** per module
- **Clear naming** that describes purpose
- **Logical grouping** of related functionality
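
The size limits above can be checked automatically. The following is a minimal sketch, not an existing Trax script: the function names are hypothetical, and it assumes "LOC" means non-blank lines (comments included), which the guideline does not specify.

```python
from pathlib import Path
from typing import List, Tuple

SOFT_LIMIT = 300  # target from the guideline above
HARD_LIMIT = 350  # ceiling when justified


def count_loc(path: Path) -> int:
    """Count non-blank lines; comments are counted, which is a simplification."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def find_oversized(root: Path, limit: int = SOFT_LIMIT) -> List[Tuple[Path, int]]:
    """Return (path, loc) pairs for .py files under `root` that exceed `limit`."""
    oversized = []
    for path in sorted(root.rglob("*.py")):
        loc = count_loc(path)
        if loc > limit:
            oversized.append((path, loc))
    return oversized
```

Running `find_oversized(Path("src"))` lists files above the soft limit; passing `limit=HARD_LIMIT` narrows the report to files that exceed the ceiling.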

### Protocol and Implementation Separation
```python
# protocols.py - Interface definitions only
from typing import Protocol, runtime_checkable

# Result is a placeholder for the project's own return type.

@runtime_checkable
class ServiceProtocol(Protocol):
    def method(self) -> Result: ...

# service_impl.py - Implementation
class ConcreteService:
    def method(self) -> Result:
        # Implementation here
        pass

# __init__.py - Public API
from .protocols import ServiceProtocol
from .service_impl import ConcreteService

__all__ = ["ServiceProtocol", "ConcreteService"]
```

## Development Workflow Patterns

### Adding New Services
1. **Define Protocol** in `src/services/protocols.py`
2. **Create Implementation** in `src/services/service_name.py`
3. **Add Factory Function** in `src/services/factories.py`
4. **Write Tests** with real data in `tests/test_service_name.py`
5. **Update Documentation** in `docs/`

### Database Changes
1. **Update Models** in `src/database/models.py`
2. **Create Migration** with `alembic revision -m "description"`
3. **Test Migration** with up/down paths
4. **Update Documentation** in `docs/DATABASE.md`
5. **Update Changelog** in `CHANGELOG.md`

### CLI Enhancements
1. **Add Command** in appropriate `src/cli/commands/` module
2. **Register Command** in `src/cli/main.py`
3. **Add Progress Reporting** with Rich
4. **Write Integration Test** in `tests/test_cli.py`
5. **Update CLI Documentation** in `docs/CLI.md`

## Performance and Resource Management

### Memory Usage Guidelines
- **Target <2GB** for v1 pipeline
- **Monitor Memory** with progress callbacks
- **Cleanup Resources** after processing
- **Use Streaming** for large files when possible
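
The streaming and monitoring guidelines above can be sketched with the standard library alone. This is an illustrative example, not Trax code: the function names and the 1 MiB chunk size are assumptions, and `tracemalloc` only tracks memory allocated by Python itself, so it will not see memory used inside native libraries.

```python
import hashlib
import tracemalloc
from pathlib import Path
from typing import Iterator, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an illustrative value


def iter_chunks(path: Path, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the file in fixed-size chunks instead of loading it whole."""
    with path.open("rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def checksum_with_peak_memory(path: Path) -> Tuple[str, int]:
    """Hash a file chunk by chunk and report peak traced memory in bytes."""
    tracemalloc.start()
    try:
        digest = hashlib.sha256()
        for chunk in iter_chunks(path):
            digest.update(chunk)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # release tracing resources after processing
    return digest.hexdigest(), peak
```

Because only one chunk is held at a time, peak memory stays close to the chunk size regardless of how large the media file is.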

### Concurrency Patterns
```python
# ✅ DO: Use asyncio for I/O operations
import asyncio
from pathlib import Path
from typing import List

# Result and process_file are defined elsewhere in the project.

async def process_batch(files: List[Path]) -> List[Result]:
    """Process files concurrently, at most 8 at a time."""
    semaphore = asyncio.Semaphore(8)  # M3 optimized

    async def process_with_limit(file_path: Path) -> Result:
        async with semaphore:
            return await process_file(file_path)

    tasks = [process_with_limit(f) for f in files]
    # gather returns results in the same order as `files`
    return await asyncio.gather(*tasks)
```

## Documentation Standards

### Rule Files Structure
````markdown
# Rule Title

## Core Principles
1. **Principle 1**: Description
2. **Principle 2**: Description

## Implementation Patterns

### Pattern Name
```code
# Example implementation
```

### Anti-Patterns
```code
# What NOT to do
```

## Performance Guidelines
- Guideline 1
- Guideline 2
````

### API Documentation
- **Use Docstrings** for all public interfaces
- **Include Examples** in documentation
- **Document Protocols** with clear contracts
- **Update README.md** for user-facing changes
- **Update agents.mdc** for project context and navigation changes

## Security and Validation

### Input Sanitization
```python
# ✅ DO: Sanitize and validate user input
# src/security/input_sanitization.py
import re
from pathlib import Path

# ValidationError is the project's own exception type, defined elsewhere.

def sanitize_file_path(path: str) -> Path:
    """Sanitize and validate file paths."""
    # Remove dangerous characters
    clean_path = re.sub(r'[<>:"|?*]', '', path)

    # Prevent directory traversal. The substring check is deliberately
    # strict: it also rejects harmless names such as "a..b.wav".
    if '..' in clean_path:
        raise ValidationError("Directory traversal not allowed")

    return Path(clean_path)
```

### Environment Configuration
- **API Keys** inherited from root project `.env`
- **Local Overrides** via `.env.local`
- **Validation** of required keys at startup
- **Secure Storage** for sensitive data
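
The layering described above (root `.env`, then `.env.local` overrides, then validation) can be sketched as follows. This is a minimal standard-library illustration with hypothetical function names, not the loader Trax actually uses; it assumes simple `KEY=VALUE` lines and that local values take precedence. A real project would more likely rely on a dedicated dotenv library.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Mapping


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    if not path.exists():
        return values
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(root_env: Path, local_env: Path) -> Dict[str, str]:
    """Merge the root .env with .env.local; local values win (assumption)."""
    settings = parse_env_file(root_env)
    settings.update(parse_env_file(local_env))
    return settings


def missing_keys(settings: Mapping[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]
```

At startup, a non-empty result from `missing_keys` would be the signal to stop early with a clear message rather than fail later inside a service call.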

---

**This rule ensures consistent project structure and development patterns across the Trax media transcription platform.**