# YouTube Summarizer - Technical Architecture

## Architecture Overview

This document defines the comprehensive technical architecture for the YouTube Summarizer application, designed as a self-hosted, hobby-scale system with professional code quality.

### Design Principles

1. **Self-Hosted Priority**: All components run locally without external cloud dependencies (except AI API calls)
2. **Hobby Scale Optimization**: Simple deployment with Docker Compose, cost-effective (~$0.10/month)
3. **Professional Code Quality**: Modern technologies, type safety, comprehensive testing
4. **Background Processing**: Video processing runs as background jobs so requests return quickly and long videos are handled reliably (a user-requested priority)
5. **Learning-Friendly**: Technologies that provide quick feedback loops and satisfying development experience

## Technology Stack

### Backend Stack

| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| **Runtime** | Python | 3.11+ | AI library compatibility |
| **Framework** | FastAPI | Latest | High-performance async API |
| **Database** | SQLite → PostgreSQL | Latest | Development → Production |
| **ORM** | SQLAlchemy | 2.0+ | Async database operations |
| **Validation** | Pydantic | V2 | Request/response validation |
| **ASGI Server** | Uvicorn | Latest | Production ASGI server |
| **Testing** | pytest | Latest | Unit and integration testing |

### Frontend Stack

| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| **Framework** | React | 18+ | Modern UI framework |
| **Language** | TypeScript | Latest | Type-safe development |
| **Build Tool** | Vite | Latest | Fast development and building |
| **UI Library** | shadcn/ui | Latest | Component design system |
| **Styling** | Tailwind CSS | Latest | Utility-first CSS |
| **State Management** | Zustand | Latest | Global state management |
| **Server State** | React Query | Latest | API calls and caching |
| **Testing** | Vitest + RTL | Latest | Component and unit testing |

### AI & External Services

| Service | Provider | Model | Purpose |
|---------|----------|-------|---------|
| **Primary AI** | OpenAI | GPT-4o-mini | Cost-effective summarization |
| **Fallback AI** | Anthropic | Claude 3 Haiku | Backup model |
| **Alternative** | DeepSeek | DeepSeek Chat | Budget option |
| **Video APIs** | YouTube | youtube-transcript-api | Transcript extraction |
| **Metadata** | YouTube | yt-dlp | Video metadata |

### Development & Deployment

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Containerization** | Docker + Docker Compose | Self-hosted deployment |
| **Code Quality** | Black + Ruff + mypy | Python formatting and linting |
| **Frontend Quality** | ESLint + Prettier | TypeScript/React standards |
| **Pre-commit** | pre-commit hooks | Automated quality checks |
| **Documentation** | FastAPI Auto Docs | API documentation |

## System Architecture

### High-Level Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  React Frontend │    │ FastAPI Backend │    │   AI Services   │
│                 │    │                 │    │                 │
│  • shadcn/ui    │◄──►│  • REST API     │◄──►│  • OpenAI       │
│  • TypeScript   │    │  • Background   │    │  • Anthropic    │
│  • Zustand      │    │    Tasks        │    │  • DeepSeek     │
│  • React Query  │    │  • SQLAlchemy   │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
        │                       │
        │                       ▼
        │               ┌─────────────────┐
        │               │   SQLite DB     │
        └──────────────►│                 │
                        │  • Summaries    │
                        │  • Jobs         │
                        │  • Cache        │
                        └─────────────────┘
```

### Project Structure

```
youtube-summarizer/
├── frontend/                 # React TypeScript frontend
│   ├── src/
│   │   ├── components/      # UI components
│   │   │   ├── ui/         # shadcn/ui base components
│   │   │   ├── forms/      # Form components
│   │   │   ├── summary/    # Summary display components
│   │   │   ├── history/    # History management
│   │   │   ├── processing/ # Status and progress
│   │   │   ├── layout/     # Layout components
│   │   │   └── error/      # Error handling components
│   │   ├── hooks/          # Custom React hooks
│   │   │   ├── api/        # API-specific hooks
│   │   │   └── ui/         # UI utility hooks
│   │   ├── api/            # API client layer
│   │   ├── stores/         # Zustand stores
│   │   ├── types/          # TypeScript definitions
│   │   └── test/           # Test utilities
│   ├── public/             # Static assets
│   ├── package.json        # Dependencies and scripts
│   ├── vite.config.ts      # Build configuration
│   ├── vitest.config.ts    # Test configuration
│   └── tailwind.config.js  # Styling configuration
├── backend/                  # FastAPI Python backend
│   ├── api/                # API endpoints
│   │   ├── __init__.py
│   │   ├── summarize.py    # Main summarization endpoints
│   │   ├── summaries.py    # Summary retrieval endpoints
│   │   └── health.py       # Health check endpoints
│   ├── services/           # Business logic
│   │   ├── __init__.py
│   │   ├── video_service.py # YouTube integration
│   │   ├── ai_service.py   # AI model integration
│   │   └── cache_service.py # Caching logic
│   ├── models/             # Database models
│   │   ├── __init__.py
│   │   ├── summary.py      # Summary data model
│   │   └── job.py          # Processing job model
│   ├── repositories/       # Data access layer
│   │   ├── __init__.py
│   │   ├── summary_repository.py
│   │   └── job_repository.py
│   ├── core/               # Core utilities
│   │   ├── __init__.py
│   │   ├── config.py       # Configuration management
│   │   ├── database.py     # Database connection
│   │   ├── exceptions.py   # Custom exception classes
│   │   ├── security.py     # Rate limiting and validation
│   │   └── cache.py        # Caching implementation
│   ├── tests/              # Test suite
│   │   ├── unit/          # Unit tests
│   │   ├── integration/   # Integration tests
│   │   └── conftest.py    # Test configuration
│   ├── main.py             # FastAPI application entry
│   ├── requirements.txt    # Python dependencies
│   └── Dockerfile          # Container configuration
├── docker-compose.yml        # Self-hosted deployment
├── .env.example             # Environment template
├── .pre-commit-config.yaml  # Code quality hooks
├── .gitignore              # Git ignore patterns
└── README.md               # Setup and usage guide
```

## Data Models

### Summary Model

```python
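import uuid
from datetime import datetime

from sqlalchemy import JSON, UUID, Column, DateTime, Float, Integer, String, Text
from sqlalchemy.orm import declarative_base

# Shared declarative base (defined here for illustration; it may instead live in core/database.py)
Base = declarative_base()


class Summary(Base):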
    __tablename__ = "summaries"

    # Primary key
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)

    # Video information
    video_id = Column(String(20), nullable=False, index=True)
    video_title = Column(Text)
    video_url = Column(Text, nullable=False)
    video_duration = Column(Integer)  # Duration in seconds
    video_channel = Column(String(255))
    video_upload_date = Column(String(20))  # YYYY-MM-DD format
    video_thumbnail_url = Column(Text)
    video_view_count = Column(Integer)

    # Transcript data
    transcript_text = Column(Text)
    transcript_language = Column(String(10), default='en')
    transcript_type = Column(String(20))  # 'manual' or 'auto-generated'

    # Summary data
    summary_text = Column(Text)
    key_points = Column(JSON)  # Array of strings
    chapters = Column(JSON)    # Array of chapter objects

    # Processing metadata
    model_used = Column(String(50), nullable=False)
    processing_time = Column(Float)  # Processing time in seconds
    token_count = Column(Integer)    # Total tokens used
    cost_estimate = Column(Float)    # Estimated cost in USD

    # Timestamps
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Cache keys for invalidation
    cache_key = Column(String(255), index=True)  # Hash of video_id + model + options
```

### Processing Job Model

```python
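import enum
import uuid
from datetime import datetime

from sqlalchemy import JSON, UUID, Column, DateTime, Enum, ForeignKey, Integer, String, Text

# Base is the declarative base shared with the Summary model.


class JobStatus(str, enum.Enum):
    # Assumed lifecycle states; only PENDING is named elsewhere in this document
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


class ProcessingJob(Base):
    __tablename__ = "processing_jobs"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    video_url = Column(Text, nullable=False)
    video_id = Column(String(20), nullable=False)

    # Job configuration
    model_name = Column(String(50), nullable=False)
    options = Column(JSON)  # Summary options (length, focus, etc.)

    # Job status
    status = Column(Enum(JobStatus), default=JobStatus.PENDING, nullable=False)
    progress_percentage = Column(Integer, default=0)
    current_step = Column(String(50))  # "validating", "extracting", "summarizing"

    # Results
    summary_id = Column(UUID(as_uuid=True), ForeignKey("summaries.id"))  # Set once the summary is stored
    error_message = Column(Text)
    error_code = Column(String(50))

    # Timing
    created_at = Column(DateTime, default=datetime.utcnow)
    started_at = Column(DateTime)
    completed_at = Column(DateTime)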
```

## API Specification

### Core Endpoints

#### POST /api/summarize
**Purpose**: Submit a YouTube URL for summarization

**Request**:
```typescript
interface SummarizeRequest {
  url: string;           // YouTube URL
  model?: string;        // AI model selection (default: "openai")
  options?: {
    length?: "brief" | "standard" | "detailed";
    focus?: string;
  };
}
```

**Response**:
```typescript
interface SummarizeResponse {
  id: string;            // Summary ID
  video: VideoMetadata;  // Video information
  summary: SummaryData;  // Generated summary
  status: "completed" | "processing";
  processing_time: number;
}
```

#### GET /api/summary/{id}
**Purpose**: Retrieve a specific summary

**Response**:
```typescript
interface SummaryResponse {
  id: string;
  video: VideoMetadata;
  summary: SummaryData;
  created_at: string;
  metadata: ProcessingMetadata;
}
```

#### GET /api/summaries
**Purpose**: List recent summaries with optional filtering

**Query Parameters**:
- `limit`: Number of results (default: 20)
- `search`: Search term for title/content
- `model`: Filter by AI model used

### Error Handling

#### Error Response Format
```typescript
interface APIErrorResponse {
  error: {
    code: string;        // Error code (e.g., "INVALID_URL")
    message: string;     // Human-readable message
    details: object;     // Additional error context
    recoverable: boolean; // Whether retry might succeed
    timestamp: string;   // ISO timestamp
    path: string;        // Request path
  }
}
```

#### Error Codes
- `INVALID_URL`: Invalid YouTube URL format
- `VIDEO_NOT_FOUND`: Video is unavailable or private
- `TRANSCRIPT_UNAVAILABLE`: No transcript available for video
- `AI_SERVICE_ERROR`: AI service temporarily unavailable
- `RATE_LIMITED`: Too many requests from this IP
- `TOKEN_LIMIT_EXCEEDED`: Video transcript too long for model
- `UNKNOWN_ERROR`: Unexpected server error
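
As a rough illustration of the error format above, the backend could build payloads with a small helper. `build_error_response` and `RECOVERABLE_CODES` are hypothetical names, and which codes count as recoverable is an assumption made for this sketch rather than part of the specification.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Assumption: only transient failures are worth retrying.
RECOVERABLE_CODES = {"AI_SERVICE_ERROR", "RATE_LIMITED"}


def build_error_response(
    code: str,
    message: str,
    path: str,
    details: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a payload matching the APIErrorResponse shape."""
    return {
        "error": {
            "code": code,
            "message": message,
            "details": details or {},
            "recoverable": code in RECOVERABLE_CODES,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "path": path,
        }
    }


payload = build_error_response(
    "INVALID_URL", "Invalid YouTube URL format", "/api/summarize"
)
print(payload["error"]["recoverable"])  # False
```

Keeping the recoverable decision in one place lets the frontend decide whether to offer a retry without parsing messages.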

## Frontend Architecture

### Component Architecture

#### Core Components
- **SummarizeForm**: Main URL input form with validation
- **SummaryDisplay**: Comprehensive summary viewer with export options
- **ProcessingStatus**: Real-time progress updates
- **SummaryHistory**: Searchable list of previous summaries
- **ErrorBoundary**: React error boundaries with recovery options

#### State Management

**Zustand Stores**:
```typescript
interface AppStore {
  // UI state
  theme: 'light' | 'dark';
  sidebarOpen: boolean;

  // Processing state
  currentJob: ProcessingJob | null;
  processingHistory: ProcessingJob[];

  // Settings
  defaultModel: string;
  summaryLength: string;
}

interface SummaryStore {
  summaries: Summary[];
  currentSummary: Summary | null;
  searchResults: Summary[];

  // Actions
  addSummary: (summary: Summary) => void;
  updateSummary: (id: string, updates: Partial<Summary>) => void;
  searchSummaries: (query: string) => void;
}
```

#### API Client Architecture

**TypeScript API Client**:
```typescript
class APIClient {
  private baseURL: string;
  private httpClient: AxiosInstance;

  // Configure automatic retries and error handling
  constructor(baseURL: string) {
    this.baseURL = baseURL;
    this.httpClient = axios.create({
      baseURL,
      timeout: 30000,
    });
    this.setupInterceptors();
  }

  // Type-safe API methods
  async summarizeVideo(request: SummarizeRequest): Promise<SummarizeResponse>;
  async getSummary(id: string): Promise<SummaryResponse>;
  async getSummaries(params?: SummaryListParams): Promise<SummaryListResponse>;
  async exportSummary(id: string, format: ExportFormat): Promise<Blob>;
}
```

## Backend Services

### Video Service
**Purpose**: Handle YouTube URL processing and transcript extraction

**Key Methods**:
```python
from typing import Any, Dict


class VideoService:
    async def extract_video_id(self, url: str) -> str:
        """Extract video ID with comprehensive URL format support"""

    async def get_transcript(self, video_id: str) -> Dict[str, Any]:
        """Get transcript with fallback chain:
        1. Manual captions (preferred)
        2. Auto-generated captions
        3. Error with helpful message
        """

    async def get_video_metadata(self, video_id: str) -> Dict[str, Any]:
        """Extract metadata using yt-dlp for rich video information"""
```
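
One possible shape for the URL parsing step is sketched below using only the standard library. It is synchronous and returns `None` for unrecognized input, which differs from the `async` signature above. The set of accepted hosts and path prefixes, and the 11-character ID pattern, are assumptions about which URL formats matter here.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters drawn from this alphabet.
_VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")
_YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "music.youtube.com"}
_PATH_PREFIXES = {"embed", "shorts", "live", "v"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID for common YouTube URL shapes, or None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate: Optional[str] = None
    if host == "youtu.be":
        # Short links carry the ID as the first path segment
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in _PATH_PREFIXES:
                candidate = parts[1]

    if candidate and _VIDEO_ID.match(candidate):
        return candidate
    return None


print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # dQw4w9WgXcQ
```

URLs without a scheme have no hostname after parsing, so this sketch rejects them; a caller could prepend `https://` before parsing if bare domains should be accepted.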

### AI Service
**Purpose**: Manage AI model integration with provider abstraction

**Key Methods**:
```python
from typing import Any, Dict, Optional


class AIService:
    def __init__(self, provider: str, api_key: str):
        self.provider = provider
        self.client = self._get_client(provider, api_key)

    async def generate_summary(
        self,
        transcript: str,
        video_metadata: Dict[str, Any],
        options: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        """Generate structured summary with:
        - Overview paragraph
        - Key points list
        - Chapter breakdown (if applicable)
        - Cost tracking
        """
```

### Cache Service
**Purpose**: Intelligent caching to minimize API costs

**Caching Strategy**:
```python
from typing import Dict, Optional


class CacheService:
    def get_cache_key(self, video_id: str, model: str, options: Dict) -> str:
        """Generate cache key from video_id + model + options hash"""

    async def get_cached_summary(self, cache_key: str) -> Optional[Summary]:
        """Retrieve cached summary if within TTL"""

    async def cache_summary(self, cache_key: str, summary: Summary, ttl: int = 86400):
        """Store summary with 24-hour default TTL"""
```
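
The TTL behavior described above can be illustrated with an in-memory stand-in. `InMemoryTTLCache` is a hypothetical class, not the service itself; the `now` parameter exists only so expiry can be demonstrated deterministically.

```python
import time
from typing import Any, Dict, Optional, Tuple


class InMemoryTTLCache:
    """Minimal in-memory cache; entries expire `ttl` seconds after being stored."""

    def __init__(self) -> None:
        # Maps cache key -> (expiry timestamp, cached value)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl: int = 86400, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._entries[key] = (stored_at + ttl, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        current = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if current >= expires_at:
            del self._entries[key]  # drop stale entries lazily on read
            return None
        return value


cache = InMemoryTTLCache()
cache.set("abc123", {"summary_text": "cached"}, ttl=60, now=0.0)
print(cache.get("abc123", now=59.0))  # {'summary_text': 'cached'}
print(cache.get("abc123", now=60.0))  # None
```

A database-backed version would apply the same expiry comparison against a stored timestamp instead of a dictionary entry.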

## Testing Strategy

### Backend Testing

**Test Structure**:
```
backend/tests/
├── unit/
│   ├── test_video_service.py      # URL parsing, transcript extraction
│   ├── test_ai_service.py         # AI integration, prompt engineering
│   ├── test_cache_service.py      # Cache logic, key generation
│   └── test_repositories.py      # Database operations
├── integration/
│   ├── test_api.py               # End-to-end API testing
│   ├── test_background_jobs.py   # Background processing
│   └── test_error_handling.py    # Error scenarios
└── conftest.py                   # Test configuration and fixtures
```

**Testing Patterns**:
- **Repository Pattern Testing**: Mock database, test data operations
- **Service Layer Testing**: Mock external APIs, test business logic
- **API Endpoint Testing**: FastAPI TestClient for request/response testing
- **Error Scenario Testing**: Comprehensive error condition coverage

### Frontend Testing

**Test Structure**:
```
frontend/src/
├── components/
│   ├── SummarizeForm.test.tsx    # Form validation, submission
│   ├── SummaryDisplay.test.tsx   # Summary rendering, export
│   └── ErrorBoundary.test.tsx    # Error handling components
├── hooks/
│   ├── api/
│   │   └── useSummarization.test.ts # API hook testing
│   └── ui/
├── test/
│   ├── setup.ts                  # Global test configuration
│   ├── mocks/                    # API and component mocks
│   └── utils.tsx                 # Test utilities and wrappers
└── api/
    └── client.test.ts            # API client testing
```

**Testing Patterns**:
- **Component Testing**: Render, interaction, and state testing
- **Custom Hook Testing**: Logic testing with renderHook
- **API Client Testing**: Mock HTTP responses, error handling
- **Integration Testing**: Full user flow testing

### Test Configuration

**pytest Configuration** (`backend/pytest.ini`):
```ini
[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts =
    --verbose
    --cov=.
    --cov-report=html
    --cov-report=term-missing
    --asyncio-mode=auto
```

**Vitest Configuration** (`frontend/vitest.config.ts`):
```typescript
import { defineConfig } from 'vitest/config';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  test: {
    environment: 'jsdom',
    setupFiles: ['./src/test/setup.ts'],
    globals: true,
    css: true,
    coverage: {
      reporter: ['text', 'html', 'json'],
      exclude: ['node_modules/', 'src/test/']
    }
  }
});
```

## Deployment Architecture

### Self-Hosted Docker Deployment

**Docker Compose Configuration**:
```yaml
version: '3.8'

services:
  backend:
    build: ./backend
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=sqlite:///./data/youtube_summarizer.db
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      - VITE_API_URL=http://localhost:8000
    depends_on:
      - backend
    restart: unless-stopped
```

### Environment Configuration

**Required Environment Variables**:
```bash
# API Keys (at least one required)
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
DEEPSEEK_API_KEY=sk-your-deepseek-key

# Database
DATABASE_URL=sqlite:///./data/youtube_summarizer.db

# Security
SECRET_KEY=your-secret-key-here
CORS_ORIGINS=http://localhost:3000,http://localhost:5173

# Optional: YouTube API for metadata
YOUTUBE_API_KEY=your-youtube-api-key

# Application Settings
MAX_VIDEO_LENGTH_MINUTES=180
RATE_LIMIT_PER_MINUTE=30
CACHE_TTL_HOURS=24

# Frontend Environment Variables (Vite only exposes variables prefixed with VITE_)
VITE_API_URL=http://localhost:8000
VITE_ENVIRONMENT=development
```
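
A backend could load and validate these variables along the following lines. This is a standard-library sketch: `Settings` and `load_settings` are hypothetical names, and the defaults simply mirror the values listed above. Since Pydantic is already in the stack, a Pydantic-based settings class would be a natural alternative.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping

API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "DEEPSEEK_API_KEY")


@dataclass(frozen=True)
class Settings:
    database_url: str
    cors_origins: List[str]
    max_video_length_minutes: int
    rate_limit_per_minute: int
    cache_ttl_hours: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, enforcing the one-API-key rule."""
    if not any(env.get(name) for name in API_KEY_VARS):
        raise ValueError("At least one AI provider API key must be set")
    origins = [o.strip() for o in env.get("CORS_ORIGINS", "").split(",") if o.strip()]
    return Settings(
        database_url=env.get("DATABASE_URL", "sqlite:///./data/youtube_summarizer.db"),
        cors_origins=origins,
        max_video_length_minutes=int(env.get("MAX_VIDEO_LENGTH_MINUTES", "180")),
        rate_limit_per_minute=int(env.get("RATE_LIMIT_PER_MINUTE", "30")),
        cache_ttl_hours=int(env.get("CACHE_TTL_HOURS", "24")),
    )


settings = load_settings(
    {
        "OPENAI_API_KEY": "sk-your-openai-key",
        "CORS_ORIGINS": "http://localhost:3000, http://localhost:5173",
    }
)
print(settings.cors_origins)  # ['http://localhost:3000', 'http://localhost:5173']
```

Failing at startup when no provider key is present surfaces misconfiguration before the first summarization request.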

## Security Considerations

### Input Validation
- **URL Validation**: Comprehensive YouTube URL format checking
- **Input Sanitization**: HTML escaping and XSS prevention
- **Request Size Limits**: Prevent oversized requests

### Rate Limiting
```python
class RateLimiter:
    def __init__(self, max_requests: int = 30, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds

    def is_allowed(self, client_ip: str) -> bool:
        """Check if request is allowed for this IP"""
```
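
One way to fill in `is_allowed` is a sliding-window counter per IP, sketched below. The class name `SlidingWindowRateLimiter` and the optional `now` argument are additions for this illustration; state is held in process memory, so it would not be shared across multiple workers.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowRateLimiter:
    def __init__(self, max_requests: int = 30, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-IP timestamps of accepted requests, oldest first
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def is_allowed(self, client_ip: str, now: Optional[float] = None) -> bool:
        """Check if request is allowed for this IP, recording it if so."""
        current = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Discard timestamps that have fallen out of the window
        while hits and current - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(current)
        return True


limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=60)
print(limiter.is_allowed("203.0.113.5", now=0.0))   # True
print(limiter.is_allowed("203.0.113.5", now=1.0))   # True
print(limiter.is_allowed("203.0.113.5", now=2.0))   # False
print(limiter.is_allowed("203.0.113.5", now=60.0))  # True
```

Rejected requests are not recorded, so a client that keeps retrying while blocked does not extend its own lockout.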

### API Key Management
- Environment variable storage (never commit to repository)
- Rotation capability for production deployments
- Separate keys for different environments

### CORS Configuration
```python
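from fastapi.middleware.cors import CORSMiddleware

# `app` is the FastAPI instance created in main.py
app.add_middleware(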
    CORSMiddleware,
    allow_origins=["http://localhost:3000", "http://localhost:5173"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["*"],
)
```

## Performance Optimization

### Backend Optimization
- **Async Everything**: All I/O operations use async/await
- **Background Processing**: Long-running tasks don't block requests
- **Intelligent Caching**: Memory and database caching layers
- **Connection Pooling**: Database connection reuse
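
The background-processing point can be sketched with plain `asyncio`. Everything here is illustrative: the in-memory `jobs` dictionary stands in for the processing jobs table, the step names follow the `current_step` examples in the job model, and the progress percentages are invented. FastAPI's `BackgroundTasks` would be another option for scheduling the work.

```python
import asyncio
import uuid
from typing import Dict

# In-memory stand-ins for the job table and the set of running tasks
jobs: Dict[str, Dict[str, object]] = {}
background_tasks: set = set()


async def process_video(job_id: str) -> None:
    """Walk a job through the pipeline steps, updating progress as it goes."""
    job = jobs[job_id]
    job["status"] = "processing"
    for progress, step in [(10, "validating"), (40, "extracting"), (90, "summarizing")]:
        job["current_step"] = step
        job["progress_percentage"] = progress
        await asyncio.sleep(0)  # placeholder for the real async work
    job["status"] = "completed"
    job["progress_percentage"] = 100


async def submit(video_url: str) -> str:
    """Register a job and schedule it without waiting for it to finish."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"video_url": video_url, "status": "pending", "progress_percentage": 0}
    task = asyncio.create_task(process_video(job_id))
    background_tasks.add(task)  # keep a reference so the task is not garbage collected
    task.add_done_callback(background_tasks.discard)
    return job_id


async def main() -> Dict[str, object]:
    job_id = await submit("https://youtu.be/dQw4w9WgXcQ")
    status_at_submit = jobs[job_id]["status"]  # the caller was not blocked
    await asyncio.gather(*background_tasks)
    return {"at_submit": status_at_submit, "final": jobs[job_id]["status"]}


result = asyncio.run(main())
print(result)  # {'at_submit': 'pending', 'final': 'completed'}
```

A request handler would return the job ID right after `submit`, and the frontend would poll the job record for `progress_percentage` and `current_step`.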

### Frontend Optimization
- **Virtual Scrolling**: Handle large summary lists efficiently
- **Debounced Search**: Reduce API calls during user input
- **Code Splitting**: Load components only when needed
- **React Query Caching**: Automatic request deduplication and caching

### Caching Strategy
```python
import hashlib
import json

# Multi-layer caching approach
# 1. Memory cache for hot data (current session)
# 2. Database cache for persistence (24-hour TTL)
# 3. Smart cache keys: hash(video_id + model + options)

def get_cache_key(video_id: str, model: str, options: dict) -> str:
    key_data = f"{video_id}:{model}:{json.dumps(options, sort_keys=True)}"
    return hashlib.sha256(key_data.encode()).hexdigest()
```

## Cost Optimization

### AI API Cost Management
- **Model Selection**: Default to GPT-4o-mini (about $0.15 per 1M input tokens and $0.60 per 1M output tokens at launch pricing; check current rates)
- **Token Optimization**: Efficient prompts and transcript chunking
- **Caching Strategy**: 24-hour cache reduces repeat API calls
- **Usage Tracking**: Monitor and alert on cost thresholds
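
Transcript chunking could look roughly like this. It is a sketch that splits on word counts rather than model tokens; `chunk_transcript` and its default sizes are assumptions, and a production version would more likely budget by tokens.

```python
from typing import List


def chunk_transcript(text: str, max_words: int = 1500, overlap: int = 100) -> List[str]:
    """Split text into word-bounded chunks that overlap slightly for context."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    chunks: List[str] = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        # Step back by `overlap` words so adjacent chunks share context
        start = end - overlap
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_transcript(sample, max_words=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk can be summarized separately and the partial summaries combined in a final pass, which keeps long videos under the model's context limit.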

### Target Cost Structure (Hobby Scale)
- **Base Cost**: ~$0.10/month for typical usage
- **Video Processing**: ~$0.001-0.005 per 30-minute video
- **Caching Benefit**: ~80% reduction in repeat processing costs
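
The per-video figure can be sanity-checked with some quick arithmetic. All inputs below are assumptions chosen for illustration: a speaking rate of 150 words per minute, about 4/3 tokens per word, a 500-token summary, and prices of $0.15 and $0.60 per million input and output tokens.

```python
def estimate_transcript_tokens(minutes: float, words_per_minute: int = 150) -> int:
    """Rough token count, assuming about 4/3 tokens per spoken word."""
    return round(minutes * words_per_minute * 4 / 3)


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float = 0.15,   # assumed price, USD
    output_price_per_million: float = 0.60,  # assumed price, USD
) -> float:
    """Cost of one summarization call under the assumed prices."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


tokens = estimate_transcript_tokens(30)
cost = estimate_cost_usd(tokens, output_tokens=500)
print(tokens)            # 6000
print(f"${cost:.4f}")    # $0.0012
```

Under these assumptions a 30-minute video lands near the low end of the per-video range, and roughly 80 such videos per month would add up to about $0.10.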

## Development Workflow

### Quick Start Commands
```bash
# Development setup
git clone <repository>
cd youtube-summarizer
cp .env.example .env
# Edit .env with your API keys

# Single command startup
docker-compose up

# Access points
# Frontend: http://localhost:3000
# Backend API: http://localhost:8000
# API Docs: http://localhost:8000/docs
```

### Development Scripts
```json
{
  "scripts": {
    "dev": "docker-compose up",
    "dev:backend": "cd backend && uvicorn main:app --reload",
    "dev:frontend": "cd frontend && npm run dev",
    "test": "npm run test:backend && npm run test:frontend",
    "test:backend": "cd backend && pytest",
    "test:frontend": "cd frontend && npm test",
    "build": "docker-compose build",
    "lint": "npm run lint:backend && npm run lint:frontend",
    "lint:backend": "cd backend && ruff check . && black --check . && mypy .",
    "lint:frontend": "cd frontend && eslint src && prettier --check src"
  }
}
```

### Git Hooks
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
        files: ^backend/

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.0.270
    hooks:
      - id: ruff
        files: ^backend/

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.3.0
    hooks:
      - id: mypy
        files: ^backend/
        additional_dependencies: [types-all]

  - repo: https://github.com/pre-commit/mirrors-eslint
    rev: v8.42.0
    hooks:
      - id: eslint
        files: ^frontend/src/
        types: [file]
        types_or: [ts, tsx]
```

---

## Architecture Decision Records

### ADR-001: Self-Hosted Architecture Choice
**Status**: Accepted
**Context**: User explicitly requested self-hosting and a hobby-scale deployment
**Decision**: Docker Compose deployment with local database storage
**Consequences**: Simplified deployment, reduced costs, requires local resource management

### ADR-002: AI Model Strategy
**Status**: Accepted
**Context**: Cost optimization for hobby use while maintaining quality
**Decision**: Primary OpenAI GPT-4o-mini, fallback to other models
**Consequences**: ~$0.10/month costs, good quality summaries, multiple provider support

### ADR-003: Database Evolution Path
**Status**: Accepted
**Context**: Start simple but allow growth to production scale
**Decision**: SQLite for development/hobby, PostgreSQL migration path for production
**Consequences**: Zero-config development start, clear upgrade path when needed

---

*This architecture document serves as the definitive technical guide for implementing the YouTube Summarizer application.*