# YouTube Summarizer - File Structure

## Project Overview

The YouTube Summarizer is a web application that extracts transcripts from YouTube videos, transcribes their audio, and summarizes the result with AI. A 9-tier fallback chain keeps transcript extraction reliable, and downloaded audio is retained so a video can be re-transcribed later.

## Directory Structure

```
youtube-summarizer/
├── scripts/                      # Development and deployment tools ✅ NEW
│   ├── restart-backend.sh       # Backend server restart script
│   ├── restart-frontend.sh      # Frontend server restart script
│   └── restart-both.sh          # Full stack restart script
├── logs/                        # Server logs (auto-created by scripts)
├── backend/                     # FastAPI backend application
│   ├── api/                     # API endpoints and routers
│   │   ├── auth.py             # Authentication endpoints (register, login, logout)
│   │   ├── batch.py            # Batch processing endpoints
│   │   ├── enhanced_export.py  # Enhanced export with AI intelligence ✅ Story 4.4
│   │   ├── export.py           # Export functionality endpoints
│   │   ├── history.py          # Job history API endpoints ✅ NEW
│   │   ├── pipeline.py         # Main summarization pipeline
│   │   ├── summarization.py    # AI summarization endpoints
│   │   ├── templates.py        # Template management
│   │   └── transcripts.py      # Dual transcript extraction (YouTube/Whisper)
│   ├── config/                  # Configuration modules
│   │   ├── settings.py         # Application settings
│   │   └── video_download_config.py  # Video download & storage config
│   ├── core/                    # Core utilities and foundations
│   │   ├── database_registry.py  # SQLAlchemy singleton registry pattern
│   │   ├── exceptions.py       # Custom exception classes
│   │   └── websocket_manager.py  # WebSocket connection management
│   ├── models/                  # Database models
│   │   ├── base.py             # Base model with registry integration
│   │   ├── batch.py            # Batch processing models
│   │   ├── enhanced_export.py  # Enhanced export database models ✅ Story 4.4
│   │   ├── job_history.py      # Job history models and schemas ✅ NEW
│   │   ├── summary.py          # Summary and transcript models
│   │   ├── user.py             # User authentication models
│   │   └── video_download.py   # Video download enums and configs
│   ├── services/                # Business logic services
│   │   ├── anthropic_summarizer.py  # Claude AI integration
│   │   ├── auth_service.py     # Authentication service
│   │   ├── batch_processing_service.py  # Batch job management
│   │   ├── cache_manager.py    # Multi-level caching
│   │   ├── dual_transcript_service.py  # Orchestrates YouTube/Whisper
│   │   ├── enhanced_markdown_formatter.py  # Professional document templates ✅ Story 4.4
│   │   ├── enhanced_template_manager.py  # Domain-specific AI templates ✅ Story 4.4
│   │   ├── executive_summary_generator.py  # Business-focused AI summaries ✅ Story 4.4
│   │   ├── export_service.py   # Multi-format export
│   │   ├── intelligent_video_downloader.py  # 9-tier fallback chain
│   │   ├── job_history_service.py  # Job history management ✅ NEW
│   │   ├── notification_service.py  # Real-time notifications
│   │   ├── summary_pipeline.py # Main processing pipeline
│   │   ├── timestamp_processor.py  # Semantic section detection ✅ Story 4.4
│   │   ├── transcript_service.py  # Core transcript extraction
│   │   ├── video_service.py    # YouTube metadata extraction
│   │   ├── whisper_transcript_service.py  # Legacy OpenAI Whisper (deprecated)
│   │   └── faster_whisper_transcript_service.py  # ⚡ Faster-Whisper (20-32x speed) ✅ NEW
│   ├── tests/                   # Test suites
│   │   ├── unit/               # Unit tests (229+ tests)
│   │   └── integration/        # Integration tests
│   ├── .env                    # Environment configuration
│   ├── CLAUDE.md              # Backend-specific AI guidance
│   └── main.py                # FastAPI application entry point
│
├── frontend/                    # React TypeScript frontend
│   ├── src/
│   │   ├── api/               # API client and endpoints
│   │   │   ├── apiClient.ts   # Axios-based API client
│   │   │   └── historyAPI.ts  # Job history API client ✅ NEW
│   │   ├── components/        # Reusable React components
│   │   │   ├── auth/          # Authentication components
│   │   │   │   ├── ConditionalProtectedRoute.tsx  # Smart auth wrapper ✅ NEW
│   │   │   │   └── ProtectedRoute.tsx  # Standard auth protection
│   │   │   ├── history/       # History system components ✅ NEW
│   │   │   │   └── JobDetailModal.tsx  # Enhanced history detail modal
│   │   │   ├── Batch/         # Batch processing UI
│   │   │   ├── Export/        # Export dialog components
│   │   │   ├── ProcessingProgress.tsx  # Real-time progress
│   │   │   ├── SummarizeForm.tsx  # Main form with transcript selector
│   │   │   ├── SummaryDisplay.tsx  # Summary viewer
│   │   │   ├── TranscriptComparison.tsx  # Side-by-side comparison
│   │   │   ├── TranscriptSelector.tsx  # YouTube/Whisper selector
│   │   │   └── TranscriptViewer.tsx  # Transcript display
│   │   ├── config/            # Configuration and settings ✅ NEW
│   │   │   └── app.config.ts  # App-wide configuration including auth
│   │   ├── contexts/          # React contexts
│   │   │   └── AuthContext.tsx  # Global authentication state
│   │   ├── hooks/             # Custom React hooks
│   │   │   ├── useBatchProcessing.ts  # Batch operations
│   │   │   ├── useTranscriptSelector.ts  # Transcript source logic
│   │   │   └── useWebSocket.ts  # WebSocket connection
│   │   ├── pages/             # Page components
│   │   │   ├── MainPage.tsx   # Unified main page (replaces Admin/Dashboard) ✅ NEW
│   │   │   ├── HistoryPage.tsx  # Persistent job history page ✅ NEW
│   │   │   ├── BatchProcessingPage.tsx  # Batch UI
│   │   │   └── auth/          # Authentication pages
│   │   │       ├── LoginPage.tsx  # Login form
│   │   │       └── RegisterPage.tsx  # Registration form
│   │   ├── types/             # TypeScript definitions
│   │   │   └── index.ts       # Shared type definitions
│   │   ├── utils/             # Utility functions
│   │   ├── App.tsx            # Main app component
│   │   └── main.tsx           # React entry point
│   ├── public/                # Static assets
│   ├── .env.example           # Environment variables template ✅ NEW
│   ├── package.json           # Frontend dependencies
│   └── vite.config.ts         # Vite configuration
│
├── video_storage/              # Media storage directories (auto-created)
│   ├── audio/                 # Audio files for re-transcription
│   │   ├── *.mp3             # MP3 audio files (192kbps)
│   │   └── *_metadata.json    # Audio metadata and settings
│   ├── cache/                 # API response caching
│   ├── summaries/             # Generated AI summaries
│   ├── temp/                  # Temporary processing files
│   ├── transcripts/           # Extracted transcripts
│   │   ├── *.txt             # Plain text transcripts
│   │   └── *.json            # Structured transcript data
│   └── videos/                # Downloaded video files
│
├── data/                       # Database and application data
│   ├── app.db                 # SQLite database
│   └── cache/                 # Local cache storage
│
├── scripts/                    # Utility scripts
│   ├── setup_test_env.sh      # Test environment setup
│   └── validate_test_setup.py # Test configuration validator
│
├── migrations/                 # Alembic database migrations
│   └── versions/              # Migration version files
│
├── docs/                       # Project documentation
│   ├── architecture.md        # System architecture
│   ├── prd.md                # Product requirements
│   ├── stories/              # Development stories
│   └── TESTING-INSTRUCTIONS.md  # Test guidelines
│
├── .env.example               # Environment template
├── .gitignore                # Git exclusions
├── CHANGELOG.md              # Version history
├── CLAUDE.md                 # AI development guidance
├── docker-compose.yml        # Docker services
├── Dockerfile                # Container configuration
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
└── run_tests.sh             # Test runner script
```

## Key Directories

### Backend Services (`backend/services/`)
Core business logic, including the 9-tier transcript extraction fallback chain. The tiers, in priority order:
1. **YouTube Transcript API** - Primary method using official API
2. **Auto-generated Captions** - YouTube's automatic captions
3. **Whisper AI Transcription** - OpenAI Whisper for audio
4. **PyTubeFix Downloader** - Alternative YouTube library
5. **YT-DLP Downloader** - Robust video/audio extraction
6. **Playwright Browser** - Browser automation fallback
7. **External Tools** - 4K Video Downloader integration
8. **Web Services** - Third-party transcript APIs
9. **Transcript-Only** - Metadata without full transcript
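
The control flow of a fallback chain like the one above can be sketched as follows. This is a minimal illustration, not the code in `intelligent_video_downloader.py`: the function `extract_with_fallback` and the two stand-in tiers are hypothetical names, and it assumes each tier either returns transcript text, returns nothing, or raises.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Each extractor takes a video ID and returns transcript text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    video_id: str,
    tiers: Sequence[Tuple[str, Extractor]],
) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, transcript) from the first success."""
    errors: List[str] = []
    for name, extractor in tiers:
        try:
            transcript = extractor(video_id)
        except Exception as exc:  # a failing tier must not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if transcript:
            return name, transcript
        errors.append(f"{name}: empty result")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))


# Hypothetical stand-ins for two tiers, for illustration only.
def youtube_api_tier(video_id: str) -> Optional[str]:
    raise ConnectionError("captions unavailable")


def whisper_tier(video_id: str) -> Optional[str]:
    return f"transcript for {video_id}"


tier_used, text = extract_with_fallback(
    "abc123",
    [("youtube_transcript_api", youtube_api_tier), ("whisper", whisper_tier)],
)
print(tier_used)
```

Collecting the per-tier errors instead of discarding them makes the final failure message useful when every tier fails.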

### Storage Structure (`video_storage/`)
Organized media storage with audio retention for re-transcription:
- **audio/** - MP3 files (192 kbps) with metadata, retained for later re-transcription
- **transcripts/** - Text and JSON transcripts from all sources
- **summaries/** - AI-generated summaries in multiple formats
- **cache/** - Cached API responses for performance
- **temp/** - Temporary files during processing
- **videos/** - Optional video file storage
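
Since these directories are described as auto-created, a small sketch of how that might look with `pathlib` is shown here. The helper names `ensure_storage_layout` and `audio_paths` are hypothetical, and keying file names by video ID is an assumption; only the subdirectory names and the `*.mp3` / `*_metadata.json` patterns come from the listing above.

```python
import tempfile
from pathlib import Path
from typing import Dict, Tuple

# Subdirectory names taken from the storage listing above.
STORAGE_SUBDIRS = ("audio", "cache", "summaries", "temp", "transcripts", "videos")


def ensure_storage_layout(base: Path) -> Dict[str, Path]:
    """Create the storage tree under base if it is missing; return name -> path."""
    paths = {}
    for name in STORAGE_SUBDIRS:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        paths[name] = path
    return paths


def audio_paths(base: Path, video_id: str) -> Tuple[Path, Path]:
    """Return the MP3 and metadata paths for a video (naming by video ID is assumed)."""
    audio_dir = base / "audio"
    return audio_dir / f"{video_id}.mp3", audio_dir / f"{video_id}_metadata.json"


# Demonstrate against a throwaway directory rather than the real storage path.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "video_storage"
    ensure_storage_layout(root)
    created = sorted(p.name for p in root.iterdir())

mp3, meta = audio_paths(Path("video_storage"), "abc123")
print(created)
print(mp3.name, meta.name)
```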

### Frontend Components (`frontend/src/components/`)
- **TranscriptSelector** - Radio button UI for choosing YouTube/Whisper/Both
- **TranscriptComparison** - Side-by-side quality analysis
- **ProcessingProgress** - Real-time WebSocket progress updates
- **SummarizeForm** - Main interface with source selection

### Database Models (`backend/models/`)
- **User** - Authentication and user management
- **Summary** - Video summaries with transcripts
- **BatchJob** - Batch processing management
- **RefreshToken** - JWT refresh token storage

## Configuration Files

### Environment Variables (`.env`)
```bash
# Core Configuration
USE_MOCK_SERVICES=false
ENABLE_REAL_TRANSCRIPT_EXTRACTION=true

# API Keys
YOUTUBE_API_KEY=your_key
GOOGLE_API_KEY=your_gemini_key
ANTHROPIC_API_KEY=your_claude_key

# Storage Configuration
VIDEO_DOWNLOAD_STORAGE_PATH=./video_storage
VIDEO_DOWNLOAD_KEEP_AUDIO_FILES=true
VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS=30
```
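
As an illustration of how the storage variables could be read on the Python side, here is a minimal sketch using only the standard library. The `StorageSettings` dataclass and `load_storage_settings` function are hypothetical, and the defaults simply mirror the example values above; the actual `settings.py` may load configuration differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class StorageSettings:
    storage_path: Path
    keep_audio_files: bool
    audio_cleanup_days: int


def load_storage_settings(env: Mapping[str, str] = os.environ) -> StorageSettings:
    """Read the storage variables, falling back to the example values (assumed defaults)."""
    return StorageSettings(
        storage_path=Path(env.get("VIDEO_DOWNLOAD_STORAGE_PATH", "./video_storage")),
        keep_audio_files=parse_bool(env.get("VIDEO_DOWNLOAD_KEEP_AUDIO_FILES", "true")),
        audio_cleanup_days=int(env.get("VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS", "30")),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_storage_settings(
    {"VIDEO_DOWNLOAD_KEEP_AUDIO_FILES": "false", "VIDEO_DOWNLOAD_AUDIO_CLEANUP_DAYS": "7"}
)
print(settings)
```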

### Video Download Config (`backend/config/video_download_config.py`)
- Storage paths and limits
- Download method priorities
- Audio retention settings
- Fallback chain configuration

## Testing Infrastructure

### Test Runner (`run_tests.sh`)
Runs the test suites, including 229+ unit tests:
- Fast unit tests (~0.2s)
- Integration tests
- Coverage reporting
- Parallel execution

### Test Categories
- **unit/** - Isolated service tests
- **integration/** - API endpoint tests
- **auth/** - Authentication tests
- **pipeline/** - End-to-end tests

## Development Workflows

### Quick Start
```bash
# Backend
cd backend
source venv/bin/activate
python main.py

# Frontend
cd frontend
npm install
npm run dev

# Testing
./run_tests.sh run-unit --fail-fast
```

### Admin Testing
Direct access without authentication:
```
http://localhost:3002/admin
```

### Protected App
Full application with authentication:
```
http://localhost:3002/dashboard
```

## Key Features

### Transcript Extraction
- 9-tier fallback chain for reliability
- YouTube captions and Whisper AI options
- Quality comparison and analysis
- Processing time estimation

### Audio Retention
- Automatic audio saving as MP3
- Metadata tracking for re-transcription
- Configurable retention period
- WAV to MP3 conversion
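
The configurable retention period could be enforced by a periodic cleanup along these lines. This is a sketch under assumptions: the function names are hypothetical, file age is judged by modification time, and each `name.mp3` is assumed to pair with `name_metadata.json` as in the storage listing.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86400


def find_expired_audio(audio_dir: Path, max_age_days: int, now: Optional[float] = None) -> List[Path]:
    """Return MP3 files whose modification time is older than the retention window."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(p for p in audio_dir.glob("*.mp3") if p.stat().st_mtime < cutoff)


def remove_expired_audio(audio_dir: Path, max_age_days: int, now: Optional[float] = None) -> List[str]:
    """Delete expired MP3 files and their metadata; return the removed MP3 names."""
    removed = []
    for mp3 in find_expired_audio(audio_dir, max_age_days, now):
        metadata = mp3.with_name(f"{mp3.stem}_metadata.json")
        mp3.unlink()
        if metadata.exists():
            metadata.unlink()
        removed.append(mp3.name)
    return removed


# Demonstrate with one file aged 40 days and one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    audio_dir = Path(tmp)
    old, new = audio_dir / "old.mp3", audio_dir / "new.mp3"
    for f in (old, new, audio_dir / "old_metadata.json"):
        f.write_bytes(b"")
    forty_days_ago = time.time() - 40 * SECONDS_PER_DAY
    os.utime(old, (forty_days_ago, forty_days_ago))
    removed = remove_expired_audio(audio_dir, max_age_days=30)
    remaining = sorted(p.name for p in audio_dir.iterdir())

print(removed, remaining)
```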

### Real-time Updates
- WebSocket progress tracking
- Stage-based pipeline monitoring
- Job cancellation support
- Connection recovery

### Batch Processing
- Process up to 100 videos per batch
- Sequential queue management
- Progress tracking per item
- ZIP export with organization
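
A sequential queue with a size cap and per-item status might look like the sketch below. The names `run_batch` and `MAX_BATCH_SIZE` and the result dictionary keys are hypothetical; the real `batch_processing_service.py` is not shown in this document and may be structured differently.

```python
from collections import deque
from typing import Callable, Dict, List, Sequence

MAX_BATCH_SIZE = 100  # limit taken from the feature list above


def run_batch(urls: Sequence[str], process: Callable[[str], str]) -> List[Dict[str, str]]:
    """Process URLs one at a time, recording a status per item."""
    if not urls:
        raise ValueError("batch is empty")
    if len(urls) > MAX_BATCH_SIZE:
        raise ValueError(f"batch has {len(urls)} items; limit is {MAX_BATCH_SIZE}")

    queue = deque(urls)
    results = []
    while queue:
        url = queue.popleft()
        try:
            results.append({"url": url, "status": "completed", "result": process(url)})
        except Exception as exc:  # one failed item should not stop the batch
            results.append({"url": url, "status": "failed", "error": str(exc)})
    return results


# Hypothetical processor: fails on one input to show per-item tracking.
def fake_summarize(url: str) -> str:
    if "bad" in url:
        raise ValueError("video unavailable")
    return f"summary of {url}"


batch_results = run_batch(["https://example.com/a", "https://example.com/bad"], fake_summarize)
print([item["status"] for item in batch_results])
```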

## API Endpoints

### Core Pipeline
- `POST /api/pipeline/process` - Start video processing
- `GET /api/pipeline/status/{job_id}` - Check job status
- `GET /api/pipeline/result/{job_id}` - Get results
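
A client would typically start a job, poll the status endpoint, then fetch the result. The polling step is sketched below with the HTTP call injected as a function, so the example runs without a server. The response shape (a `status` field) and the terminal status names are assumptions, not documented API behavior; `wait_for_job` and `status_url` are hypothetical helpers.

```python
import time
from typing import Callable, Dict

# Assumed terminal status values; the real API may use different names.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def status_url(base_url: str, job_id: str) -> str:
    """Build the status URL from the endpoint path listed above."""
    return f"{base_url.rstrip('/')}/api/pipeline/status/{job_id}"


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll fetch_status until the job reaches a terminal state or polls run out."""
    for _ in range(max_polls):
        payload = fetch_status(job_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


# Fake status source standing in for an HTTP GET against status_url(...).
_responses = iter(
    [{"status": "processing"}, {"status": "processing"}, {"status": "completed"}]
)
final = wait_for_job(lambda job_id: next(_responses), "job-1", sleep=lambda s: None)

print(status_url("http://localhost:8000/", "job-1"))
print(final["status"])
```

In a real client, `fetch_status` would wrap an HTTP GET and JSON decode; injecting it keeps the polling logic testable on its own.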

### Dual Transcripts
- `POST /api/transcripts/dual/extract` - Extract with options
- `GET /api/transcripts/dual/compare/{video_id}` - Compare sources

### Authentication
- `POST /api/auth/register` - User registration
- `POST /api/auth/login` - User login
- `POST /api/auth/refresh` - Token refresh

### Batch Operations
- `POST /api/batch/jobs` - Create batch job
- `GET /api/batch/jobs/{job_id}` - Job status
- `GET /api/batch/export/{job_id}` - Export results

### Enhanced Export System ✅ Story 4.4
- `POST /api/export/enhanced` - Generate professional export with AI intelligence
- `GET /api/export/config` - Available export configuration options
- `POST /api/export/templates` - Create custom prompt templates
- `GET /api/export/templates` - List and filter domain templates
- `POST /api/export/recommendations` - Get domain-specific template recommendations
- `GET /api/export/templates/{id}/analytics` - Template performance metrics
- `GET /api/export/system/stats` - Overall system statistics

## Database Schema

### Core Tables
- `users` - User accounts and profiles
- `summaries` - Video summaries and metadata
- `refresh_tokens` - JWT refresh tokens
- `batch_jobs` - Batch processing jobs
- `batch_job_items` - Individual batch items

## Docker Services

### docker-compose.yml
```yaml
services:
  backend:
    build: .
    ports: ["8000:8000"]
    volumes: ["./video_storage:/app/video_storage"]

  frontend:
    build: ./frontend
    ports: ["3002:3002"]

  redis:
    image: redis:alpine
    ports: ["6379:6379"]
```

## Version History

- **v5.1.0** - 9-tier fallback chain, audio retention
- **v5.0.0** - MCP server, SDKs, agent frameworks
- **v4.1.0** - Dual transcript options
- **v3.5.0** - Real-time WebSocket updates
- **v3.4.0** - Batch processing
- **v3.3.0** - Summary history
- **v3.2.0** - Frontend authentication
- **v3.1.0** - Backend authentication

---

*Last updated: 2025-08-27 - Added transcript fallback chain and audio retention features*