# Story 1.5: Video Download and Storage Service
## Status
Done
## Story
**As a** user
**I want** the system to download and store YouTube videos locally
**So that** I can process videos offline, build a personal archive, and get better transcription quality
## Acceptance Criteria
1. System downloads videos using yt-dlp with configurable quality settings
2. Audio is automatically extracted from videos for transcription purposes
3. Videos are stored in an organized directory structure by video ID
4. Download progress is tracked and displayed to users in real-time
5. Duplicate downloads are prevented through cache checking
6. Storage management enforces size limits and provides cleanup options
7. Service integrates seamlessly with existing transcript extraction (Story 1.3)
## Tasks / Subtasks
- [ ] **Task 1: Video Download Service Core** (AC: 1, 2, 3)
  - [ ] Create `VideoDownloadService` class in `backend/services/video_download_service.py`
  - [ ] Implement yt-dlp integration for video downloading
  - [ ] Add audio extraction using FFmpeg or yt-dlp post-processors
  - [ ] Create organized storage directory structure (`/data/youtube-videos/{video_id}/`)
- [ ] **Task 2: Cache and Duplicate Prevention** (AC: 5)
  - [ ] Implement download cache tracking in JSON file
  - [ ] Create video hash generation for unique identification
  - [ ] Add duplicate detection before download attempts
  - [ ] Implement cache invalidation and refresh logic
- [ ] **Task 3: Storage Management** (AC: 6)
  - [ ] Implement storage size calculation and monitoring
  - [ ] Create automatic cleanup for oldest videos when limits exceeded
  - [ ] Add configurable storage limits (default 10 GB)
  - [ ] Implement manual cleanup endpoints for user control
- [ ] **Task 4: Progress Tracking** (AC: 4)
  - [ ] Implement yt-dlp progress hooks for download tracking
  - [ ] Create WebSocket endpoint for real-time progress updates
  - [ ] Add progress state management for multiple concurrent downloads
  - [ ] Implement download queue with status tracking
- [ ] **Task 5: API Integration** (AC: 1, 4, 6)
  - [ ] Create `/api/videos/download` POST endpoint
  - [ ] Implement `/api/videos/stats` GET endpoint for storage statistics
  - [ ] Add `/api/videos/cleanup` POST endpoint for manual cleanup
  - [ ] Create background task support for non-blocking downloads
- [ ] **Task 6: Transcript Service Integration** (AC: 7)
  - [ ] Enhance `TranscriptService` to check for local video files first
  - [ ] Implement local audio transcription using Whisper or similar
  - [ ] Add fallback chain: local file → YouTube API → audio download + transcription
  - [ ] Update transcript caching to reference local files when available
- [ ] **Task 7: Testing** (AC: 1-7)
  - [ ] Unit tests for VideoDownloadService class
  - [ ] Integration tests with small test videos
  - [ ] Test storage management and cleanup logic
  - [ ] Verify transcript integration works with local files
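
For Task 4, one possible shape for the progress state is sketched below. It assumes the dictionary keys that yt-dlp passes to `progress_hooks` callbacks (`status`, `downloaded_bytes`, `total_bytes` or `total_bytes_estimate`, `speed`, `eta`). The class name `DownloadProgressTracker` and its methods are hypothetical, not part of the story. A WebSocket endpoint could poll or push the snapshots it stores.

```python
import threading
from typing import Any, Callable, Dict, Optional


class DownloadProgressTracker:
    """Hypothetical helper: keeps the latest progress snapshot per video ID."""

    def __init__(self) -> None:
        # yt-dlp runs hooks in the downloading thread, so guard shared state.
        self._lock = threading.Lock()
        self._state: Dict[str, Dict[str, Any]] = {}

    def make_hook(self, video_id: str) -> Callable[[Dict[str, Any]], None]:
        """Return a callable that can be listed under yt-dlp's `progress_hooks`."""

        def hook(d: Dict[str, Any]) -> None:
            total = d.get("total_bytes") or d.get("total_bytes_estimate")
            done = d.get("downloaded_bytes") or 0
            percent = round(100.0 * done / total, 1) if total else None
            if d.get("status") == "finished":
                percent = 100.0
            snapshot = {
                "status": d.get("status"),
                "percent": percent,
                "speed_bps": d.get("speed"),
                "eta_seconds": d.get("eta"),
            }
            with self._lock:
                self._state[video_id] = snapshot

        return hook

    def get(self, video_id: str) -> Optional[Dict[str, Any]]:
        """Return a copy of the latest snapshot, or None if nothing was reported."""
        with self._lock:
            snapshot = self._state.get(video_id)
            return dict(snapshot) if snapshot else None
```

Keying the state by video ID lets several concurrent downloads share one tracker. Note that a `finished` status may arrive more than once for a single video when separate video and audio streams are fetched, so it should probably not be treated as "the whole job is done" on its own.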
## Dev Notes
### Architecture Context
This story adds local video download and storage, which enables offline processing and higher-quality transcription from local audio files. It is an optional enhancement to the transcript extraction service: that service can use local files when they exist but does not depend on them.
### File Locations and Structure
**Backend Files**:
- `backend/services/video_download_service.py` - Main download service
- `backend/services/storage_manager.py` - Storage management utilities
- `backend/api/videos.py` - Video download API endpoints
- `backend/models/video.py` - Video data models
- `backend/tests/unit/test_video_download_service.py` - Unit tests
- `backend/tests/integration/test_video_api.py` - Integration tests
**Storage Structure**:
```
data/
└── youtube-videos/
    ├── download_cache.json
    └── {video_id}/
        ├── video.mp4
        ├── audio.mp3
        ├── metadata.json
        └── thumbnail.jpg
```
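
A minimal sketch of how `download_cache.json` could back the duplicate check from Task 2, assuming the layout above. The class `DownloadCache`, its method names, and the shape of a cache entry are illustrative assumptions. The story does not define the cache schema.

```python
import json
from pathlib import Path
from typing import Dict, Optional


class DownloadCache:
    """Hypothetical JSON-backed record of downloaded videos, keyed by video ID."""

    def __init__(self, storage_dir: Path) -> None:
        self.storage_dir = Path(storage_dir)
        self.cache_file = self.storage_dir / "download_cache.json"

    def video_dir(self, video_id: str) -> Path:
        return self.storage_dir / video_id

    def _load(self) -> Dict[str, dict]:
        if not self.cache_file.exists():
            return {}
        try:
            return json.loads(self.cache_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # Treat a corrupt cache as empty rather than failing every download.
            return {}

    def lookup(self, video_id: str) -> Optional[dict]:
        """Return the entry only if the video file is still on disk."""
        entry = self._load().get(video_id)
        if entry and (self.video_dir(video_id) / "video.mp4").exists():
            return entry
        return None

    def record(self, video_id: str, entry: dict) -> None:
        cache = self._load()
        cache[video_id] = entry
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file, then rename, so readers never see a partial file.
        tmp = self.cache_file.with_name(self.cache_file.name + ".tmp")
        tmp.write_text(json.dumps(cache, indent=2), encoding="utf-8")
        tmp.replace(self.cache_file)
```

Checking that `video.mp4` still exists in `lookup` is one simple form of cache invalidation: an entry whose files were removed by cleanup is treated as a miss, so the video is downloaded again.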
### Configuration
```bash
# Environment variables (.env)
VIDEO_STORAGE_DIR="data/youtube-videos"
MAX_STORAGE_SIZE_GB=10.0
VIDEO_DOWNLOAD_QUALITY="720p" # or "best", "1080p", etc.
ENABLE_LOCAL_TRANSCRIPTION=true
KEEP_VIDEOS_AFTER_PROCESSING=true
```
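
The variables above could be read and turned into yt-dlp options roughly as follows. This is a sketch under assumptions: `VideoSettings` and `build_ydl_opts` are hypothetical names, and the mapping from a quality label such as `720p` to a yt-dlp format selector is one reasonable choice, not something the story specifies. The option keys (`format`, `merge_output_format`, `outtmpl`, `keepvideo`, `postprocessors`) are existing `yt_dlp.YoutubeDL` options.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Mapping


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class VideoSettings:
    """Hypothetical settings object mirroring the .env variables."""

    storage_dir: Path
    max_storage_gb: float
    quality: str
    enable_local_transcription: bool
    keep_videos: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "VideoSettings":
        return cls(
            storage_dir=Path(env.get("VIDEO_STORAGE_DIR", "data/youtube-videos")),
            max_storage_gb=float(env.get("MAX_STORAGE_SIZE_GB", "10.0")),
            quality=env.get("VIDEO_DOWNLOAD_QUALITY", "720p"),
            enable_local_transcription=_as_bool(env.get("ENABLE_LOCAL_TRANSCRIPTION", "true")),
            keep_videos=_as_bool(env.get("KEEP_VIDEOS_AFTER_PROCESSING", "true")),
        )


def build_ydl_opts(settings: VideoSettings, video_id: str) -> Dict[str, Any]:
    """Translate settings into an options dict for yt_dlp.YoutubeDL."""
    if settings.quality == "best":
        fmt = "bestvideo+bestaudio/best"
    else:
        # "720p" -> 720; cap the video height at the configured value.
        height = int(settings.quality.rstrip("p"))
        fmt = f"bestvideo[height<={height}]+bestaudio/best[height<={height}]"
    out_dir = settings.storage_dir / video_id
    return {
        "format": fmt,
        "merge_output_format": "mp4",
        "outtmpl": str(out_dir / "video.%(ext)s"),
        # Keep the video after the audio post-processor runs.
        "keepvideo": True,
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "mp3"},
        ],
    }
```

With this template the extracted audio would be expected to land next to the video under the same base name (`video.mp3`), so renaming it to `audio.mp3` to match the storage layout is left to the caller. Passing a plain mapping to `from_env` instead of `os.environ` keeps the settings easy to test.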
### Integration Points
- **Story 1.2** (URL Validation): Validates URLs before download
- **Story 1.3** (Transcript Extraction): Provides local files for transcription
- **Epic 2**: AI summarization can work with higher-quality local transcripts
### Technical Requirements
- **yt-dlp**: Already installed (`/opt/homebrew/bin/yt-dlp`)
- **FFmpeg**: Required for audio extraction (verify it is available on `PATH` before enabling downloads)
- **Disk Space**: Minimum 10 GB recommended for video storage
- **Python packages**: `yt-dlp`, `openai-whisper` (optional, for local transcription; imported as `whisper`)
### Security Considerations
- Validate URLs to prevent arbitrary file downloads
- Sanitize file names to prevent path traversal
- Implement rate limiting on download endpoints
- Consider user authentication for future multi-user scenarios
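
The path traversal point can be illustrated with a small guard that runs before anything is written under the storage directory. This is a sketch, not the service's actual code: `safe_video_dir` is a hypothetical helper, and the 11-character pattern reflects the video ID format YouTube uses today, which is an assumption that could change.

```python
import re
from pathlib import Path

# Assumption: video IDs are 11 characters drawn from this URL-safe alphabet.
_VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def safe_video_dir(storage_root: Path, video_id: str) -> Path:
    """Return the per-video directory, rejecting IDs that could escape the root."""
    if not _VIDEO_ID_RE.fullmatch(video_id):
        raise ValueError(f"invalid video id: {video_id!r}")
    root = storage_root.resolve()
    candidate = (root / video_id).resolve()
    # Defence in depth: even a well-formed ID must resolve inside the root
    # (this also catches a symlinked directory pointing elsewhere).
    if root not in candidate.parents:
        raise ValueError("resolved path escapes the storage root")
    return candidate
```

Validating the ID with an allow-list pattern is stricter than stripping dangerous characters from it, and it means file names under `{video_id}/` can stay fixed (`video.mp4`, `audio.mp3`) instead of being derived from untrusted video titles.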
### Performance Optimization
- Download videos asynchronously using background tasks
- Implement concurrent download limits to prevent resource exhaustion
- Use streaming for large files to reduce memory usage
- Cache frequently accessed videos for faster retrieval
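
A concurrent download limit can be sketched with an `asyncio.Semaphore`. The class `DownloadQueue` and the injected `download` coroutine are hypothetical. In a real service the coroutine would probably wrap the blocking yt-dlp call with something like `asyncio.to_thread`. The sketch also reuses the in-flight task when the same video is requested twice.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class DownloadQueue:
    """Hypothetical queue: caps parallel downloads and de-duplicates by video ID.

    Create it inside a running event loop (for example at application startup).
    """

    def __init__(
        self,
        download: Callable[[str], Awaitable[str]],
        max_concurrent: int = 2,
    ) -> None:
        self._download = download
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._in_flight: Dict[str, "asyncio.Task[str]"] = {}

    async def _run(self, video_id: str) -> str:
        try:
            # At most `max_concurrent` downloads pass this point at once.
            async with self._semaphore:
                return await self._download(video_id)
        finally:
            self._in_flight.pop(video_id, None)

    def submit(self, video_id: str) -> "asyncio.Task[str]":
        """Schedule a download, or return the task already running for this ID."""
        task = self._in_flight.get(video_id)
        if task is None:
            task = asyncio.create_task(self._run(video_id))
            self._in_flight[video_id] = task
        return task
```

Returning the existing task for a repeated ID means two API requests for the same video wait on one download rather than racing to write the same files.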
## Change Log
| Date | Version | Description | Author |
|------|---------|-------------|--------|
| 2025-01-26 | 1.0 | Initial story creation | Claude Code |
## QA Notes
### Testing Approach
- Use small, Creative Commons licensed videos for testing
- Mock yt-dlp for unit tests to avoid actual downloads
- Test with various video formats and qualities
- Verify cleanup doesn't delete actively used files
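
One way to mock yt-dlp in unit tests is to inject the downloader factory, so the test never imports or calls the real library. The wrapper `download_video` below is a hypothetical stand-in for the service method. Production code would pass `yt_dlp.YoutubeDL` as the factory, and `extract_info(url, download=True)` is the call being replaced.

```python
from typing import Any, Callable, Dict
from unittest.mock import MagicMock


def download_video(url: str, ydl_factory: Callable[[Dict[str, Any]], Any]) -> Dict[str, Any]:
    """Hypothetical thin wrapper around a YoutubeDL-like object."""
    opts = {"format": "best", "quiet": True}
    with ydl_factory(opts) as ydl:
        info = ydl.extract_info(url, download=True)
    return {"video_id": info["id"], "title": info.get("title")}


def test_download_video_uses_extract_info() -> None:
    url = "https://www.youtube.com/watch?v=abcDEF_1234"  # made-up ID
    fake_ydl = MagicMock()
    fake_ydl.__enter__.return_value = fake_ydl  # support the `with` block
    fake_ydl.extract_info.return_value = {"id": "abcDEF_1234", "title": "Sample"}
    factory = MagicMock(return_value=fake_ydl)

    result = download_video(url, factory)

    assert result == {"video_id": "abcDEF_1234", "title": "Sample"}
    fake_ydl.extract_info.assert_called_once_with(url, download=True)
    factory.assert_called_once()
```

The same test could instead patch the name where the service imports it with `unittest.mock.patch`. Injection is shown here because it keeps the example independent of the module layout.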
### Edge Cases to Test
- Network interruptions during download
- Videos that exceed storage limits
- Private or deleted videos
- Concurrent downloads of the same video
- Disk full scenarios
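
For the storage limit case, an oldest-first cleanup that skips videos still in use might look like the sketch below. The function names and the `in_use` parameter are assumptions. The story only says that the oldest videos are removed when the limit is exceeded and that actively used files must survive.

```python
import shutil
from pathlib import Path
from typing import Iterable, List, Set


def dir_size_bytes(path: Path) -> int:
    """Total size of all regular files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def cleanup_oldest(storage_root: Path, max_bytes: int, in_use: Iterable[str] = ()) -> List[str]:
    """Delete the oldest video directories until usage is within `max_bytes`.

    Directories named in `in_use` (hypothetical: IDs currently being
    downloaded or transcribed) are never removed. Returns the removed IDs.
    """
    protected: Set[str] = set(in_use)
    video_dirs = [p for p in storage_root.iterdir() if p.is_dir()]
    sizes = {p: dir_size_bytes(p) for p in video_dirs}
    total = sum(sizes.values())
    removed: List[str] = []
    # Oldest first, using the directory's modification time as its age.
    for video_dir in sorted(video_dirs, key=lambda p: p.stat().st_mtime):
        if total <= max_bytes:
            break
        if video_dir.name in protected:
            continue
        shutil.rmtree(video_dir)
        total -= sizes[video_dir]
        removed.append(video_dir.name)
    return removed
```

If every remaining directory is protected, the function can return while usage is still above the limit, so the caller should probably refuse new downloads in that situation. Entries for removed IDs would also need to be dropped from `download_cache.json`.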
### Performance Benchmarks
- Download speed: Should achieve > 1 MB/s for typical videos
- Audio extraction: < 30 seconds for a 10-minute video
- Storage cleanup: < 5 seconds for 100 videos
- Cache lookup: < 100ms
---
**Story Status**: Draft - Ready for review and implementation
**Dependencies**: Stories 1.1 (Setup), 1.2 (URL Validation)
**Blocks**: Can enhance Story 1.3 (Transcript Extraction) functionality
**Priority**: High - Foundational capability for offline processing