Story 1.5: Video Download and Storage Service
Status
Done
Story
As a user
I want the system to download and store YouTube videos locally
So that I can process videos offline, build a personal archive, and get better transcription quality
Acceptance Criteria
- System downloads videos using yt-dlp with configurable quality settings
- Audio is automatically extracted from videos for transcription purposes
- Videos are stored in an organized directory structure by video ID
- Download progress is tracked and displayed to users in real-time
- Duplicate downloads are prevented through cache checking
- Storage management enforces size limits and provides cleanup options
- Service integrates seamlessly with existing transcript extraction (Story 1.3)
Tasks / Subtasks
- Task 1: Video Download Service Core (AC: 1, 2, 3)
  - Create VideoDownloadService class in backend/services/video_download_service.py
  - Implement yt-dlp integration for video downloading
  - Add audio extraction using FFmpeg or yt-dlp post-processors
  - Create organized storage directory structure (/data/youtube-videos/{video_id}/)
- Task 2: Cache and Duplicate Prevention (AC: 5)
  - Implement download cache tracking in JSON file
  - Create video hash generation for unique identification
  - Add duplicate detection before download attempts
  - Implement cache invalidation and refresh logic
- Task 3: Storage Management (AC: 6)
  - Implement storage size calculation and monitoring
  - Create automatic cleanup for oldest videos when limits exceeded
  - Add configurable storage limits (default 10GB)
  - Implement manual cleanup endpoints for user control
- Task 4: Progress Tracking (AC: 4)
  - Implement yt-dlp progress hooks for download tracking
  - Create WebSocket endpoint for real-time progress updates
  - Add progress state management for multiple concurrent downloads
  - Implement download queue with status tracking
- Task 5: API Integration (AC: 1, 4, 6)
  - Create /api/videos/download POST endpoint
  - Implement /api/videos/stats GET endpoint for storage statistics
  - Add /api/videos/cleanup POST endpoint for manual cleanup
  - Create background task support for non-blocking downloads
- Task 6: Transcript Service Integration (AC: 7)
  - Enhance TranscriptService to check for local video files first
  - Implement local audio transcription using Whisper or similar
  - Add fallback chain: local file → YouTube API → audio download + transcription
  - Update transcript caching to reference local files when available
- Task 7: Testing (AC: 1-7)
  - Unit tests for VideoDownloadService class
  - Integration tests with small test videos
  - Test storage management and cleanup logic
  - Verify transcript integration works with local files
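The cleanup rule from Task 3 (remove the oldest videos once the storage limit is exceeded) could be sketched roughly as follows. This is an illustration under stated assumptions, not the story's implementation: `directory_size` and `cleanup_oldest` are hypothetical helper names, and "oldest" is assumed to mean the per-video directory with the oldest modification time.

```python
import shutil
from pathlib import Path


def directory_size(path: Path) -> int:
    """Return the total size in bytes of all regular files under path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def cleanup_oldest(storage_dir: Path, max_bytes: int) -> list[str]:
    """Delete per-video directories, oldest first, until usage fits max_bytes.

    Returns the directory names (video IDs) that were removed.
    """
    video_dirs = [d for d in storage_dir.iterdir() if d.is_dir()]
    # Assumption: a directory's mtime approximates when the video was stored.
    video_dirs.sort(key=lambda d: d.stat().st_mtime)
    sizes = {d: directory_size(d) for d in video_dirs}
    total = sum(sizes.values())

    removed = []
    for d in video_dirs:
        if total <= max_bytes:
            break
        shutil.rmtree(d)
        total -= sizes[d]
        removed.append(d.name)
    return removed
```

The sketch does not check whether a video is currently being processed; a real cleanup would need to skip directories that are in active use.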
Dev Notes
Architecture Context
This story adds local video download and storage, which enables offline processing and higher-quality transcription from local files. It is an optional enhancement to the transcript extraction service (Story 1.3); that service continues to work without local files.
File Locations and Structure
Backend Files:
- backend/services/video_download_service.py - Main download service
- backend/services/storage_manager.py - Storage management utilities
- backend/api/videos.py - Video download API endpoints
- backend/models/video.py - Video data models
- backend/tests/unit/test_video_download_service.py - Unit tests
- backend/tests/integration/test_video_api.py - Integration tests
Storage Structure:
data/
└── youtube-videos/
    ├── download_cache.json
    └── {video_id}/
        ├── video.mp4
        ├── audio.mp3
        ├── metadata.json
        └── thumbnail.jpg
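With this layout, duplicate detection can be a lookup in download_cache.json before any download starts. The sketch below assumes the cache is a JSON object keyed by video ID; that schema and the `DownloadCache` class are illustrative and not taken from the implementation.

```python
import json
from pathlib import Path


class DownloadCache:
    """Tracks completed downloads in a JSON file (hypothetical schema)."""

    def __init__(self, storage_dir: Path):
        self.storage_dir = storage_dir
        self.cache_file = storage_dir / "download_cache.json"

    def _load(self) -> dict:
        if not self.cache_file.exists():
            return {}
        return json.loads(self.cache_file.read_text(encoding="utf-8"))

    def video_dir(self, video_id: str) -> Path:
        return self.storage_dir / video_id

    def is_downloaded(self, video_id: str) -> bool:
        """True only if the cache lists the video and video.mp4 still exists."""
        if video_id not in self._load():
            return False
        return (self.video_dir(video_id) / "video.mp4").exists()

    def record(self, video_id: str, metadata: dict) -> None:
        """Add or overwrite the cache entry for video_id."""
        cache = self._load()
        cache[video_id] = metadata
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        self.cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```

Checking the file on disk as well as the cache entry treats a stale entry (for example, after a manual delete) as "not downloaded", which is one simple form of cache invalidation.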
Configuration
# Environment variables (.env)
VIDEO_STORAGE_DIR="data/youtube-videos"
MAX_STORAGE_SIZE_GB=10.0
VIDEO_DOWNLOAD_QUALITY="720p" # or "best", "1080p", etc.
ENABLE_LOCAL_TRANSCRIPTION=true
KEEP_VIDEOS_AFTER_PROCESSING=true
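A minimal sketch of how these variables might be read at startup, and how the quality label could be turned into a yt-dlp format selector. `VideoSettings`, `load_settings`, and `format_selector` are hypothetical names. The selector strings use yt-dlp's format-selection syntax, but the mapping from label to selector is an assumption, as is treating 1 GB as 1024³ bytes.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class VideoSettings:
    storage_dir: Path
    max_storage_bytes: int
    download_quality: str
    enable_local_transcription: bool
    keep_videos_after_processing: bool


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> VideoSettings:
    """Build settings from environment variables, falling back to the defaults above."""
    return VideoSettings(
        storage_dir=Path(env.get("VIDEO_STORAGE_DIR", "data/youtube-videos")),
        # Assumption: 1 GB is interpreted as 1024**3 bytes.
        max_storage_bytes=int(float(env.get("MAX_STORAGE_SIZE_GB", "10.0")) * 1024**3),
        download_quality=env.get("VIDEO_DOWNLOAD_QUALITY", "720p"),
        enable_local_transcription=_as_bool(env.get("ENABLE_LOCAL_TRANSCRIPTION", "true")),
        keep_videos_after_processing=_as_bool(env.get("KEEP_VIDEOS_AFTER_PROCESSING", "true")),
    )


def format_selector(quality: str) -> str:
    """Map a label such as "720p" to a yt-dlp format selector (assumed mapping)."""
    if quality == "best":
        return "bestvideo+bestaudio/best"
    height = int(quality.rstrip("p"))
    return f"bestvideo[height<={height}]+bestaudio/best[height<={height}]"
```

The resulting selector would be passed as the `format` option when constructing the yt-dlp downloader.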
Integration Points
- Story 1.2 (URL Validation): Validates URLs before download
- Story 1.3 (Transcript Extraction): Provides local files for transcription
- Epic 2: AI summarization can work with higher quality local transcripts
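The fallback order named in Task 6 (local file, then YouTube API, then audio download plus transcription) could be expressed as an ordered list of sources tried in turn. This is a sketch only: `get_transcript` and the source callables are hypothetical, and how `TranscriptService` actually structures this is not specified here.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each source is (name, fetch); fetch returns the transcript text or None.
TranscriptSource = Tuple[str, Callable[[str], Optional[str]]]


def get_transcript(video_id: str, sources: Sequence[TranscriptSource]) -> Tuple[str, str]:
    """Try each source in order and return (source_name, transcript) for the first hit."""
    for name, fetch in sources:
        try:
            text = fetch(video_id)
        except Exception:
            # Assumption: any failure in one source simply moves on to the next.
            continue
        if text:
            return name, text
    raise LookupError(f"no transcript source succeeded for {video_id!r}")
```

Returning the source name alongside the text lets the transcript cache record whether a result came from a local file.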
Technical Requirements
- yt-dlp: Already installed (/opt/homebrew/bin/yt-dlp)
- FFmpeg: Required for audio extraction (check availability)
- Disk Space: Minimum 10GB recommended for video storage
- Python packages: yt-dlp, whisper (published on PyPI as openai-whisper; optional for transcription)
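The availability check for FFmpeg and yt-dlp can be done with the standard library before the service accepts downloads. `find_missing_tools` is a hypothetical helper name.

```python
import shutil
from typing import Sequence


def find_missing_tools(tools: Sequence[str] = ("yt-dlp", "ffmpeg")) -> list[str]:
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

A service could call this at startup and either fail fast or disable audio extraction when `ffmpeg` is reported missing.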
Security Considerations
- Validate URLs to prevent arbitrary file downloads
- Sanitize file names to prevent path traversal
- Implement rate limiting on download endpoints
- Consider user authentication for future multi-user scenarios
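Because videos are stored by video ID, path traversal can be blocked by validating the ID before it is ever joined to the storage directory. The sketch below assumes YouTube video IDs are 11 characters drawn from letters, digits, `_`, and `-`; `safe_video_dir` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumption: YouTube video IDs are 11 characters from [A-Za-z0-9_-].
_VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def safe_video_dir(storage_dir: Path, video_id: str) -> Path:
    """Return the storage directory for video_id, rejecting unsafe IDs."""
    if not _VIDEO_ID_RE.fullmatch(video_id):
        raise ValueError(f"invalid video ID: {video_id!r}")
    root = storage_dir.resolve()
    target = (root / video_id).resolve()
    # Defence in depth: the resolved path must sit directly inside the root.
    if target.parent != root:
        raise ValueError(f"path escapes storage directory: {video_id!r}")
    return target
```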
Performance Optimization
- Download videos asynchronously using background tasks
- Implement concurrent download limits to prevent resource exhaustion
- Use streaming for large files to reduce memory usage
- Cache frequently accessed videos for faster retrieval
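One way to cap concurrent downloads is an `asyncio.Semaphore` around each job. `DownloadLimiter` is a hypothetical class, the default of two concurrent downloads is an arbitrary assumption, and the `active` and `peak` counters exist only to make the behaviour observable.

```python
import asyncio
from typing import Awaitable, Callable


class DownloadLimiter:
    """Caps the number of downloads running at once with a semaphore."""

    def __init__(self, max_concurrent: int = 2):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.active = 0  # downloads currently running
        self.peak = 0    # highest value `active` has reached

    async def run(self, job: Callable[[], Awaitable[str]]) -> str:
        """Wait for a free slot, then run the job and return its result."""
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job()
            finally:
                self.active -= 1
```

Since yt-dlp downloads are blocking calls, a real job would likely hand the download to a worker thread (for example via `asyncio.to_thread`) rather than run it on the event loop.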
Change Log
| Date | Version | Description | Author |
|---|---|---|---|
| 2025-01-26 | 1.0 | Initial story creation | Claude Code |
QA Notes
Testing Approach
- Use small, Creative Commons licensed videos for testing
- Mock yt-dlp for unit tests to avoid actual downloads
- Test with various video formats and qualities
- Verify cleanup doesn't delete actively used files
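Progress tracking is one place where yt-dlp can be mocked cheaply: a progress hook is just a callable that receives a dictionary, so a unit test can feed it hand-built dictionaries without downloading anything. `ProgressTracker` is a hypothetical class; the keys it reads (`status`, `downloaded_bytes`, `total_bytes`, `total_bytes_estimate`) are the ones yt-dlp passes to progress hooks.

```python
from typing import Optional


class ProgressTracker:
    """Collects progress for one download from yt-dlp progress-hook dicts."""

    def __init__(self) -> None:
        self.status = "queued"
        self.percent: Optional[float] = None

    def hook(self, d: dict) -> None:
        """Callable with the shape yt-dlp expects in its `progress_hooks` option."""
        self.status = d.get("status", self.status)
        if self.status == "finished":
            self.percent = 100.0
            return
        # total_bytes may be missing or None; fall back to the estimate.
        total = d.get("total_bytes") or d.get("total_bytes_estimate")
        done = d.get("downloaded_bytes")
        if total and done is not None:
            self.percent = round(100.0 * done / total, 1)
```

In the service, the same method would be registered as `{"progress_hooks": [tracker.hook]}` in the options passed to `yt_dlp.YoutubeDL`. Note that a "finished" status refers to a single downloaded file, so a download that fetches separate video and audio streams may report it more than once.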
Edge Cases to Test
- Network interruptions during download
- Videos that exceed storage limits
- Private or deleted videos
- Concurrent downloads of the same video
- Disk full scenarios
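For the "concurrent downloads of the same video" case, one possible approach is to share a single in-flight task per video ID so that later requests wait on the first one instead of starting a second download. `InFlightDownloads` is a hypothetical class, and this sketch covers only requests within one process and one event loop.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class InFlightDownloads:
    """Ensures at most one download task exists per video ID at a time."""

    def __init__(self) -> None:
        self._tasks: Dict[str, asyncio.Future] = {}

    async def get_or_start(
        self, video_id: str, download: Callable[[str], Awaitable[str]]
    ) -> str:
        task = self._tasks.get(video_id)
        if task is None:
            # No await between the lookup and the store, so two callers
            # cannot both start a download for the same ID.
            task = asyncio.ensure_future(download(video_id))
            self._tasks[video_id] = task
            # Forget the task once it completes, whether it succeeded or failed.
            task.add_done_callback(lambda _t: self._tasks.pop(video_id, None))
        return await task
```

A test for this edge case can then assert that three simultaneous requests for one ID trigger exactly one call to the download function.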
Performance Benchmarks
- Download speed: Should achieve >1 MB/s for typical videos
- Audio extraction: < 30 seconds for a 10-minute video
- Storage cleanup: < 5 seconds for 100 videos
- Cache lookup: < 100 ms
Story Status: Draft - Ready for review and implementation
Dependencies: Stories 1.1 ✅ (Setup), 1.2 ✅ (URL Validation)
Blocks: Can enhance Story 1.3 (Transcript Extraction) functionality
Priority: High - Foundational capability for offline processing