6.2 KiB

Raw Permalink Blame History

Story 1.5: Video Download and Storage Service

Status

Done

Story

As a user
I want the system to download and store YouTube videos locally
So that I can process videos offline, build a personal archive, and get better transcription quality

Acceptance Criteria

System downloads videos using yt-dlp with configurable quality settings
Audio is automatically extracted from videos for transcription purposes
Videos are stored in an organized directory structure by video ID
Download progress is tracked and displayed to users in real-time
Duplicate downloads are prevented through cache checking
Storage management enforces size limits and provides cleanup options
Service integrates seamlessly with existing transcript extraction (Story 1.3)

Tasks / Subtasks

Task 1: Video Download Service Core (AC: 1, 2, 3)
- Create VideoDownloadService class in backend/services/video_download_service.py
- Implement yt-dlp integration for video downloading
- Add audio extraction using FFmpeg or yt-dlp post-processors
- Create organized storage directory structure (/data/youtube-videos/{video_id}/)
Task 2: Cache and Duplicate Prevention (AC: 5)
- Implement download cache tracking in JSON file
- Create video hash generation for unique identification
- Add duplicate detection before download attempts
- Implement cache invalidation and refresh logic
Task 3: Storage Management (AC: 6)
- Implement storage size calculation and monitoring
- Create automatic cleanup for oldest videos when limits exceeded
- Add configurable storage limits (default 10GB)
- Implement manual cleanup endpoints for user control
Task 4: Progress Tracking (AC: 4)
- Implement yt-dlp progress hooks for download tracking
- Create WebSocket endpoint for real-time progress updates
- Add progress state management for multiple concurrent downloads
- Implement download queue with status tracking
Task 5: API Integration (AC: 1, 4, 6)
- Create /api/videos/download POST endpoint
- Implement /api/videos/stats GET endpoint for storage statistics
- Add /api/videos/cleanup POST endpoint for manual cleanup
- Create background task support for non-blocking downloads
Task 6: Transcript Service Integration (AC: 7)
- Enhance TranscriptService to check for local video files first
- Implement local audio transcription using Whisper or similar
- Add fallback chain: local file → YouTube API → audio download + transcription
- Update transcript caching to reference local files when available
Task 7: Testing (AC: 1-7)
- Unit tests for VideoDownloadService class
- Integration tests with small test videos
- Test storage management and cleanup logic
- Verify transcript integration works with local files

Dev Notes

Architecture Context

This story adds a foundational capability that enhances the entire application by enabling offline processing and better quality transcription through local file access. It serves as an optional but powerful enhancement to the transcript extraction service.

File Locations and Structure

Backend Files:

backend/services/video_download_service.py - Main download service
backend/services/storage_manager.py - Storage management utilities
backend/api/videos.py - Video download API endpoints
backend/models/video.py - Video data models
backend/tests/unit/test_video_download_service.py - Unit tests
backend/tests/integration/test_video_api.py - Integration tests

Storage Structure:

data/
└── youtube-videos/
    ├── download_cache.json
    └── {video_id}/
        ├── video.mp4
        ├── audio.mp3
        ├── metadata.json
        └── thumbnail.jpg

Configuration

# Environment variables (.env)
VIDEO_STORAGE_DIR="data/youtube-videos"
MAX_STORAGE_SIZE_GB=10.0
VIDEO_DOWNLOAD_QUALITY="720p"  # or "best", "1080p", etc.
ENABLE_LOCAL_TRANSCRIPTION=true
KEEP_VIDEOS_AFTER_PROCESSING=true

Integration Points

Story 1.2 (URL Validation): Validates URLs before download
Story 1.3 (Transcript Extraction): Provides local files for transcription
Epic 2: AI summarization can work with higher quality local transcripts

Technical Requirements

yt-dlp: Already installed (/opt/homebrew/bin/yt-dlp)
FFmpeg: Required for audio extraction (check availability)
Disk Space: Minimum 10GB recommended for video storage
Python packages: yt-dlp, whisper (optional for transcription)

Security Considerations

Validate URLs to prevent arbitrary file downloads
Sanitize file names to prevent path traversal
Implement rate limiting on download endpoints
Consider user authentication for future multi-user scenarios

Performance Optimization

Download videos asynchronously using background tasks
Implement concurrent download limits to prevent resource exhaustion
Use streaming for large files to reduce memory usage
Cache frequently accessed videos for faster retrieval

Change Log

Date	Version	Description	Author
2025-01-26	1.0	Initial story creation	Claude Code

QA Notes

Testing Approach

Use small, Creative Commons licensed videos for testing
Mock yt-dlp for unit tests to avoid actual downloads
Test with various video formats and qualities
Verify cleanup doesn't delete actively used files

Edge Cases to Test

Network interruptions during download
Videos that exceed storage limits
Private or deleted videos
Concurrent downloads of the same video
Disk full scenarios

Performance Benchmarks

Download speed: Should achieve >1MB/s for typical videos
Audio extraction: < 30 seconds for 10-minute video
Storage cleanup: < 5 seconds for 100 videos
Cache lookup: < 100ms

Story Status: Draft - Ready for review and implementation
Dependencies: Stories 1.1 ✅ (Setup), 1.2 ✅ (URL Validation)
Blocks: Can enhance Story 1.3 (Transcript Extraction) functionality
Priority: High - Foundational capability for offline processing

6.2 KiB Raw Permalink Blame History