# Story 1.3: Transcript Extraction Service

## Status

Draft

## Story

As a user,
I want the system to automatically extract transcripts from YouTube videos,
so that I have the text content needed for AI summarization.

## Acceptance Criteria
- System extracts video transcripts using multiple fallback methods (YouTube Transcript API → auto-captions → audio transcription)
- Transcripts are cached to avoid repeated API calls for the same video
- Multiple language support with preference for English when available
- Failed transcript extraction returns informative error messages with suggested solutions
- System handles videos with no available transcripts gracefully
- Transcript extraction is non-blocking and provides progress feedback
## Tasks / Subtasks

- Task 1: Primary Transcript Extraction (AC: 1, 2)
  - Create `TranscriptService` class in `backend/services/transcript_service.py`
  - Implement YouTube Transcript API integration with retry logic
  - Add transcript caching with video ID-based keys
  - Implement multi-language transcript detection and prioritization
- Task 2: Fallback Transcript Methods (AC: 1, 5)
  - Integrate auto-generated caption extraction as the secondary method
  - Implement an audio transcription fallback using the OpenAI Whisper API
  - Create fallback chain orchestration with error handling
  - Add logging for fallback method usage and success rates
- Task 3: Transcript Processing Pipeline (AC: 3, 4, 6)
  - Create transcript cleaning and formatting utilities
  - Implement timestamp preservation for chapter creation
  - Add text chunking for large transcripts (token limit management)
  - Create progress tracking for the multi-step extraction process
- Task 4: API Integration (AC: 4, 6)
  - Create `GET /api/transcripts/{video_id}` endpoint
  - Implement background transcript extraction with job status tracking
  - Add WebSocket support for real-time progress updates
  - Create a comprehensive error response system with recovery suggestions
- Task 5: Cache Management (AC: 2)
  - Implement Redis-based transcript caching with a 24-hour TTL
  - Add cache warming for popular videos
  - Create a cache invalidation strategy for updated transcripts
  - Add cache analytics and hit-rate monitoring
- Task 6: Integration Testing (AC: 1, 2, 3, 4, 5, 6)
  - Test transcript extraction across different video types and lengths
  - Verify the fallback chain handles edge cases (private videos, no captions, etc.)
  - Test caching behavior and cache invalidation
  - Validate error handling and user-facing error messages
## Dev Notes

### Architecture Context

This story implements the core content extraction layer that bridges YouTube's video platform and our AI summarization engine. The transcript service is the foundation for all downstream AI processing, so it must be robust, efficient, and user-friendly.
### Transcript Service Implementation Requirements

[Source: docs/architecture.md#backend-services]

```python
class TranscriptService:
    def __init__(self, cache_client: CacheClient, whisper_client: WhisperClient):
        self.youtube_api = YouTubeTranscriptApi()
        self.cache_client = cache_client
        self.whisper_client = whisper_client

    async def extract_transcript(self, video_id: str, language_preference: str = "en") -> TranscriptResult:
        """Extract transcript using the fallback chain, with caching."""
        # Check cache first
        cache_key = f"transcript:{video_id}:{language_preference}"
        cached_result = await self.cache_client.get(cache_key)
        if cached_result:
            return TranscriptResult.from_cache(cached_result)

        # Primary method: YouTube Transcript API
        try:
            transcript = await self._extract_youtube_transcript(video_id, language_preference)
            await self.cache_client.set(cache_key, transcript, ttl=86400)  # 24 hours
            return TranscriptResult(transcript=transcript, method="youtube_api", success=True)
        except TranscriptNotAvailableError:
            pass

        # Fallback 1: auto-generated captions
        try:
            transcript = await self._extract_auto_captions(video_id, language_preference)
            await self.cache_client.set(cache_key, transcript, ttl=86400)
            return TranscriptResult(transcript=transcript, method="auto_captions", success=True)
        except CaptionsNotAvailableError:
            pass

        # Fallback 2: audio transcription with Whisper
        try:
            transcript = await self._transcribe_audio(video_id, language_preference)
            await self.cache_client.set(cache_key, transcript, ttl=86400)
            return TranscriptResult(transcript=transcript, method="whisper_audio", success=True)
        except AudioTranscriptionError as e:
            return TranscriptResult(
                transcript=None,
                method="failed",
                success=False,
                error=TranscriptExtractionError(
                    message="Unable to extract transcript from video",
                    error_code=ErrorCode.TRANSCRIPT_UNAVAILABLE,
                    details={
                        "video_id": video_id,
                        "attempted_methods": ["youtube_api", "auto_captions", "whisper_audio"],
                        "last_error": str(e),
                        "suggestions": [
                            "Try a different video with captions available",
                            "Check if video is public and accessible",
                            "Contact support if this video should have transcripts",
                        ],
                    },
                ),
            )
```
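`TranscriptResult` is used throughout this story but never defined. A minimal sketch of a shape consistent with the usage above; the field names are inferred, and note that the service both calls `TranscriptResult.from_cache(...)` and reads `result.from_cache` as a flag, which this sketch reconciles by letting the instance attribute shadow the classmethod:

```python
from typing import Any, Optional

class TranscriptResult:
    """Hypothetical result container matching how TranscriptService uses it."""

    def __init__(self, transcript: Optional[str] = None, method: str = "unknown",
                 success: bool = False, language: str = "en",
                 error: Optional[Any] = None, cached: bool = False):
        self.transcript = transcript
        self.method = method
        self.success = success
        self.language = language
        self.error = error
        # The instance attribute shadows the classmethod of the same name, so
        # `result.from_cache` reads as a bool on instances while
        # `TranscriptResult.from_cache(...)` still works on the class.
        self.from_cache = cached

    @classmethod
    def from_cache(cls, cached_transcript: str) -> "TranscriptResult":
        # Cached entries skip the extraction chain entirely
        return cls(transcript=cached_transcript, method="cache",
                   success=True, cached=True)
```

A cleaner long-term design would rename one of the two `from_cache` uses; the shadowing here only exists to keep both call sites in the spec working.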
### Transcript Processing Requirements

[Source: docs/architecture.md#data-processing]

**Transcript Cleaning and Formatting:**

```python
class TranscriptProcessor:
    def clean_transcript(self, raw_transcript: List[Dict]) -> str:
        """Clean and format raw transcript data."""
        # Remove duplicate segments
        # Fix common OCR/speech recognition errors
        # Standardize punctuation and formatting
        # Preserve meaningful timestamps

    def chunk_transcript(self, transcript: str, max_tokens: int = 3000) -> List[TranscriptChunk]:
        """Split transcript into manageable chunks for AI processing."""
        # Split on sentence boundaries
        # Preserve context across chunks
        # Include timestamp ranges for each chunk
        # Ensure chunks don't exceed token limits

    def extract_metadata(self, transcript: str) -> TranscriptMetadata:
        """Extract useful metadata from transcript."""
        return TranscriptMetadata(
            word_count=len(transcript.split()),
            estimated_reading_time=self.calculate_reading_time(transcript),
            language_detected=self.detect_language(transcript),
            topics=self.extract_topics(transcript),
            speakers_detected=self.detect_speakers(transcript),
        )
```
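The `chunk_transcript` contract above can be sketched as a greedy sentence-boundary splitter. This is illustrative only: the words-per-token ratio is a rough heuristic, not the project's actual tokenizer, and the sentence regex is a simplification:

```python
import re
from typing import List

def chunk_by_sentences(text: str, max_tokens: int = 3000) -> List[str]:
    """Greedily pack whole sentences into chunks under a token budget.

    Approximates tokens as ~0.75 words; a real implementation would use
    the target model's tokenizer instead.
    """
    max_words = int(max_tokens * 0.75)
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            # Budget exceeded: close the current chunk on a sentence boundary
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because splits only happen between sentences, each chunk stays locally coherent; preserving timestamp ranges per chunk would layer on top of this by tracking segment offsets.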
### Error Handling Requirements

[Source: docs/architecture.md#error-handling]

**Transcript-Specific Exceptions:**

```python
class TranscriptExtractionError(BaseAPIException):
    """Base exception for transcript extraction failures."""
    pass

class TranscriptNotAvailableError(TranscriptExtractionError):
    """No transcript available through any method."""
    def __init__(self, video_id: str, attempted_methods: List[str]):
        super().__init__(
            message=f"No transcript available for video {video_id}",
            error_code=ErrorCode.TRANSCRIPT_UNAVAILABLE,
            status_code=status.HTTP_404_NOT_FOUND,
            details={
                "video_id": video_id,
                "attempted_methods": attempted_methods,
                "recovery_suggestions": [
                    "Check if video has captions enabled",
                    "Try a different video",
                    "Contact video owner to enable captions",
                ],
            },
        )

class RateLimitExceededError(TranscriptExtractionError):
    """API rate limit exceeded for transcript service."""
    pass

class AudioTranscriptionError(TranscriptExtractionError):
    """Audio transcription failed."""
    pass
```
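These exceptions assume `BaseAPIException` and `ErrorCode` from the architecture doc, which are not reproduced in this story. A minimal stand-in consistent with the constructor calls above might look like the following (the enum members and defaults are assumptions):

```python
from enum import Enum
from typing import Any, Dict, Optional

class ErrorCode(str, Enum):
    TRANSCRIPT_UNAVAILABLE = "TRANSCRIPT_UNAVAILABLE"
    RATE_LIMIT_EXCEEDED = "RATE_LIMIT_EXCEEDED"

class BaseAPIException(Exception):
    """Carries a machine-readable code, an HTTP status, and structured details."""

    def __init__(self, message: str, error_code: Optional[ErrorCode] = None,
                 status_code: int = 500, details: Optional[Dict[str, Any]] = None):
        super().__init__(message)
        self.message = message
        self.error_code = error_code
        self.status_code = status_code
        self.details = details or {}
```

Keeping `message`, `error_code`, and `details` as attributes is what lets the endpoint at `/transcripts/{video_id}` build its error payload directly from a caught exception.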
### API Endpoint Specification

[Source: docs/architecture.md#api-specification]

**Request/Response Models:**

```python
class TranscriptRequest(BaseModel):
    video_id: str = Field(..., description="YouTube video ID")
    language_preference: str = Field("en", description="Preferred transcript language")
    include_metadata: bool = Field(True, description="Include transcript metadata")

class TranscriptResponse(BaseModel):
    video_id: str
    transcript: Optional[str] = None
    metadata: Optional[TranscriptMetadata] = None
    extraction_method: str  # "youtube_api", "auto_captions", "whisper_audio"
    language: str
    word_count: int
    cached: bool
    processing_time_seconds: float
    error: Optional[Dict[str, Any]] = None
```

**Endpoint Implementation:**

```python
@router.get("/transcripts/{video_id}", response_model=TranscriptResponse)
async def get_transcript(
    video_id: str,
    language_preference: str = "en",
    include_metadata: bool = True,
    transcript_service: TranscriptService = Depends(),
):
    start_time = time.time()
    try:
        result = await transcript_service.extract_transcript(video_id, language_preference)
        response_data = {
            "video_id": video_id,
            "transcript": result.transcript,
            "extraction_method": result.method,
            "language": result.language,
            "word_count": len(result.transcript.split()) if result.transcript else 0,
            "cached": result.from_cache,
            "processing_time_seconds": time.time() - start_time,
        }
        if include_metadata and result.transcript:
            response_data["metadata"] = transcript_service.extract_metadata(result.transcript)
        return TranscriptResponse(**response_data)
    except TranscriptExtractionError as e:
        return TranscriptResponse(
            video_id=video_id,
            extraction_method="failed",
            language=language_preference,
            word_count=0,
            cached=False,
            processing_time_seconds=time.time() - start_time,
            error={
                "code": e.error_code,
                "message": e.message,
                "details": e.details,
            },
        )
```
### Background Job Implementation

[Source: docs/architecture.md#background-processing]

**Async Transcript Extraction:**

```python
@router.post("/transcripts/extract", response_model=JobResponse)
async def extract_transcript_async(
    request: TranscriptRequest,
    background_tasks: BackgroundTasks,
    transcript_service: TranscriptService = Depends(),
):
    job_id = str(uuid.uuid4())
    # Start background extraction
    background_tasks.add_task(
        extract_transcript_job,
        job_id=job_id,
        video_id=request.video_id,
        language_preference=request.language_preference,
        transcript_service=transcript_service,
    )
    return JobResponse(
        job_id=job_id,
        status="processing",
        message="Transcript extraction started",
    )

@router.get("/transcripts/jobs/{job_id}", response_model=JobStatusResponse)
async def get_extraction_status(job_id: str):
    # Check job status in cache/database
    # Return progress updates via WebSocket if available
    pass
```
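`extract_transcript_job` is registered with `background_tasks` but not defined in this story. A minimal sketch, assuming an in-memory job store for illustration; a real deployment would persist status in Redis or the database so multiple workers and the status endpoint can see it:

```python
from typing import Any, Dict

# In-memory stand-in for the job status store (illustrative assumption)
JOB_STORE: Dict[str, Dict[str, Any]] = {}

async def extract_transcript_job(job_id: str, video_id: str,
                                 language_preference: str,
                                 transcript_service) -> None:
    """Run the extraction chain and record its outcome under the job ID."""
    JOB_STORE[job_id] = {"status": "processing", "video_id": video_id}
    try:
        result = await transcript_service.extract_transcript(video_id, language_preference)
        JOB_STORE[job_id] = {
            "status": "completed" if result.success else "failed",
            "video_id": video_id,
            "method": result.method,
        }
    except Exception as exc:
        # Surface unexpected failures to the job status endpoint
        JOB_STORE[job_id] = {"status": "failed", "video_id": video_id, "error": str(exc)}
```

With this shape, `get_extraction_status` reduces to a lookup in the store (404 if the job ID is unknown).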
### Cache Strategy Implementation

[Source: docs/architecture.md#caching-strategy]

**Multi-Level Caching:**

```python
class TranscriptCacheManager:
    def __init__(self, redis_client: RedisClient, db_session: Session):
        self.redis = redis_client
        self.db = db_session

    async def get_cached_transcript(self, video_id: str, language: str) -> Optional[str]:
        # Level 1: Redis cache (fast, temporary)
        cache_key = f"transcript:{video_id}:{language}"
        cached = await self.redis.get(cache_key)
        if cached:
            return cached  # Stored as a plain string, so no deserialization needed

        # Level 2: Database cache (persistent)
        db_transcript = self.db.query(CachedTranscript).filter(
            CachedTranscript.video_id == video_id,
            CachedTranscript.language == language,
            CachedTranscript.expires_at > datetime.utcnow(),
        ).first()
        if db_transcript:
            # Warm the Redis cache
            await self.redis.setex(cache_key, 86400, db_transcript.content)
            return db_transcript.content
        return None

    async def cache_transcript(self, video_id: str, language: str, transcript: str):
        cache_key = f"transcript:{video_id}:{language}"
        # Cache in Redis (24 hours)
        await self.redis.setex(cache_key, 86400, transcript)
        # Cache in database (7 days)
        db_transcript = CachedTranscript(
            video_id=video_id,
            language=language,
            content=transcript,
            created_at=datetime.utcnow(),
            expires_at=datetime.utcnow() + timedelta(days=7),
        )
        self.db.add(db_transcript)
        self.db.commit()
```
### File Locations and Structure

[Source: docs/architecture.md#project-structure]

**Backend Files:**

- `backend/services/transcript_service.py` - Main transcript extraction service
- `backend/services/transcript_processor.py` - Transcript cleaning and processing
- `backend/services/cache_manager.py` - Multi-level caching implementation
- `backend/api/transcripts.py` - Transcript API endpoints
- `backend/core/exceptions.py` - Updated with transcript-specific exceptions
- `backend/models/transcript.py` - Transcript data models
- `backend/tests/unit/test_transcript_service.py` - Unit tests
- `backend/tests/integration/test_transcript_api.py` - Integration tests
## Testing Standards

### Backend Unit Tests

[Source: docs/architecture.md#testing-strategy]

Test File: `backend/tests/unit/test_transcript_service.py`

```python
import pytest
from unittest.mock import patch

class TestTranscriptService:
    @pytest.mark.asyncio
    async def test_extract_transcript_success(self):
        """Test successful transcript extraction."""
        service = TranscriptService(mock_cache, mock_whisper)
        # Mock a successful YouTube API response
        with patch.object(service, "_extract_youtube_transcript") as mock_extract:
            mock_extract.return_value = "Sample transcript content"
            result = await service.extract_transcript("dQw4w9WgXcQ")
            assert result.success is True
            assert result.transcript == "Sample transcript content"
            assert result.method == "youtube_api"

    @pytest.mark.asyncio
    async def test_fallback_chain(self):
        """Test the fallback chain when the primary method fails."""
        service = TranscriptService(mock_cache, mock_whisper)
        # Mock YouTube API failure, auto-captions success
        with patch.object(service, "_extract_youtube_transcript") as mock_yt:
            mock_yt.side_effect = TranscriptNotAvailableError("dQw4w9WgXcQ", ["youtube_api"])
            with patch.object(service, "_extract_auto_captions") as mock_auto:
                mock_auto.return_value = "Auto-generated transcript"
                result = await service.extract_transcript("dQw4w9WgXcQ")
                assert result.success is True
                assert result.method == "auto_captions"

    @pytest.mark.asyncio
    async def test_cache_hit(self):
        """Test transcript retrieval from the cache."""
        cache = MockCache()
        await cache.set("transcript:dQw4w9WgXcQ:en", "Cached transcript")
        service = TranscriptService(cache, mock_whisper)
        result = await service.extract_transcript("dQw4w9WgXcQ")
        assert result.from_cache is True
        assert result.transcript == "Cached transcript"
```
### Integration Tests

[Source: docs/architecture.md#testing-strategy]

Test File: `backend/tests/integration/test_transcript_api.py`

```python
class TestTranscriptAPI:
    def test_get_transcript_endpoint(self):
        """Test the transcript retrieval endpoint."""
        response = client.get("/api/transcripts/dQw4w9WgXcQ")
        assert response.status_code == 200
        data = response.json()
        assert "transcript" in data
        assert "extraction_method" in data
        assert "processing_time_seconds" in data

    def test_async_extraction(self):
        """Test background transcript extraction."""
        # Start async extraction
        response = client.post("/api/transcripts/extract", json={
            "video_id": "dQw4w9WgXcQ",
            "language_preference": "en",
        })
        assert response.status_code == 200
        job_id = response.json()["job_id"]
        # Check job status
        status_response = client.get(f"/api/transcripts/jobs/{job_id}")
        assert status_response.status_code == 200
        assert status_response.json()["status"] in ["processing", "completed"]
```
## Performance Optimization

- **Caching Strategy:** Multi-level caching targets a 90%+ reduction in repeat API calls for popular videos
- **Async Processing:** Non-blocking extraction prevents UI freezing
- **Smart Fallbacks:** The fastest methods are tried first; expensive audio transcription runs last
- **Token Management:** Transcript chunking keeps requests under AI model token limits
- **Rate Limiting:** Exponential backoff handles API rate limits
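The exponential-backoff behavior described above can be sketched generically; the delay constants, jitter range, and the placeholder exception type are illustrative assumptions, not project constants:

```python
import asyncio
import random

class RetryableError(Exception):
    """Placeholder for rate-limit errors such as RateLimitExceededError."""

async def with_backoff(operation, max_attempts: int = 4,
                       base_delay: float = 1.0, retry_on=RetryableError):
    """Retry an async operation, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return await operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; let the caller handle the error
            # Exponential delay plus a little jitter to avoid thundering herds
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Wrapping `_extract_youtube_transcript` in something like `with_backoff` would satisfy the "retry logic" subtask in Task 1 without coupling the retry policy to any one extraction method.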
## Security Considerations

- **API Key Management:** All external API keys stored securely in environment variables
- **Input Validation:** Video ID format validation before processing
- **Rate Limiting:** Per-IP limits to prevent abuse of transcript extraction
- **Content Filtering:** Optional content filtering for inappropriate transcripts
- **Cache Security:** Encrypted cache keys and a secure Redis configuration
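Video ID validation from the list above can be a strict pattern check before any external call. The 11-character `[A-Za-z0-9_-]` shape is a widely used heuristic for YouTube IDs, not an officially documented guarantee:

```python
import re

# YouTube-style IDs: exactly 11 URL-safe base64 characters (heuristic)
VIDEO_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")

def is_valid_video_id(video_id: str) -> bool:
    """Reject anything that is not an 11-character YouTube-style ID."""
    return bool(VIDEO_ID_RE.fullmatch(video_id))
```

Validating before use also neutralizes path-traversal or injection attempts smuggled in through the `video_id` path parameter.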
## Change Log
| Date | Version | Description | Author |
|---|---|---|---|
| 2025-01-25 | 1.0 | Initial story creation | Bob (Scrum Master) |
## Dev Agent Record

This section will be populated by the development agent during implementation.

### Agent Model Used

To be filled by dev agent.

### Debug Log References

To be filled by dev agent.

### Completion Notes List

To be filled by dev agent.

### File List

To be filled by dev agent.

## QA Results

Results from the QA Agent review of the completed story implementation will be added here.