Story 4.6: RAG-Powered Video Chat with ChromaDB

Story Overview

Story ID: 4.6
Epic: 4 - Advanced Intelligence & Developer Platform
Title: RAG-Powered Video Chat with ChromaDB
Status: 📋 READY FOR IMPLEMENTATION
Priority: Medium

Goal: Implement a RAG (Retrieval-Augmented Generation) chatbot interface using ChromaDB for semantic search, enabling users to have interactive Q&A conversations with video content, with precise timestamp source references.

Value Proposition: Transform passive video consumption into interactive content exploration, allowing users to ask specific questions about video content and receive precise answers with exact timestamp references for verification.

Dependencies:

  • Story 4.4 (Custom AI Models) for AI service infrastructure
  • Existing ChromaDB integration patterns from /tests/framework-comparison/test_langgraph_chromadb.py
  • Transcript extraction system

Estimated Effort: 20 hours

Technical Requirements

Core Features

1. ChromaDB Vector Database

  • Semantic Transcript Chunking: Split transcripts into meaningful chunks with overlap
  • Embedding Storage: Generate and store embeddings for all transcript segments
  • Metadata Preservation: Maintain timestamp, video ID, and section information
  • Vector Search: Semantic similarity search across transcript content
  • Collection Management: Organize embeddings by video, user, or topic

2. RAG Implementation

  • Context Retrieval: Fetch relevant transcript chunks based on user questions
  • Retrieval Patterns: Reuse existing ChromaDB retrieval patterns from /tests/framework-comparison/
  • Context Window: Optimize context size for AI model limits
  • Relevance Scoring: Rank retrieved chunks by semantic relevance
  • Source Attribution: Maintain clear connection between chunks and timestamps

3. Chat Interface

  • Real-time Q&A: Interactive chat interface for video-specific questions
  • Timestamp References: Every response includes source timestamps like [00:05:23]
  • DeepSeek Integration: AI responses using DeepSeek models (no Anthropic per user requirements)
  • Context Awareness: Maintain conversation context and follow-up questions
  • Visual Design: Clean chat interface integrated with video summary page

4. Enhanced Features

  • Follow-up Suggestions: AI-generated follow-up questions based on content
  • Conversation History: Persistent chat sessions linked to video summaries
  • Export Conversations: Save Q&A sessions as part of video documentation
  • Multi-Video Chat: Ask questions across multiple videos in a playlist

Technical Architecture

RAG System Components

class RAGService:
    def __init__(self):
        self.vector_db = ChromaVectorDB()
        self.embeddings = HuggingFaceEmbeddings()  # Local embeddings
        self.ai_service = DeepSeekService()
        self.chunk_processor = TranscriptChunker()

    async def process_video_for_rag(self, video_id: str, transcript: str) -> bool:
        # Chunk transcript into semantic segments
        # Generate embeddings for each chunk
        # Store in ChromaDB with metadata
        # Return success status
        ...

    async def ask_question(self, video_id: str, question: str, chat_history: List[ChatMessage]) -> ChatResponse:
        # Retrieve relevant chunks using semantic search
        # Build context from retrieved chunks
        # Generate response with DeepSeek
        # Format response with timestamp references
        ...
Database Schema Extensions

-- Chat sessions for persistent conversations
CREATE TABLE chat_sessions (
    id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(id),
    video_id VARCHAR(20),
    summary_id UUID REFERENCES summaries(id),
    session_name VARCHAR(200),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    total_messages INTEGER DEFAULT 0,
    is_active BOOLEAN DEFAULT TRUE
);

-- Individual chat messages
CREATE TABLE chat_messages (
    id UUID PRIMARY KEY,
    session_id UUID REFERENCES chat_sessions(id),
    message_type VARCHAR(20), -- 'user', 'assistant', 'system'
    content TEXT,
    sources JSONB, -- Array of {chunk_id, timestamp, relevance_score}
    processing_time_seconds FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Transcript chunk metadata for RAG (embeddings themselves live in ChromaDB)
CREATE TABLE video_chunks (
    id UUID PRIMARY KEY,
    video_id VARCHAR(20),
    chunk_index INTEGER,
    chunk_text TEXT,
    start_timestamp INTEGER, -- seconds
    end_timestamp INTEGER,
    word_count INTEGER,
    embedding_id VARCHAR(100), -- ChromaDB document ID
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- RAG performance tracking
CREATE TABLE rag_analytics (
    id UUID PRIMARY KEY,
    video_id VARCHAR(20),
    question TEXT,
    retrieval_count INTEGER,
    relevance_scores JSONB,
    response_quality_score FLOAT,
    user_feedback INTEGER, -- 1-5 rating
    processing_time_seconds FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Implementation Tasks

Task 4.6.1: ChromaDB Vector Database Setup (6 hours)

Subtasks:

  1. ChromaDB Configuration (2 hours)

    • Set up ChromaDB client with persistent storage
    • Configure collections for video transcripts
    • Implement collection naming and organization strategy
    • Add cleanup and maintenance procedures
    • Test database initialization and connection
  2. Transcript Chunking Service (2 hours)

    • Create intelligent transcript segmentation algorithm
    • Implement overlapping chunks for context preservation
    • Extract meaningful chunk boundaries (sentence/paragraph breaks)
    • Preserve timestamp information in chunks
    • Handle various transcript formats and quality levels
  3. Embedding Generation and Storage (2 hours)

    • Integrate HuggingFace embeddings (sentence-transformers/all-MiniLM-L6-v2)
    • Generate embeddings for transcript chunks
    • Store embeddings with metadata in ChromaDB
    • Implement batch processing for large transcripts
    • Add progress tracking for embedding generation
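
The sketch below illustrates the ChromaDB setup and chunk storage covered by this task, assuming chromadb's PersistentClient, the built-in SentenceTransformer embedding function, and one collection per video; the collection naming and metadata fields are illustrative, not final.

import chromadb
from chromadb.utils import embedding_functions

# Persistent client stored under ./data/chromadb_rag/ (see Implementation Notes)
client = chromadb.PersistentClient(path="./data/chromadb_rag")

# Local HuggingFace embeddings, no external API calls
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

def store_chunks(video_id: str, chunks: list[dict]) -> None:
    """Store transcript chunks with timestamp metadata in a per-video collection."""
    collection = client.get_or_create_collection(
        name=f"video_{video_id}",
        embedding_function=embedding_fn,
    )
    collection.add(
        ids=[f"{video_id}_{c['chunk_index']}" for c in chunks],
        documents=[c["chunk_text"] for c in chunks],
        metadatas=[
            {
                "video_id": video_id,
                "chunk_index": c["chunk_index"],
                "start_timestamp": c["start_timestamp"],
                "end_timestamp": c["end_timestamp"],
            }
            for c in chunks
        ],
    )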

Task 4.6.2: RAG Retrieval System (8 hours)

Subtasks:

  1. Semantic Search Implementation (3 hours)

    • Implement similarity search across video chunks
    • Add relevance scoring and ranking algorithms
    • Configure search parameters (number of results, similarity threshold)
    • Handle edge cases (no relevant chunks, low similarity scores)
    • Test search quality with various question types
  2. Context Building Service (2 hours)

    • Aggregate retrieved chunks into coherent context
    • Implement context window management for AI models
    • Preserve chunk ordering and timestamp information
    • Add context summarization for long retrievals
    • Handle overlapping chunks and deduplication
  3. Source Attribution System (2 hours)

    • Link retrieved chunks to specific timestamps
    • Generate clickable timestamp references [00:05:23]
    • Create YouTube deep links for timestamp navigation
    • Implement source verification and quality checks
    • Add confidence scoring for source attribution
  4. RAG Response Generation (1 hour)

    • Integrate DeepSeek AI service for response generation
    • Create RAG-specific prompts with context and question
    • Format responses with proper source citations
    • Handle cases where no relevant context is found
    • Add response quality validation
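
A minimal sketch of retrieval with source attribution for this task, reusing the client and embedding function assumed in the Task 4.6.1 sketch; the distance-to-relevance conversion and deep-link format are simplifications to refine during implementation.

import chromadb
from chromadb.utils import embedding_functions

# Same persistent client and local embedding function as in the Task 4.6.1 sketch
client = chromadb.PersistentClient(path="./data/chromadb_rag")
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

def format_timestamp(seconds: int) -> str:
    """Render seconds as [HH:MM:SS] for source references."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"[{h:02d}:{m:02d}:{s:02d}]"

def retrieve_sources(video_id: str, question: str, max_sources: int = 5) -> list[dict]:
    """Semantic search over a video's chunks, returning text, timestamps, and scores."""
    collection = client.get_or_create_collection(
        name=f"video_{video_id}", embedding_function=embedding_fn
    )
    results = collection.query(query_texts=[question], n_results=max_sources)
    sources = []
    for doc, meta, distance in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        sources.append({
            "chunk_text": doc,
            "chunk_index": meta["chunk_index"],
            "timestamp": meta["start_timestamp"],
            "timestamp_formatted": format_timestamp(meta["start_timestamp"]),
            "youtube_link": f"https://youtu.be/{video_id}?t={meta['start_timestamp']}",
            "relevance_score": 1.0 - distance,  # rough similarity derived from distance
        })
    return sources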

Task 4.6.3: Chat Interface Implementation (4 hours)

Subtasks:

  1. Chat Frontend Component (2 hours)

    • Create interactive chat interface with message history
    • Implement typing indicators and loading states
    • Add timestamp link rendering and click handling
    • Design responsive chat layout for video summary pages
    • Add keyboard shortcuts and accessibility features
  2. Chat Session Management (1 hour)

    • Implement persistent chat sessions linked to videos
    • Add session creation, saving, and loading
    • Create chat session list and management interface
    • Handle session state and conversation context
    • Add session export and sharing functionality
  3. Follow-up Question System (1 hour)

    • Generate AI-powered follow-up question suggestions
    • Base suggestions on video content and conversation context
    • Display suggested questions as clickable options
    • Track suggestion effectiveness and user engagement
    • Add customizable suggestion preferences
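
One possible shape for the follow-up suggestion generator, assuming DeepSeek's OpenAI-compatible chat completions endpoint; the prompt wording, model name, and response parsing are placeholders.

from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible chat completions endpoint (assumed configuration)
deepseek = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

def suggest_follow_ups(context: str, last_answer: str, count: int = 3) -> list[str]:
    """Generate short follow-up questions grounded in the retrieved video context."""
    prompt = (
        f"Video excerpt:\n{context}\n\n"
        f"Answer just given to the user:\n{last_answer}\n\n"
        f"Suggest {count} short follow-up questions the user might ask next, one per line."
    )
    response = deepseek.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()][:count]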

Task 4.6.4: API Integration and Enhancement (2 hours)

Subtasks:

  1. RAG API Endpoints (1 hour)

    • POST /api/rag/chat/{video_id} - Ask question about specific video
    • GET /api/rag/sessions/{user_id} - Get user's chat sessions
    • POST /api/rag/sessions/{session_id}/export - Export conversation
    • GET /api/rag/suggestions/{video_id} - Get follow-up suggestions
    • Add comprehensive error handling and validation
  2. Performance Optimization (1 hour)

    • Implement caching for frequent questions and responses
    • Add batch processing for multiple questions
    • Optimize ChromaDB queries and connection management
    • Add response streaming for long AI responses
    • Monitor and optimize response times and resource usage
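
A hedged sketch of the chat endpoint wiring, assuming a FastAPI backend plus the RAGService and Pydantic models defined in this story; the import paths and error handling are illustrative only.

from fastapi import APIRouter, HTTPException

# ChatRequest/ChatResponse are the Pydantic models below; module paths are hypothetical
from app.models.rag import ChatRequest, ChatResponse
from app.services.rag import RAGService

router = APIRouter(prefix="/api/rag")
rag_service = RAGService()

@router.post("/chat/{video_id}", response_model=ChatResponse)
async def chat_with_video(video_id: str, request: ChatRequest) -> ChatResponse:
    """Answer a question about a specific video using retrieved transcript chunks."""
    if request.video_id != video_id:
        raise HTTPException(status_code=400, detail="video_id in path and body must match")
    try:
        return await rag_service.ask_question(
            video_id=video_id,
            question=request.question,
            chat_history=[],  # the full implementation loads history via request.session_id
        )
    except Exception as exc:  # narrow the exception types in the real implementation
        raise HTTPException(status_code=500, detail=str(exc)) from exc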

Data Models

RAG Chat Models

from pydantic import BaseModel
from typing import List, Dict, Optional, Any
from datetime import datetime
from enum import Enum

class MessageType(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"
    SYSTEM = "system"

class SourceReference(BaseModel):
    chunk_id: str
    timestamp: int  # seconds
    timestamp_formatted: str  # [HH:MM:SS]
    youtube_link: str
    chunk_text: str
    relevance_score: float

class ChatMessage(BaseModel):
    id: str
    message_type: MessageType
    content: str
    sources: List[SourceReference]
    processing_time_seconds: float
    created_at: datetime

class ChatSession(BaseModel):
    id: str
    user_id: str
    video_id: str
    summary_id: str
    session_name: str
    messages: List[ChatMessage]
    total_messages: int
    is_active: bool
    created_at: datetime
    updated_at: datetime

class ChatRequest(BaseModel):
    video_id: str
    question: str
    session_id: Optional[str] = None
    include_context: bool = True
    max_sources: int = 5

class ChatResponse(BaseModel):
    session_id: str
    message: ChatMessage
    follow_up_suggestions: List[str]
    context_retrieved: bool
    total_chunks_searched: int

class RAGAnalytics(BaseModel):
    question: str
    retrieval_count: int
    relevance_scores: List[float]
    response_quality_score: float
    processing_time_seconds: float
    user_feedback: Optional[int] = None
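
As an illustration of how these models connect to retrieval, the helper below builds a SourceReference from a retrieved chunk dict as shaped in the Task 4.6.2 sketch; the deep-link format (YouTube's `t` query parameter) is an assumption to confirm during implementation.

def build_source_reference(video_id: str, chunk: dict) -> SourceReference:
    """Convert a retrieved chunk dict into a SourceReference with a clickable deep link."""
    start = chunk["timestamp"]
    return SourceReference(
        chunk_id=f"{video_id}_{chunk['chunk_index']}",
        timestamp=start,
        timestamp_formatted=chunk["timestamp_formatted"],
        youtube_link=f"https://www.youtube.com/watch?v={video_id}&t={start}s",
        chunk_text=chunk["chunk_text"],
        relevance_score=chunk["relevance_score"],
    )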

Testing Strategy

Unit Tests

  • ChromaDB Integration: Connection, storage, and retrieval operations
  • Transcript Chunking: Segmentation quality and metadata preservation
  • Embedding Generation: Vector quality and consistency
  • Semantic Search: Relevance and ranking accuracy
  • Source Attribution: Timestamp accuracy and link generation

Integration Tests

  • RAG Pipeline: End-to-end question answering workflow
  • Chat API: All chat and session management endpoints
  • Frontend Integration: Chat interface functionality and state management
  • Database Operations: Session and message persistence

Quality Assurance Tests

  • Answer Relevance: Semantic accuracy of responses to questions
  • Source Attribution: Timestamp precision and link functionality
  • Response Quality: Coherence and helpfulness of AI responses
  • Performance: Response time and resource usage under load

API Specification

RAG Chat Endpoints

/api/rag/chat/{video_id}:
  post:
    summary: Ask question about video content using RAG
    parameters:
      - name: video_id
        in: path
        required: true
        schema:
          type: string
    requestBody:
      required: true
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ChatRequest'
    responses:
      200:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ChatResponse'

/api/rag/sessions/{user_id}:
  get:
    summary: Get user's chat sessions
    parameters:
      - name: user_id
        in: path
        required: true
        schema:
          type: string
      - name: active_only
        in: query
        schema:
          type: boolean
          default: true
    responses:
      200:
        content:
          application/json:
            schema:
              type: array
              items:
                $ref: '#/components/schemas/ChatSession'

/api/rag/embeddings/{video_id}/generate:
  post:
    summary: Generate embeddings for video transcript
    parameters:
      - name: video_id
        in: path
        required: true
        schema:
          type: string
    responses:
      202:
        content:
          application/json:
            schema:
              type: object
              properties:
                job_id:
                  type: string
                status:
                  type: string
                estimated_completion:
                  type: string
                  format: date-time

Success Criteria

Functional Requirements

  • ChromaDB stores transcript embeddings with timestamp metadata
  • Semantic search retrieves relevant content chunks for user questions
  • Chat interface provides real-time Q&A with timestamp source references
  • DeepSeek AI generates contextual responses using retrieved chunks
  • Follow-up question suggestions based on video content
  • Persistent chat sessions linked to specific videos

Quality Requirements

  • Answer relevance >85% for factual questions about video content
  • Timestamp references accurate within 10-second tolerance
  • Source attribution clearly links responses to specific video segments
  • Response quality maintains conversation context across messages
  • Follow-up suggestions are relevant and engaging
  • Chat interface provides smooth user experience with loading states

Performance Requirements

  • Question answering response time under 8 seconds
  • ChromaDB search completes in under 2 seconds
  • Embedding generation processes 1-hour video in under 5 minutes
  • Chat interface supports concurrent conversations without degradation
  • Memory usage remains stable during long conversation sessions

Implementation Notes

ChromaDB Integration

  • Use existing patterns from /tests/framework-comparison/test_langgraph_chromadb.py
  • Implement HuggingFace embeddings for local processing (no API dependencies)
  • Configure persistent storage in ./data/chromadb_rag/ directory
  • Use collection per video or organized by user/topic as needed

Transcript Chunking Strategy

  • Create semantic chunks of 200-400 words with 50-word overlap
  • Preserve sentence boundaries and paragraph structure
  • Maintain timestamp ranges for each chunk
  • Include video context (title, channel) in chunk metadata
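
A minimal sketch of this chunking strategy, assuming the transcript arrives as timed segments (dicts with `text` and `start` keys); the 300-word target and 50-word overlap follow the guideline above.

def make_chunk(index: int, segments: list[dict], word_count: int) -> dict:
    """Assemble a chunk record covering the timestamp range of its segments."""
    return {
        "chunk_index": index,
        "chunk_text": " ".join(s["text"] for s in segments),
        "start_timestamp": int(segments[0]["start"]),
        "end_timestamp": int(segments[-1]["start"]),
        "word_count": word_count,
    }

def chunk_transcript(segments: list[dict], target_words: int = 300, overlap_words: int = 50) -> list[dict]:
    """Build overlapping chunks from timed transcript segments."""
    chunks, current, current_words, new_words = [], [], 0, 0
    for seg in segments:
        words = len(seg["text"].split())
        current.append(seg)
        current_words += words
        new_words += words
        if current_words >= target_words:
            chunks.append(make_chunk(len(chunks), current, current_words))
            # Keep roughly `overlap_words` of trailing text as overlap for the next chunk
            tail, tail_words = [], 0
            for s in reversed(current):
                tail.insert(0, s)
                tail_words += len(s["text"].split())
                if tail_words >= overlap_words:
                    break
            current, current_words, new_words = tail, tail_words, 0
    if new_words:  # flush any remaining segments not yet emitted
        chunks.append(make_chunk(len(chunks), current, current_words))
    return chunks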

RAG Response Pattern

  • Retrieve 3-5 most relevant chunks for context
  • Include source timestamps in response format: "According to the video at [00:05:23], ..."
  • Provide YouTube deep links for timestamp navigation
  • Handle cases where no relevant content is found gracefully
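
One possible prompt shape for this response pattern; the wording is illustrative and expected to be tuned against real transcripts.

RAG_PROMPT_TEMPLATE = """You are answering questions about a YouTube video using only the excerpts below.
Each excerpt is prefixed with its timestamp.

{context_blocks}

Question: {question}

Answer using only the excerpts. Cite timestamps inline, for example: "According to the video at [00:05:23], ...".
If the excerpts do not contain the answer, say so rather than guessing."""

def build_rag_prompt(question: str, sources: list[dict]) -> str:
    """Combine retrieved chunks and the user question into a single prompt."""
    context_blocks = "\n\n".join(
        f"{s['timestamp_formatted']} {s['chunk_text']}" for s in sources
    )
    return RAG_PROMPT_TEMPLATE.format(context_blocks=context_blocks, question=question)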

DeepSeek Integration

  • Use DeepSeek API for response generation (per user requirement: no Anthropic)
  • Configure appropriate model parameters for conversational responses
  • Implement cost tracking and usage monitoring
  • Add response quality scoring and feedback collection
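
A sketch of the response generation call with basic usage tracking, assuming DeepSeek's OpenAI-compatible client and standard chat-completions usage fields; the model name and parameters are placeholders.

from openai import OpenAI

deepseek = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

def generate_rag_answer(prompt: str) -> tuple[str, dict]:
    """Generate a conversational answer and return it with basic token-usage metrics."""
    response = deepseek.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You answer questions about video transcripts and always cite timestamps."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,  # keep answers close to the retrieved context
        max_tokens=1024,
    )
    usage = {
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
    }
    return response.choices[0].message.content, usage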

Risk Mitigation

High Risk: Answer Quality and Relevance

  • Risk: RAG responses may be generic or miss important context
  • Mitigation: Quality scoring, user feedback collection, continuous prompt optimization

Medium Risk: Timestamp Accuracy

  • Risk: Source timestamps may not accurately reflect quoted content
  • Mitigation: Chunk boundary validation, timestamp verification, user correction system

Medium Risk: Performance with Large Videos

  • Risk: Long videos may cause slow embedding generation and search
  • Mitigation: Batch processing, progress tracking, optimized chunking strategies

Story Owner: Development Team
Architecture Reference: BMad Method Epic-Story Structure
Implementation Status: Ready for Development
Last Updated: 2025-08-27