title: Research Session
query: YouTube metadata extraction using curl regex patterns 2024 current page structure title channel description duration
date: 8/30/2025
time: 7:51:55 AM
timestamp: 2025-08-30T11:51:55.698Z
exchanges: 1

Research Session

Initial Query

Question: YouTube metadata extraction using curl regex patterns 2024 current page structure title channel description duration

Response:

YouTube Metadata Extraction Using Curl and Regex Patterns (2024)

Introduction

Extracting metadata from YouTube videos programmatically is a valuable capability for many applications, including media analysis, content aggregation, and data enrichment. For the Trax platform, this capability could enhance the transcription service by automatically gathering contextual information about media files sourced from YouTube. This research explores how to extract key metadata elements (title, channel, description, and duration) from YouTube pages using curl and regex patterns, based on YouTube's current (2024) page structure.

The approach outlined here uses lightweight HTTP requests via curl combined with targeted regular expressions to parse the HTML and JavaScript content of YouTube pages. This method avoids the overhead of browser automation tools while still providing reliable extraction of essential metadata.

Current YouTube Page Structure (2024)

YouTube's frontend has evolved significantly over the years, moving from simple HTML to a complex, JavaScript-driven single-page application in which most content is loaded dynamically. As of 2024, however, the initial HTML response still carries the critical metadata, both in meta tags (kept for SEO and link previews) and in embedded JavaScript objects, which is what makes extraction with curl possible.

The key locations where metadata can be found include:

  1. Initial HTML Response: Contains basic metadata in meta tags and schema.org structured data (itemprop attributes)
  2. Embedded JavaScript Objects: Contains detailed video information in ytInitialData and ytInitialPlayerResponse objects
  3. Video Player Configuration: Contains technical details about the video including duration
  4. Description Section: The full description, which in the raw response is carried as a JSON string (for example, shortDescription) rather than as rendered HTML

Understanding these locations is crucial for crafting effective regex patterns that can reliably extract the desired information.
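As a minimal sketch of how the second location can be read, the Python snippet below finds the ytInitialPlayerResponse assignment in fetched HTML and decodes the object that follows with the standard-library JSON decoder instead of a regex. The function name extract_player_response and the sample HTML are illustrative assumptions, and the exact assignment syntax on a live page may differ.

```python
import json
import re

# Assumed marker: the page assigns the object as `ytInitialPlayerResponse = {...}`
_MARKER = re.compile(r"ytInitialPlayerResponse\s*=\s*")

def extract_player_response(html: str):
    """Return the embedded player response as a dict, or None if absent."""
    match = _MARKER.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

# Hand-written sample standing in for a fetched page (not real YouTube output)
sample = (
    '<script>var ytInitialPlayerResponse = '
    '{"videoDetails":{"videoId":"abc","title":"A \\"quoted\\" title",'
    '"lengthSeconds":"213"}};var other = 1;</script>'
)
response = extract_player_response(sample)
print(response["videoDetails"]["title"])
```

Decoding the whole object once and then reading fields from the resulting dictionary avoids most of the escaping problems that per-field regex patterns have to work around.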

Basic Curl Command for YouTube Pages

To begin extracting metadata, we need to fetch the YouTube page content. The following curl command provides a good starting point:

curl -s -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36" "https://www.youtube.com/watch?v=VIDEO_ID"

Key components:

  • -s: Silent mode to suppress progress meter
  • -A: User-Agent string to mimic a modern browser (important as YouTube may serve different content to different user agents)
  • The URL with the video ID parameter

This command returns the full HTML of the YouTube page, which we can then parse with regex patterns.
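For callers that prefer to stay inside Python, the same request can be sketched with the standard library. The helper names, the 11-character video ID check, and the placeholder ID below are assumptions made for illustration; only build_watch_request runs in the example, so no network access is needed to try it.

```python
import re
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
)
# Assumption: video IDs are 11 characters from the URL-safe base64 alphabet
VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")

def build_watch_request(video_id: str) -> urllib.request.Request:
    """Build (but do not send) the request that the curl command issues."""
    if not VIDEO_ID_RE.fullmatch(video_id):
        raise ValueError(f"unexpected video ID format: {video_id!r}")
    url = f"https://www.youtube.com/watch?v={video_id}"
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch_watch_page(video_id: str, timeout: float = 10.0) -> str:
    """Fetch the watch page HTML; performs a network request when called."""
    request = build_watch_request(video_id)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

# "abcdefghijk" is a placeholder ID, not a real video
request = build_watch_request("abcdefghijk")
print(request.full_url)
```

Validating the ID before building the URL also keeps arbitrary user input out of the request.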

Extracting Video Title

The video title can be extracted from multiple locations in the page. The most reliable approaches are:

Method 1: From meta tags

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | grep -o '<meta name="title" content="[^"]*"' | sed 's/<meta name="title" content="\(.*\)"/\1/'

Method 2: From ytInitialPlayerResponse

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | grep -o '"videoDetails":{[^}]*"title":"[^"]*"' | head -n 1 | sed 's/.*"title":"//;s/"$//'

Method 3: Using a more robust regex pattern

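curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | perl -0777 -ne 'print "$1\n" if /"videoDetails":\{.*?"title":"((?:[^"\\]|\\.)*)"/s'

The third method is the most robust of the three. It anchors on the videoDetails object so that unrelated "title" keys elsewhere in the page are skipped, and the (?:[^"\\]|\\.)* group consumes backslash escapes as pairs, so an escaped quote (\") inside the title does not end the match early. The -0777 switch makes Perl read the whole page as one record, so at most one title is printed. The captured value is still JSON-escaped (for example, \u0026 for &) and needs decoding before display.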
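The escape handling can be illustrated in Python as well. This is a sketch under the assumption that the title sits inside a videoDetails object in the page's embedded JSON; the pattern name TITLE_RE, the function extract_title, and the sample string are hypothetical. The capture is re-wrapped in quotes and handed to the JSON parser, which decodes \", \\ and \uXXXX in one step.

```python
import json
import re

# Capture a JSON string body: runs of characters that are neither a quote nor
# a backslash, or backslash escape pairs, so \" and \\ inside the title are
# consumed instead of ending the match.
TITLE_RE = re.compile(
    r'"videoDetails":\{.*?"title":"((?:[^"\\]|\\.)*)"', re.DOTALL
)

def extract_title(html: str):
    """Return the decoded video title, or None when no title is found."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Re-wrap the raw capture in quotes and let the JSON parser unescape it
    return json.loads('"' + match.group(1) + '"')

# Hand-written sample with an escaped quote, a \u escape and a trailing backslash
sample = r'{"videoDetails":{"videoId":"abc","title":"Say \"hi\" \u0026 bye \\","lengthSeconds":"5"}}'
print(extract_title(sample))
```

A title that ends in an escaped backslash, as in the sample, is the case that a simple "not preceded by a backslash" check gets wrong.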

Extracting Channel Information

Channel information is typically available in the embedded JavaScript objects:

Channel Name

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | perl -0777 -ne 'print "$1\n" if /"ownerChannelName":"((?:[^"\\]|\\.)*)"/'

Channel ID

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | perl -0777 -ne 'print "$1\n" if /"channelId":"([A-Za-z0-9_-]+)"/'

Channel URL

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | perl -0777 -ne 'print "$1\n" if /"channelUrl":"((?:[^"\\]|\\.)*)"/'

These keys are read from the JSON embedded in the page, chiefly the ytInitialPlayerResponse object, and each command prints only the first occurrence. If the channelUrl key is missing from a given response, the URL can be built from the channel ID as https://www.youtube.com/channel/CHANNEL_ID.
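The three channel lookups can be combined in one small Python helper. This is a sketch only: the helper names and the sample JSON (including its channel ID) are made up for illustration, and the channel URL is derived from the ID rather than scraped.

```python
import json
import re

def _json_string_field(key: str, text: str):
    """Return the decoded value of the first "key":"..." pair, or None."""
    pattern = re.compile('"' + re.escape(key) + r'":"((?:[^"\\]|\\.)*)"')
    match = pattern.search(text)
    return json.loads('"' + match.group(1) + '"') if match else None

def extract_channel(html: str) -> dict:
    """Collect channel name, ID and URL from the page's embedded JSON."""
    name = _json_string_field("ownerChannelName", html)
    channel_id = _json_string_field("channelId", html)
    # Build the channel URL from the ID instead of relying on a scraped key
    url = f"https://www.youtube.com/channel/{channel_id}" if channel_id else None
    return {"name": name, "id": channel_id, "url": url}

# Hand-written sample; the channel ID is a placeholder, not a real channel
sample = r'{"channelId":"UC_hypothetical_id_0001","ownerChannelName":"Caf\u00e9 \"Live\""}'
channel = extract_channel(sample)
print(channel)
```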

Extracting Video Description

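The video description is one of the more challenging elements to extract reliably, as it can span multiple lines and contain special characters, all of which appear as JSON escape sequences (\n, \", \\) in the page source:

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | perl -0777 -ne 'if (/"shortDescription":"((?:[^"\\]|\\.)*)"/) { $desc = $1; $desc =~ s/\\([n"\\\/])/$1 eq "n" ? "\n" : $1/ge; print "$desc\n"; }'

As a fallback for pages where shortDescription is missing, the description can also be read from the rendered-page data:

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | perl -0777 -ne 'if (/"description":\{"simpleText":"((?:[^"\\]|\\.)*)"\}/) { $desc = $1; $desc =~ s/\\([n"\\\/])/$1 eq "n" ? "\n" : $1/ge; print "$desc\n"; } elsif (/"description":\{"runs":\[(.*?)\]\}/s) { $runs = $1; while ($runs =~ /"text":"((?:[^"\\]|\\.)*)"/g) { $t = $1; $t =~ s/\\([n"\\\/])/$1 eq "n" ? "\n" : $1/ge; print $t; } print "\n"; }'

This pattern handles both simple text descriptions and the more complex "runs" format that YouTube uses for descriptions with formatting. The braces are escaped because recent Perl versions warn about or reject an unescaped { in a pattern, and the escapes are decoded in a single pass so that an escaped backslash followed by n is not mistaken for a newline. These key names have changed across frontend revisions, so treat the fallback as best-effort.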
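Once the surrounding JSON has been decoded, the two description shapes can be flattened without any regex. The sketch below assumes the node is either {"simpleText": ...} or {"runs": [{"text": ...}, ...]}; the function name flatten_text and the sample nodes are illustrative.

```python
def flatten_text(node) -> str:
    """Flatten a text node given as {"simpleText": ...} or {"runs": [...]}."""
    if not isinstance(node, dict):
        return ""
    if "simpleText" in node:
        return str(node["simpleText"])
    # Each run carries one fragment of the text; concatenating restores it
    return "".join(
        str(run.get("text", ""))
        for run in node.get("runs", [])
        if isinstance(run, dict)
    )

# Hand-written sample nodes in the two shapes
simple = {"simpleText": "Line one\nLine two"}
runs = {"runs": [{"text": "See "}, {"text": "https://example.com"}, {"text": " for details"}]}
print(flatten_text(simple))
print(flatten_text(runs))
```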

Extracting Video Duration

Video duration can be extracted from multiple locations:

Method 1: From meta tags

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | grep -o '<meta itemprop="duration" content="[^"]*"' | sed 's/<meta itemprop="duration" content="\(.*\)"/\1/'

Method 2: From ytInitialPlayerResponse

curl -s -A "Mozilla/5.0" "https://www.youtube.com/watch?v=VIDEO_ID" | perl -ne 'if (/"lengthSeconds":"(\d+)"/) { print "$1\n"; exit }'

The first method returns an ISO 8601 duration (for example, PT3M33S), while the second returns the duration in seconds as a plain integer, which is more convenient for programmatic processing. Either can be converted to a formatted duration as needed.
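Both duration forms can be normalized with a few lines of Python. This sketch assumes the meta tag holds a time-only ISO 8601 duration such as PT3M33S (no days or fractional seconds); the function names are illustrative.

```python
import re

_ISO_DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")

def iso_duration_to_seconds(value: str) -> int:
    """Convert a time-only ISO 8601 duration such as PT1H2M3S to seconds."""
    match = _ISO_DURATION.fullmatch(value)
    if match is None or value == "PT":
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds

def format_duration(total_seconds: int) -> str:
    """Format seconds as MM:SS, or HH:MM:SS once the value reaches an hour."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    if hours:
        return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    return f"{minutes:02d}:{seconds:02d}"

print(iso_duration_to_seconds("PT3M33S"))  # 213
print(format_duration(3725))               # 01:02:05
```

Minute values above 59 (for example, PT75M10S) are accepted as written, since the conversion simply multiplies each component.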

Comprehensive Extraction Script

Here's a comprehensive bash script that extracts all the required metadata elements:

#!/bin/bash

VIDEO_ID="$1"
USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

if [ -z "$VIDEO_ID" ]; then
    echo "Usage: $0 VIDEO_ID" >&2
    exit 1
fi

# Fetch the page content once and reuse it for every field
PAGE_CONTENT=$(curl -s -A "$USER_AGENT" "https://www.youtube.com/watch?v=$VIDEO_ID")

# Extract title (left JSON-escaped so it can be embedded in the JSON output as-is)
TITLE=$(printf '%s\n' "$PAGE_CONTENT" | perl -0777 -ne 'print $1 if /"videoDetails":\{.*?"title":"((?:[^"\\]|\\.)*)"/s')

# Extract channel name (also left JSON-escaped)
CHANNEL=$(printf '%s\n' "$PAGE_CONTENT" | perl -0777 -ne 'print $1 if /"ownerChannelName":"((?:[^"\\]|\\.)*)"/')

# Extract duration in seconds
DURATION_SEC=$(printf '%s\n' "$PAGE_CONTENT" | perl -0777 -ne 'print $1 if /"lengthSeconds":"(\d+)"/')

# Format duration (SECS rather than SECONDS, which is a special bash variable)
if [ -n "$DURATION_SEC" ]; then
    HOURS=$((DURATION_SEC / 3600))
    MINUTES=$(((DURATION_SEC % 3600) / 60))
    SECS=$((DURATION_SEC % 60))

    if [ "$HOURS" -gt 0 ]; then
        DURATION=$(printf "%02d:%02d:%02d" "$HOURS" "$MINUTES" "$SECS")
    else
        DURATION=$(printf "%02d:%02d" "$MINUTES" "$SECS")
    fi
else
    DURATION="Unknown"
fi

# Extract description; the capture stays JSON-escaped, so it can be placed
# between quotes in the output below without re-encoding
DESCRIPTION=$(printf '%s\n' "$PAGE_CONTENT" | perl -0777 -ne '
    if (/"shortDescription":"((?:[^"\\]|\\.)*)"/) {
        print $1;
    } elsif (/"description":\{"simpleText":"((?:[^"\\]|\\.)*)"\}/) {
        print $1;
    } elsif (/"description":\{"runs":\[(.*?)\]\}/s) {
        $runs = $1;
        while ($runs =~ /"text":"((?:[^"\\]|\\.)*)"/g) {
            print $1;
        }
    }
')

# Output results in JSON format; a missing duration becomes null so the
# document stays valid JSON
cat <<EOF
{
    "video_id": "$VIDEO_ID",
    "title": "$TITLE",
    "channel": "$CHANNEL",
    "duration": "$DURATION",
    "duration_seconds": ${DURATION_SEC:-null},
    "description": "$DESCRIPTION"
}
EOF

This script outputs the extracted metadata in JSON format, which can be easily parsed and integrated into other systems.
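When the extracted values have already been decoded to plain text, assembling the JSON by string interpolation is fragile, because any quote or newline in a field breaks the output. A minimal sketch of the safer route, assuming decoded inputs and using a hypothetical helper name, is to let a JSON serializer do the escaping:

```python
import json

def build_metadata_json(video_id, title, channel, duration_seconds, description):
    """Serialize extracted fields; missing values become JSON null."""
    record = {
        "video_id": video_id,
        "title": title,
        "channel": channel,
        "duration_seconds": int(duration_seconds) if duration_seconds else None,
        "description": description,
    }
    # ensure_ascii=False keeps non-ASCII titles readable in the output
    return json.dumps(record, ensure_ascii=False, indent=4)

# Placeholder values, not taken from a real video
output = build_metadata_json(
    "abcdefghijk", 'A "quoted" title', "Café", "213", "Line 1\nLine 2"
)
print(output)
```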

Handling Edge Cases and Potential Pitfalls

When extracting metadata from YouTube using regex patterns, several challenges may arise:

1. Escaped Characters

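YouTube titles, descriptions, and other text fields may contain escaped characters like \" or \\. A negative lookbehind such as (?<!\\)" skips escaped quotes, but it misreads a value that ends in an escaped backslash (\\"), and it has no effect when the preceding character class already excludes quotes. Matching the string body as (?:[^"\\]|\\.)* consumes escapes in pairs and avoids both problems.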

2. Rate Limiting and IP Blocking

Excessive requests to YouTube may trigger rate limiting or IP blocking. Implement appropriate delays between requests and consider using a rotating proxy service for production applications.
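A simple way to space out requests is a fixed pause between videos plus an exponentially growing wait after a failure. The sketch below is one possible shape, not a tuned policy: the function name, the default delays, and the retry count are assumptions, and the fetch and sleep callables are injected so the logic can be exercised without network access.

```python
import time

def fetch_with_backoff(fetch, video_ids, delay=1.0, max_retries=3, sleep=time.sleep):
    """Call fetch(video_id) for each ID with a pause between requests.

    A failed call is retried with exponentially growing waits; after
    max_retries failures the ID is recorded with a value of None.
    """
    results = {}
    for index, video_id in enumerate(video_ids):
        if index:
            sleep(delay)  # fixed pause between consecutive videos
        for attempt in range(max_retries):
            try:
                results[video_id] = fetch(video_id)
                break
            except OSError:
                # urllib's URLError and HTTPError are OSError subclasses
                if attempt < max_retries - 1:
                    sleep(delay * 2 ** (attempt + 1))
        else:
            results[video_id] = None  # every attempt failed
    return results

# Simulated fetcher that fails on the first call for each ID
calls = []
def flaky(video_id):
    calls.append(video_id)
    if calls.count(video_id) < 2:
        raise OSError("simulated failure")
    return f"<html>{video_id}</html>"

waits = []
pages = fetch_with_backoff(flaky, ["a", "b"], delay=1.0, sleep=waits.append)
print(waits)
```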

3. Page Structure Changes

YouTube frequently updates its page structure. The patterns provided are current as of 2024, but may need adjustment if YouTube changes its frontend. Implement monitoring to detect extraction failures.
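Monitoring can start with a check that every extraction result contains the fields the patterns are supposed to produce. The sketch below assumes results are dictionaries with the keys shown; the field list and function names are illustrative. The description is left out of the required set because a video can legitimately have an empty one.

```python
# Fields whose absence suggests a pattern has stopped matching
REQUIRED_FIELDS = ("title", "channel", "duration_seconds")

def missing_fields(metadata: dict) -> list:
    """List required fields that are absent or empty in an extraction result."""
    return [name for name in REQUIRED_FIELDS if metadata.get(name) in (None, "")]

def failure_rate(results: list) -> float:
    """Fraction of extraction results with at least one missing field."""
    if not results:
        return 0.0
    failed = sum(1 for metadata in results if missing_fields(metadata))
    return failed / len(results)

# Hand-written sample batch: one complete result, one degraded result
batch = [
    {"title": "Talk", "channel": "Conf", "duration_seconds": 213},
    {"title": "", "channel": "Conf", "duration_seconds": None},
]
print(missing_fields(batch[1]))
print(failure_rate(batch))
```

A sudden rise in the failure rate across many videos points to a page structure change rather than a problem with individual videos.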

4. Private or Age-restricted Videos

Some videos require authentication or have restricted access. The extraction may fail or return incomplete data for these videos.
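Restricted videos can often be detected before any field extraction is attempted. The sketch below assumes the decoded player response carries a playabilityStatus object with a status string and an optional reason, where "OK" means the video is playable; the field names and the sample values are assumptions to verify against real responses.

```python
def playability(player_response: dict):
    """Return (is_playable, status, reason) from a decoded player response."""
    status_block = player_response.get("playabilityStatus") or {}
    status = status_block.get("status", "UNKNOWN")
    reason = status_block.get("reason")
    return status == "OK", status, reason

# Hand-written samples, not captured from YouTube
ok = {"playabilityStatus": {"status": "OK"}, "videoDetails": {"title": "Talk"}}
restricted = {
    "playabilityStatus": {
        "status": "LOGIN_REQUIRED",
        "reason": "Sign in to confirm your age",
    }
}
print(playability(ok))
print(playability(restricted))
```

Recording the status and reason alongside a failed job makes it clear that the extraction itself did not break.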

5. Internationalization

Videos in different languages may have special characters or different metadata structures. Ensure your processing handles UTF-8 encoding properly.

Integration with Trax Platform

For the Trax platform, this YouTube metadata extraction capability could be integrated in several ways:

1. Media File Enrichment

When a YouTube URL is provided as a source, automatically extract and store metadata in the media_files table, potentially using the JSONB column for flexible storage:

ALTER TABLE media_files ADD COLUMN metadata JSONB;

2. Transcription Context Enhancement

Use video metadata to improve transcription accuracy by providing context to the Whisper model, especially for technical terms mentioned in the video title or description.
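One way to pass this context along is a short text prompt built from the metadata. The open-source whisper package accepts such text through the initial_prompt argument of its transcribe function; whether Trax uses that package is an assumption here, and the helper name and the 600-character budget are arbitrary choices for illustration.

```python
def build_transcription_prompt(title, channel, description, max_chars=600):
    """Compose a short context string from video metadata.

    max_chars is an arbitrary illustrative budget, not a Whisper limit.
    """
    parts = []
    if title:
        parts.append(f"Title: {title}.")
    if channel:
        parts.append(f"Channel: {channel}.")
    if description:
        # Keep only the first non-empty line; later lines are often links
        first_line = next(
            (line.strip() for line in description.splitlines() if line.strip()), ""
        )
        if first_line:
            parts.append(first_line)
    return " ".join(parts)[:max_chars]

# Placeholder metadata, not from a real video
prompt = build_transcription_prompt(
    "Intro to Kubernetes Operators",
    "Hypothetical Conf",
    "\nA talk on CRDs and controllers.\nhttps://example.com",
)
print(prompt)
```

The resulting string would then be passed as, for example, model.transcribe(path, initial_prompt=prompt).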

3. FastAPI Endpoint

Add a dedicated endpoint in the FastAPI interface for metadata extraction:

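from fastapi import APIRouter

router = APIRouter()

@router.get("/api/v1/youtube/metadata/{video_id}")
async def get_youtube_metadata(video_id: str):
    # extract_youtube_metadata is the project's own extraction coroutine (not shown here)
    metadata = await extract_youtube_metadata(video_id)
    return metadata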

4. Background Processing

Implement the extraction as a background task to avoid blocking API responses:

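from fastapi import BackgroundTasks

# router is the application's APIRouter instance; create_job_record and
# extract_metadata_and_process are project helpers (not shown here)
@router.post("/api/v1/jobs/youtube")
async def create_youtube_job(video_id: str, background_tasks: BackgroundTasks):
    # Create job record
    job_id = create_job_record(video_id)
    
    # Schedule the extraction to run after the response has been sent
    background_tasks.add_task(extract_metadata_and_process, video_id, job_id)
    
    return {"job_id": job_id, "status": "processing"}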

Conclusion

Extracting metadata from YouTube using curl and regex patterns provides a lightweight, efficient approach to gathering contextual information about video content. The patterns and techniques outlined in this research are tailored to YouTube's current (2024) page structure and can be integrated into the Trax platform to enhance its media processing capabilities.

For production use, consider implementing robust error handling, rate limiting compliance, and regular monitoring of pattern effectiveness as YouTube's page structure evolves. This approach is efficient for moderate volumes; for very high-volume processing, the official YouTube Data API is worth considering as an alternative, despite its limitations and quotas.

By incorporating this YouTube metadata extraction capability, the Trax platform can provide richer context for transcriptions and a more comprehensive media processing solution.


Generated by Task Master Research Command
Timestamp: 2025-08-30T11:51:55.698Z