Clean content extraction for LLMs — 5 layers deep
The Content Extraction Protocol (CEP) cascades through 5 techniques — from fast CSS selectors to headless rendering to OCR — selecting the right method for each URL automatically. The result: clean, token-budgeted content ready for any LLM context window.
The 5-Layer Extraction Cascade
CEP automatically selects the right layer for each URL. Fast layers run first; slower layers activate only when needed. Typical extraction time is 15–200 ms for most web pages.
Layer 1, CSS selectors (fast path): if the page has semantic HTML (article, main, .content), CSS selectors extract clean content in milliseconds without rendering.
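A minimal sketch of the fast path using only the standard library's HTML parser. The container tags and the `.content` class come from the description above; the class and function names, the void-tag list, and the fall-through-on-`None` convention are illustrative assumptions, not Fetchium's actual implementation.

```python
from html.parser import HTMLParser

FAST_PATH_TAGS = {"article", "main"}          # semantic containers from the docs
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "source", "wbr"}

class FastPathExtractor(HTMLParser):
    """Collect text inside the first semantic container (article/main/.content)."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the matched container (0 = outside)
        self.chunks = []

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag in FAST_PATH_TAGS or "content" in classes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return                      # void elements have no closing tag
        if self.depth:
            self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def extract_fast_path(html):
    """Return container text, or None to signal a fall-through to the next layer."""
    parser = FastPathExtractor()
    parser.feed(html)
    return " ".join(parser.chunks) or None
```

When `extract_fast_path` returns `None`, a cascade like CEP's would hand the page to the next, slower layer rather than fail.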
Layer 2, Readability: the Mozilla Readability algorithm removes boilerplate (navigation, ads, footers) and extracts the main article body, with high accuracy on most news and blog content.
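Readability-style boilerplate removal leans heavily on link density: navigation and footer blocks are mostly links, article paragraphs mostly are not. The sketch below shows that one heuristic in isolation; the thresholds and function names are illustrative assumptions, and the real algorithm combines many more signals.

```python
import re

def _visible_text(html_block):
    """Strip tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html_block)
    return re.sub(r"\s+", " ", text).strip()

def link_density(html_block):
    """Fraction of a block's visible text that sits inside <a> tags."""
    anchor_text = " ".join(re.findall(r"<a\b[^>]*>(.*?)</a>", html_block, re.S))
    total = len(_visible_text(html_block))
    return len(anchor_text.strip()) / total if total else 1.0

def keep_block(html_block, max_density=0.4, min_chars=25):
    """Readability-style filter: drop short, link-heavy blocks (nav, footers)."""
    return (len(_visible_text(html_block)) >= min_chars
            and link_density(html_block) <= max_density)
```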
Layer 3, headless rendering: for JavaScript-heavy SPAs, Chromium renders the page fully before extraction, capturing dynamically loaded content that pure HTML parsing misses.
Layer 4, document extraction: PDFs, academic papers, and document URLs are handled natively. Text is extracted, tables are preserved, and images are skipped unless OCR is requested.
Layer 5, OCR (last resort): if the page is an image-only document or has text encoded as graphics, OCR extracts the visible text. This is the slowest layer and is used rarely.
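The cascade above can be sketched as a dispatcher that picks the cheapest layer likely to succeed. The routing heuristics here (content-type checks, a visible-text threshold, a semantic-container probe) are illustrative assumptions about how such a dispatcher might work, not Fetchium's actual rules.

```python
import re

def choose_layer(url, content_type, html=""):
    """Route a fetched resource to the cheapest plausible CEP layer."""
    # Documents and images short-circuit to their dedicated layers.
    if content_type == "application/pdf" or url.lower().endswith(".pdf"):
        return "layer4:document"
    if content_type.startswith("image/"):
        return "layer5:ocr"

    # A script-heavy shell with almost no visible text needs rendering.
    stripped = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
    visible = re.sub(r"<[^>]+>", " ", stripped).strip()
    if len(visible) < 200 and "<script" in html.lower():
        return "layer3:headless"

    # Semantic containers enable the CSS fast path; otherwise fall to Readability.
    if re.search(r"<(article|main)\b", html, re.I) or 'class="content"' in html:
        return "layer1:css"
    return "layer2:readability"
```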
Token-Budgeted Extraction
After extraction, QATBE (Query-Aware Token-Budgeted Extraction) scores each content segment by BM25 relevance to your query, then packs the highest-scoring segments into your token budget using a greedy knapsack algorithm.
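The BM25 scoring and greedy packing described above can be sketched as follows. This is a minimal illustration under stated assumptions: a crude regex tokenizer stands in for real tokenization, whitespace-token counts stand in for LLM token counts, and the function names are hypothetical.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(segments, query, k1=1.5, b=0.75):
    """Score each segment's relevance to the query with BM25."""
    docs = [tokenize(s) for s in segments]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    q_terms = tokenize(query)
    df = {t: sum(1 for d in docs if t in d) for t in q_terms}  # document frequency
    scores = []
    for d in docs:
        score = 0.0
        for t in q_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

def pack_to_budget(segments, query, token_budget):
    """Greedy knapsack: take highest-scoring segments until the budget is full."""
    scores = bm25_scores(segments, query)
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in order:
        cost = len(tokenize(segments[i]))       # stand-in for a real token count
        if scores[i] > 0 and used + cost <= token_budget:
            picked.append(i)
            used += cost
    return [segments[i] for i in sorted(picked)]  # restore document order
```

Re-sorting the picked indices at the end keeps the packed segments in original document order, so the LLM sees them in a coherent reading sequence rather than by score.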
Extract a URL
```shell
curl -X POST https://api.fetchium.com/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "query": "async rust patterns",
    "token_budget": 4096,
    "format": "markdown",
    "extract_citations": true
  }'
```
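The same request can be built from Python's standard library. The endpoint, headers, and payload fields mirror the curl example above; the function name is illustrative, and the response shape is not documented here, so the call itself is left commented out.

```python
import json
import urllib.request

API_URL = "https://api.fetchium.com/v1/scrape"

def build_scrape_request(api_key, url, query, token_budget=4096):
    """Assemble the same POST request the curl example sends."""
    payload = {
        "url": url,
        "query": query,
        "token_budget": token_budget,
        "format": "markdown",
        "extract_citations": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# req = build_scrape_request("YOUR_API_KEY", "https://example.com/article",
#                            "async rust patterns")
# with urllib.request.urlopen(req) as resp:   # requires a valid API key
#     body = json.load(resp)
```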