Clean content extraction for LLMs — 5 layers deep
The Content Extraction Protocol (CEP) cascades through 5 techniques — from fast CSS selectors to headless rendering to OCR — selecting the right method for each URL automatically. The result: clean, token-budgeted content ready for any LLM context window.
The 5-Layer Extraction Cascade
CEP automatically selects the right layer for each URL. Fast layers run first; slower layers activate only when needed. Typical extraction time is 15–200 ms for most web pages.
Layer 1, CSS selectors (fast path): if the page has semantic HTML (article, main, .content), CSS selectors extract clean content in milliseconds without rendering.
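A minimal sketch of the fast path using only the standard library's HTML parser. The container tags and the `.content` class come from the description above; the class and function names, the void-tag list, and the fall-through-on-`None` convention are illustrative assumptions, not Fetchium's actual implementation.

```python
from html.parser import HTMLParser

FAST_PATH_TAGS = {"article", "main"}          # semantic containers from the docs
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "source", "wbr"}

class FastPathExtractor(HTMLParser):
    """Collect text inside the first semantic container (article/main/.content)."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the matched container (0 = outside)
        self.chunks = []

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag in FAST_PATH_TAGS or "content" in classes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return                      # void elements have no closing tag
        if self.depth:
            self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def extract_fast_path(html):
    """Return container text, or None to signal a fall-through to the next layer."""
    parser = FastPathExtractor()
    parser.feed(html)
    return " ".join(parser.chunks) or None
```

When `extract_fast_path` returns `None`, a cascade like CEP's would hand the page to the next, slower layer rather than fail.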
Layer 2, Readability: the Mozilla Readability algorithm removes boilerplate (navigation, ads, footers) and extracts the main article body, with high accuracy on most news and blog content.
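Readability-style boilerplate removal leans heavily on link density: navigation and footer blocks are mostly links, article paragraphs mostly are not. The sketch below shows that one heuristic in isolation; the thresholds and function names are illustrative assumptions, and the real algorithm combines many more signals.

```python
import re

def _visible_text(html_block):
    """Strip tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html_block)
    return re.sub(r"\s+", " ", text).strip()

def link_density(html_block):
    """Fraction of a block's visible text that sits inside <a> tags."""
    anchor_text = " ".join(re.findall(r"<a\b[^>]*>(.*?)</a>", html_block, re.S))
    total = len(_visible_text(html_block))
    return len(anchor_text.strip()) / total if total else 1.0

def keep_block(html_block, max_density=0.4, min_chars=25):
    """Readability-style filter: drop short, link-heavy blocks (nav, footers)."""
    return (len(_visible_text(html_block)) >= min_chars
            and link_density(html_block) <= max_density)
```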
Layer 3, headless rendering: for JavaScript-heavy SPAs, Chromium renders the page fully before extraction, capturing dynamically loaded content that pure HTML parsing misses.
Layer 4, document extraction: PDFs, academic papers, and document URLs are handled natively. Text is extracted, tables are preserved, and images are skipped unless OCR is requested.
Layer 5, OCR (last resort): if the page is an image-only document or has text encoded as graphics, OCR extracts the visible text. This is the slowest layer and is used rarely.
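The cascade above can be sketched as a dispatcher that picks the cheapest layer likely to succeed. The routing heuristics here (content-type checks, a visible-text threshold, a semantic-container probe) are illustrative assumptions about how such a dispatcher might work, not Fetchium's actual rules.

```python
import re

def choose_layer(url, content_type, html=""):
    """Route a fetched resource to the cheapest plausible CEP layer."""
    # Documents and images short-circuit to their dedicated layers.
    if content_type == "application/pdf" or url.lower().endswith(".pdf"):
        return "layer4:document"
    if content_type.startswith("image/"):
        return "layer5:ocr"

    # A script-heavy shell with almost no visible text needs rendering.
    stripped = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
    visible = re.sub(r"<[^>]+>", " ", stripped).strip()
    if len(visible) < 200 and "<script" in html.lower():
        return "layer3:headless"

    # Semantic containers enable the CSS fast path; otherwise fall to Readability.
    if re.search(r"<(article|main)\b", html, re.I) or 'class="content"' in html:
        return "layer1:css"
    return "layer2:readability"
```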
Token-Budgeted Extraction
After extraction, QATBE (Query-Aware Token-Budgeted Extraction) scores each content segment by BM25 relevance to your query, then packs the highest-scoring segments into your token budget using a greedy knapsack algorithm.
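The BM25 scoring and greedy packing described above can be sketched as follows. This is a minimal illustration under stated assumptions: a crude regex tokenizer stands in for real tokenization, whitespace-token counts stand in for LLM token counts, and the function names are hypothetical.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(segments, query, k1=1.5, b=0.75):
    """Score each segment's relevance to the query with BM25."""
    docs = [tokenize(s) for s in segments]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    q_terms = tokenize(query)
    df = {t: sum(1 for d in docs if t in d) for t in q_terms}  # document frequency
    scores = []
    for d in docs:
        score = 0.0
        for t in q_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

def pack_to_budget(segments, query, token_budget):
    """Greedy knapsack: take highest-scoring segments until the budget is full."""
    scores = bm25_scores(segments, query)
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in order:
        cost = len(tokenize(segments[i]))       # stand-in for a real token count
        if scores[i] > 0 and used + cost <= token_budget:
            picked.append(i)
            used += cost
    return [segments[i] for i in sorted(picked)]  # restore document order
```

Re-sorting the picked indices at the end keeps the packed segments in original document order, so the LLM sees them in a coherent reading sequence rather than by score.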
Extract a URL
```shell
curl -X POST https://api.fetchium.com/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "query": "async rust patterns",
    "token_budget": 4096,
    "format": "markdown",
    "extract_citations": true
  }'
```
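The same request can be built from Python's standard library. The endpoint, headers, and payload fields mirror the curl example above; the function name is illustrative, and the response shape is not documented here, so the call itself is left commented out.

```python
import json
import urllib.request

API_URL = "https://api.fetchium.com/v1/scrape"

def build_scrape_request(api_key, url, query, token_budget=4096):
    """Assemble the same POST request the curl example sends."""
    payload = {
        "url": url,
        "query": query,
        "token_budget": token_budget,
        "format": "markdown",
        "extract_citations": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# req = build_scrape_request("YOUR_API_KEY", "https://example.com/article",
#                            "async rust patterns")
# with urllib.request.urlopen(req) as resp:   # requires a valid API key
#     body = json.load(resp)
```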