Product · Extract API

Clean content extraction for LLMs — 5 layers deep

The Content Extraction Protocol (CEP) cascades through 5 techniques — from fast CSS selectors to headless rendering to OCR — selecting the right method for each URL automatically. The result: clean, token-budgeted content ready for any LLM context window.

5
Extraction layers
~15ms
Typical (CSS)
~800ms
JS render path
90%
Token reduction

The 5-Layer Extraction Cascade

CEP automatically selects the right layer for each URL. Fast layers run first; slower layers only activate when needed. Average extraction time is 15–200ms for most web pages.

1
CSS Selector Extraction~5ms

Fast-path: if the page has semantic HTML (article, main, .content), CSS selectors extract clean content in milliseconds without rendering.

2
Readability Parsing~15ms

Mozilla Readability algorithm removes boilerplate (navigation, ads, footers) and extracts the main article body with high accuracy on most news and blog content.

3
Headless JS Rendering~800ms

For JavaScript-heavy SPAs, Chromium renders the page fully before extraction. Captures dynamically loaded content that pure HTML parsing misses.

4
PDF Extraction~200ms

PDFs, academic papers, and document URLs are handled natively — text extracted, tables preserved, images skipped unless OCR is requested.

5
Screenshot OCR~2,000ms

Last resort: if the page is an image-only document or has text encoded as graphics, OCR extracts the visible text. This is the slowest layer, used rarely.

QATBE Algorithm

Token-Budgeted Extraction

After extraction, QATBE (Query-Aware Token-Budgeted Extraction) scores each content segment by BM25 relevance to your query, then packs the highest-scoring segments into your token budget using a greedy knapsack algorithm.

60–90%
Token reduction vs. raw HTML
BM25
Query-relevance scoring per segment
4096
Default token budget (configurable)

Extract a URL

cURLPOST /v1/scrape
curl -X POST https://api.fetchium.com/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "query": "async rust patterns",
    "token_budget": 4096,
    "format": "markdown",
    "extract_citations": true
  }'

Related