Algorithmic first - AI only when stuck

It parses the file.
It knows when it can't.

terbium reconstructs a document's structure geometrically, detecting columns, rows, and 2-D matrices from the raw position of the text. It scores its own confidence on every record and reaches for an AI model only when it is genuinely stuck. Give it no key and it will neither fail silently nor burn tokens. Instead it tells you so, by name, along with the model tier it recommends.

What one vendor PDF actually hides

catalogue.pdf

Pages: 192
Unique SKUs: 964
Dimension rows: 505
Images: 1157

Confidence, not guesswork

It asks for help by name.

Every run ends with a verdict, not a stack trace. When the geometry is unambiguous, terbium finishes cold with zero tokens spent. When a handful of pages are genuinely hard, it says exactly which ones, why, and what to hand it next.

terbium - run summary
terbium: 47/52 products parsed confidently.
5 pages have ambiguous matrices (orphan SKUs, uncertain column alignment).
-> set ANTHROPIC_API_KEY or pass ai=terbium.AI(...)   ·   recommended tier: Opus
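
A verdict like the one above can be derived from per-record confidence scores. The following is a minimal sketch, not terbium's implementation: the `Record` class, the 0.0 to 1.0 scale, the 0.8 threshold, and the `summarise` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    page: int
    confidence: float  # hypothetical scale: 0.0 (no idea) to 1.0 (certain)
    reason: str = ""   # why the record is uncertain, if it is


def summarise(records, threshold=0.8, has_key=False):
    """Build a run verdict: how many records are confident, and which pages are not."""
    ambiguous = [r for r in records if r.confidence < threshold]
    confident = len(records) - len(ambiguous)
    lines = [f"{confident}/{len(records)} products parsed confidently."]
    if ambiguous:
        pages = sorted({r.page for r in ambiguous})
        reasons = sorted({r.reason for r in ambiguous if r.reason})
        lines.append(f"{len(pages)} pages need help: {pages} ({', '.join(reasons)}).")
        if not has_key:
            # Say so explicitly rather than failing silently.
            lines.append("-> no AI key configured; set one to resolve these pages.")
    return lines


records = [
    Record(page=1, confidence=0.95),
    Record(page=2, confidence=0.90),
    Record(page=3, confidence=0.40, reason="orphan SKUs"),
    Record(page=7, confidence=0.60, reason="uncertain column alignment"),
]
verdict = summarise(records)
print("\n".join(verdict))
```

The point of the design is that the same scores drive both outcomes: a clean run produces a one-line verdict, and an uncertain run names the pages and reasons instead of raising.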

Four formats, one model of the page

Every file, reduced to columns, rows, and matrices.

PDF · Geometry engine

Word-level geometry

Reads the raw x/y position of every word and rebuilds the columns, rows, and 2-D matrices the page never labelled. This is where the full structural engine runs.
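
To make the idea concrete, here is a minimal sketch of rebuilding a grid from word positions alone. It is an illustration, not terbium's engine: the `Word` class, the tolerances, and the simple gap-based clustering are assumptions, and it treats y as increasing down the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word, in page units
    y: float  # vertical position, assumed to increase down the page


def cluster(values, tolerance):
    """Group sorted values into bands; a gap wider than `tolerance` starts a new band."""
    bands = []
    for value in sorted(values):
        if bands and value - bands[-1][-1] <= tolerance:
            bands[-1].append(value)
        else:
            bands.append([value])
    # Represent each band by its mean position.
    return [sum(band) / len(band) for band in bands]


def nearest(centres, value):
    """Index of the band centre closest to `value`."""
    return min(range(len(centres)), key=lambda i: abs(centres[i] - value))


def rebuild_grid(words, x_tol=5.0, y_tol=3.0):
    """Place each word in a (row, column) cell derived purely from its position."""
    columns = cluster([w.x for w in words], x_tol)
    rows = cluster([w.y for w in words], y_tol)
    grid = [["" for _ in columns] for _ in rows]
    for w in words:
        r, c = nearest(rows, w.y), nearest(columns, w.x)
        grid[r][c] = (grid[r][c] + " " + w.text).strip()
    return grid


# Slightly jittered positions, as a real page would have.
words = [
    Word("SKU", 50, 100), Word("Width", 150, 100), Word("Depth", 250, 100),
    Word("KY-01", 50.4, 120), Word("450", 151, 120.5), Word("400", 249, 120),
    Word("MD-02", 49.8, 140), Word("500", 150, 140), Word("420", 250.5, 139.6),
]
grid = rebuild_grid(words)
for row in grid:
    print(row)
```

A real page needs more than this (right-aligned numbers, wrapped cells, spanning headers), which is where a confidence score per record earns its keep.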

PPTX · Native structure

Native slides, tables, images

Walks the real slide tree, pulling text frames, native tables, and embedded images straight from the deck's own structure.

XLSX · Native structure

Cells, merged ranges, shape

Resolves merged ranges and detects whether a sheet is wide or long, so tables land in the right orientation without a guess.
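
As a rough illustration of both steps, the sketch below fills merged ranges and then guesses the sheet's shape. The helper names, the zero-based inclusive range tuples, and the "repeated keys mean long" heuristic are assumptions made for this example, not terbium's actual rules.

```python
def fill_merged(grid, merged_ranges):
    """Copy the top-left value of each merged range into every cell it covers."""
    filled = [row[:] for row in grid]
    for top, left, bottom, right in merged_ranges:  # inclusive, zero-based
        value = grid[top][left]
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                filled[r][c] = value
    return filled


def sheet_shape(rows):
    """Guess 'long' when key values repeat down the first column, else 'wide'."""
    keys = [row[0] for row in rows[1:]]  # skip the header row
    return "long" if len(set(keys)) < len(keys) else "wide"


# Wide: one row per SKU, one column per finish.
wide_sheet = [
    ["SKU", "Oak", "Walnut"],
    ["KY-01", 120, 135],
    ["MD-02", 110, 125],
]

# Long: one row per SKU and finish, with the SKU cell merged vertically.
long_sheet = [
    ["SKU", "Finish", "Price"],
    ["KY-01", "Oak", 120],
    [None, "Walnut", 135],
    ["MD-02", "Oak", 110],
    [None, "Walnut", 125],
]
resolved = fill_merged(long_sheet, [(1, 0, 2, 0), (3, 0, 4, 0)])

print(sheet_shape(wide_sheet))
print(sheet_shape(resolved))
```

Filling merged ranges first matters: until the SKU is repeated on every row it covers, the rows beneath a merged cell have no key of their own.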

CSV · Native structure

Delimiter, encoding, types

Infers the delimiter, the encoding, and the type of each column before it reads a single row, so messy exports parse cleanly.
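
The same three inferences can be approximated with the Python standard library. This is a sketch under stated assumptions rather than terbium's code: it tries only two encodings, leans on `csv.Sniffer` for the delimiter, and, unlike the description above, reads the sample rows in order to type the columns. The function names are hypothetical.

```python
import csv
import io


def decode(raw):
    """Try strict UTF-8 (with optional BOM) first, then fall back to Latin-1."""
    for encoding in ("utf-8-sig", "latin-1"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("unreachable: latin-1 decodes any byte sequence")


def column_type(values):
    """Narrowest type that fits every non-empty value: int, then float, then str."""
    values = [v for v in values if v != ""]
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster.__name__
        except ValueError:
            continue
    return "str"


def inspect_csv(raw):
    """Return (encoding, delimiter, {column name: type name}) for a CSV sample."""
    text, encoding = decode(raw)
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    rows = list(csv.reader(io.StringIO(text), dialect))
    header, body = rows[0], rows[1:]
    types = {name: column_type([row[i] for row in body])
             for i, name in enumerate(header)}
    return encoding, dialect.delimiter, types


# A semicolon-delimited export saved as Latin-1, a common "messy" case.
raw = (
    "sku;name;width;price\n"
    "KY-01;Café table;450;129.50\n"
    "MD-02;Bedside table;500;99.00\n"
).encode("latin-1")

encoding, delimiter, types = inspect_csv(raw)
print(encoding, repr(delimiter), types)
```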

PDF gets the full geometry engine because a PDF throws its structure away and terbium has to rebuild it. PPTX, XLSX, and CSV already carry native structure, so terbium leans on it and parses them cleanly and cheaply.

The detector is content-agnostic. Any column-aligned table (a financial grid, a spec sheet, a schedule, or a furniture size × finish matrix) reconstructs the same way. terbium is a general parser first; the furniture schema is an opt-in interpretation on top, not the limit.

Not every PDF is a matrix. A lookbook (a grid of photos with a name under each) is reconstructed as a label grid: one record per product, grouped by collection. And when a page is image-only, terbium does not return an empty result; it reports exactly which pages need the vision lane.

Not just the text

Pull out the product images, named by product.

terbium extracts every product photo losslessly and names each file after the product it sits beneath, dropping icons, thin banners, and logos that repeat across pages. One call, no AI key needed.

python · export_images
import terbium

# Extract every product photo and name each file after its product.
manifest = terbium.export_images("lookbook.pdf", "out/")
# out/Kyoto_Bedside_Table.jpeg, out/Meadow_Bedside_Table.jpeg, ...

shell
# or from the shell, with a manifest.csv alongside the photos
$ terbium lookbook.pdf --images out/

Per image: product, collection, page, format, pixel size, colorspace, effective dpi, dominant colour, and position, written to a manifest.csv.
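
Two of the steps described above, naming a file after its product and dropping artwork that repeats across pages, can be sketched in a few lines. This is an illustration only: the `digest` and `page` fields, the `max_pages` cutoff, and both helper names are hypothetical and do not describe terbium's internals.

```python
import re
from collections import Counter


def product_filename(product, extension):
    """Turn a product label into a filesystem-safe file name."""
    stem = re.sub(r"[^\w]+", "_", product.strip()).strip("_")
    return f"{stem}.{extension}"


def drop_repeats(images, max_pages=2):
    """Discard images whose content recurs on more than `max_pages` pages (likely logos)."""
    pages_per_digest = Counter()
    # Count each (digest, page) pair once, so duplicates on one page do not inflate the tally.
    for digest, _page in {(img["digest"], img["page"]) for img in images}:
        pages_per_digest[digest] += 1
    return [img for img in images if pages_per_digest[img["digest"]] <= max_pages]


images = [
    {"digest": "logo", "page": 1, "product": None},
    {"digest": "a1", "page": 1, "product": "Kyoto Bedside Table"},
    {"digest": "logo", "page": 2, "product": None},
    {"digest": "b2", "page": 2, "product": "Meadow Bedside Table"},
    {"digest": "logo", "page": 3, "product": None},
]
kept = drop_repeats(images)
names = [product_filename(img["product"], "jpeg") for img in kept]
print(names)
```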

Quickstart

Parse first. Add a key only when it asks.

Install, point it at a file, and read doc.stats. Wire up an AI key when, and only when, terbium tells you a page is genuinely hard.

shell · pypi
pip install terbium-parse

python · example.py
import terbium

doc = terbium.parse("Furniture Catalogue.pdf")          # algorithmic only
print(doc.stats)                                          # {confident: 47, ambiguous: 5}

doc = terbium.parse("catalogue.pdf", schema="furniture",
                    ai=terbium.AI(anthropic_key=...))     # AI on hard pages only

When it does call AI, it routes

The hard page gets the strong model. Nothing else does.

terbium matches the difficulty of a page to the cheapest model that can actually solve it, so you never pay Opus prices for a page Haiku could clear.

Trivial
Haiku

A stray label or an obvious cell. Cleared for pennies.

Moderate
Sonnet

A table whose alignment is plausible but not certain.

Hard
Opus

Ambiguous matrices and orphan SKUs. The real work.
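
The routing idea can be sketched as a mapping from a page's confidence score to the cheapest tier assumed able to solve it. The thresholds and function names below are invented for illustration; the source does not say how terbium actually measures difficulty.

```python
from typing import Optional


def recommend_tier(confidence: float) -> Optional[str]:
    """Map a page's confidence score to a model tier (thresholds are hypothetical)."""
    if confidence >= 0.9:
        return None        # confident: no model call at all
    if confidence >= 0.7:
        return "Haiku"     # trivial: a stray label or an obvious cell
    if confidence >= 0.4:
        return "Sonnet"    # moderate: plausible but uncertain alignment
    return "Opus"          # hard: ambiguous matrices, orphan SKUs


def route(page_confidence):
    """Group page numbers by recommended tier, skipping pages that need no model."""
    plan = {}
    for page, confidence in sorted(page_confidence.items()):
        tier = recommend_tier(confidence)
        if tier is not None:
            plan.setdefault(tier, []).append(page)
    return plan


plan = route({1: 0.97, 2: 0.75, 3: 0.55, 4: 0.20, 5: 0.93, 6: 0.35})
print(plan)
```

Pages that clear the top threshold never appear in the plan, which is what keeps a clean run at zero tokens.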

Images such as material icons and finish swatches are read by a vision model and folded back into the record.