Reap the web.

curl_reap is one small Python library that puts three scraping superpowers behind a single friendly API: browser-grade TLS impersonation, self-healing selectors that survive markup changes, and a concurrent crawl engine.


Feature	curl_cffi	Scrapy	Scrapling	curl_reap
Real browser TLS / JA3	yes	no	partial	yes
Parser built in	no	yes	yes	yes
Self-healing selectors	no	no	yes	yes
Concurrent crawl engine	no	yes	no	yes
AutoThrottle, retries, pipelines	no	yes	no	yes

Installation

Requires Python 3.9 or newer. Installing it pulls in curl_cffi, lxml, and cssselect.

pip install curl_reap

Quick start

Fetch and parse

A one-shot fetch returns a page you can query with parsel-style selectors, while the request itself carries a genuine Chrome fingerprint at the TLS layer.

import curl_reap as reap

page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())
print(page.css_first("small.author::text"))

Crawl with concurrency

import curl_reap as reap

class Quotes(reap.Spider):
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, page):
        for q in page.css("div.quote"):
            yield {"text": q.css_first("span.text::text"),
                   "author": q.css_first("small.author::text")}
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            yield reap.Request("https://quotes.toscrape.com" + nxt, self.parse)

items = reap.run(Quotes, concurrency=8, throttle=True)
print(len(items), "items reaped")
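
The spider above builds the next-page URL by concatenating the site root with the href, which works because quotes.toscrape.com emits root-relative links such as /page/2/. For sites that emit page-relative or absolute links, the standard library's urllib.parse.urljoin is a safer way to resolve them. A minimal sketch, using made-up hrefs (inside parse you would presumably pass page.url as the base):

```python
from urllib.parse import urljoin

# Hypothetical current page URL, standing in for page.url.
base = "https://quotes.toscrape.com/page/2/"

# Root-relative href: resolved against the site root.
print(urljoin(base, "/page/3/"))

# Page-relative href: resolved against the current path.
print(urljoin(base, "author/Albert-Einstein"))

# Absolute href: returned unchanged.
print(urljoin(base, "https://example.com/about"))
```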

Transport and impersonation

Every request goes out with a real browser TLS/JA3 and HTTP/2 fingerprint (powered by curl_cffi). This is what gets requests past fingerprint-based bot detection that blocks stock Python clients.

reap.get(url, impersonate="chrome124", headers=None, timeout=30, retries=2, **kw) -> Response

One-shot GET on a shared impersonating session. reap.post and reap.fetch work the same way. Any extra keyword is passed straight to curl_cffi (params, data, json, proxies, cookies, and so on).

reap.Session(impersonate="chrome124", headers=None, timeout=30, retries=2, proxies=None, **kw)

A reusable client that keeps cookies and connection pooling. Methods: .get(url, **kw), .post(url, **kw), .request(method, url, **kw), .close().

from curl_reap import Session

s = Session(impersonate="safari17_0", headers={"Accept-Language": "en"}, retries=3)
r = s.get("https://httpbin.org/headers")
print(r.status, r.ok, r.url)
print(r.json())
s.close()

The Response object

A Response behaves like a page: .css / .xpath / .find_by_text pass through to a parser. Attributes: .status, .ok, .url, .headers, .text, .content, .meta. Methods: .selector(), .json().

Available impersonate targets follow curl_cffi, for example chrome99 through chrome124, edge99 and edge101, and safari15_3 through safari17_0. Pick a recent one, such as chrome124, for the broadest coverage.

Selectors and parsing

The parser is built on lxml with parsel-style ergonomics. CSS supports the ::text and ::attr(name) pseudo elements.

Selector.css(query, auto_match=False, identifier=None, storage=None) -> SelectorList

Select by CSS. Returns a SelectorList of child Selectors, or of strings when the query ends in ::text or ::attr(...).

page.css("h1::text").get()              # first match, or None
page.css("a.item::attr(href)").getall() # every href as a list
page.css("div.card")                    # SelectorList of element Selectors
page.css_first("span.price::text", default="n/a")

# nested selection
for card in page.css("div.card"):
    title = card.css_first(".title::text")
    link  = card.css_first("a::attr(href)")
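
To make the ::text and ::attr(name) convention concrete, here is a minimal sketch of how such a query could be split into a plain CSS selector and an extraction step. The helper name split_pseudo and the regex are assumptions for illustration, not curl_reap's actual code.

```python
import re

# Matches a trailing ::text or ::attr(name) pseudo-element.
_PSEUDO = re.compile(r"::(text|attr\(\s*([\w:-]+)\s*\))\s*$")

def split_pseudo(query):
    """Return (css_query, kind, attr_name); kind is 'element', 'text' or 'attr'."""
    match = _PSEUDO.search(query)
    if match is None:
        return query.strip(), "element", None
    css = query[:match.start()].strip()
    if match.group(1) == "text":
        return css, "text", None
    return css, "attr", match.group(2)

print(split_pseudo("h1::text"))            # ('h1', 'text', None)
print(split_pseudo("a.item::attr(href)"))  # ('a.item', 'attr', 'href')
print(split_pseudo("div.card"))            # ('div.card', 'element', None)
```

The plain CSS part would then go to the CSS engine, and the kind decides whether the result is a list of elements or a list of strings.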

XPath, regex, and values

page.xpath("//div[@class='quote']/span/text()").getall()
page.re(r"price:\s*([\d.]+)")

el = page.css_first("a.author")
el.text          # text content, stripped
el.attr("href")  # one attribute
el.attrib        # dict of all attributes
el.html          # outer HTML of the element
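
To see what the pattern passed to page.re captures, here it is applied with the standard library's re module to a made-up string. That page.re returns the captured groups in the same shape as re.findall is an assumption.

```python
import re

# Made-up page text for illustration.
text = "Widget price: 19.99 USD, gift wrap price:5 USD"
pattern = r"price:\s*([\d.]+)"

# findall returns only the captured group for each match.
matches = re.findall(pattern, text)
print(matches)   # ['19.99', '5']
```

Note that [\d.]+ also swallows a trailing full stop, so a sentence ending in "price: 19.99." would yield "19.99."; tighten the pattern if the source text is prose.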

SelectorList helpers

.get(default=None), .getall(), .text(), .attr(name), and .css(query) (mapped across every element in the list).

Self-healing selectors

The biggest maintenance cost in scraping is selectors breaking when a site renames a class or reshuffles its DOM. curl_reap can save a structural signature of an element once, then relocate it later by scoring every node against that signature.

Selector.save(identifier, storage=None)

Persist this element's signature (tag, classes, id, attributes, text, DOM path, position) to a small JSON store. Default store: .reap_selectors.json.
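
As a rough picture of what such a store could look like, the sketch below writes one signature into a JSON file keyed by identifier and reads it back. The field names follow the list above; the file layout and the helper names save_signature and load_signature are assumptions, not the library's on-disk format.

```python
import json
import os
import tempfile

def save_signature(identifier, signature, path):
    """Merge one signature into a JSON file keyed by identifier."""
    store = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            store = json.load(fh)
    store[identifier] = signature
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(store, fh, indent=2, sort_keys=True)

def load_signature(identifier, path):
    """Return the saved signature, or None if the file or key is missing."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fh:
        return json.load(fh).get(identifier)

# Hypothetical signature for the buy button; values are made up.
buy_button = {
    "tag": "a",
    "classes": ["buy-btn"],
    "id": None,
    "attributes": {"href": "/cart/add/42"},
    "text": "Buy now",
    "path": ["html", "body", "main", "div", "a"],
    "position": 3,
}

with tempfile.TemporaryDirectory() as tmp:
    store_path = os.path.join(tmp, "selectors.json")
    save_signature("buy_button", buy_button, store_path)
    restored = load_signature("buy_button", store_path)
    missing = load_signature("unknown_id", store_path)

print(restored == buy_button, missing)
```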

Selector.css(query, auto_match=True, identifier="...", storage=None)

If the normal query finds nothing and auto_match=True, curl_reap relocates the element from the saved signature for that identifier.

import curl_reap as reap

# 1. remember the button once, from today's markup
page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button")

# 2. weeks later the class is renamed to "purchase-cta".
#    the old selector misses, but auto_match relocates it:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href"))   # found anyway

Under the hood

Exposed for advanced use: reap.signature(element), reap.similarity(sig_a, sig_b) (returns 0..1), reap.save(id, element, storage), and reap.relocate(id, tree, storage, threshold=0.6). Matching weights tag, class set (Jaccard), id, attributes, text, and DOM path.
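
The scoring idea can be sketched in a few lines: compare two signatures field by field, use the Jaccard index for set-like fields, and combine the parts with weights. The weights, the dict layout, and the function names below are hypothetical stand-ins chosen for illustration; they are not the values or code curl_reap uses.

```python
def jaccard(a, b):
    """Jaccard index of two collections; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical weights that sum to 1.0.
WEIGHTS = {"tag": 0.2, "classes": 0.2, "id": 0.1,
           "attributes": 0.15, "text": 0.2, "path": 0.15}

def sketch_similarity(sig_a, sig_b):
    """Weighted score in [0, 1] between two signature dicts."""
    parts = {
        "tag": float(sig_a["tag"] == sig_b["tag"]),
        "classes": jaccard(sig_a["classes"], sig_b["classes"]),
        "id": float(sig_a["id"] == sig_b["id"]),
        "attributes": jaccard(sig_a["attributes"].items(), sig_b["attributes"].items()),
        "text": float(sig_a["text"].strip().lower() == sig_b["text"].strip().lower()),
        # Simplification: the DOM path is compared as a set, ignoring order.
        "path": jaccard(sig_a["path"], sig_b["path"]),
    }
    return sum(WEIGHTS[name] * score for name, score in parts.items())

def sketch_relocate(saved, candidates, threshold=0.6):
    """Return the best-scoring candidate at or above threshold, else None."""
    best = max(candidates, key=lambda c: sketch_similarity(saved, c), default=None)
    if best is None or sketch_similarity(saved, best) < threshold:
        return None
    return best

saved = {"tag": "a", "classes": ["buy-btn"], "id": None,
         "attributes": {"href": "/cart/add/42"}, "text": "Buy now",
         "path": ["html", "body", "main", "div"]}
renamed = {"tag": "a", "classes": ["purchase-cta"], "id": None,
           "attributes": {"href": "/cart/add/99"}, "text": "Buy now",
           "path": ["html", "body", "main", "div"]}
unrelated = {"tag": "a", "classes": ["nav-link"], "id": None,
             "attributes": {"href": "/about"}, "text": "About us",
             "path": ["html", "body", "header", "nav"]}

print(round(sketch_similarity(saved, renamed), 2))
print(round(sketch_similarity(saved, unrelated), 2))
print(sketch_relocate(saved, [unrelated, renamed]) is renamed)
```

With these made-up weights, the renamed button loses its class and href score but keeps tag, text, and path, which is enough to clear the 0.6 threshold; the navigation link does not.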

Finders

Selector.find_by_text(text, partial=True, first=False) -> SelectorList

Find elements whose text contains (or equals, if partial=False) the given string.

Selector.find_similar(sample, threshold=0.6, limit=None) -> SelectorList

Given one element, return others that are structurally similar. Great for grabbing every repeated card or row without writing a brittle class selector.

signin = page.find_by_text("Sign in", first=True).get()

first_card = page.css_first("div.card")
all_cards  = page.find_similar(first_card)   # every card like it
print(len(all_cards))

Geocoding

Many scraped listings carry a name and an area but no coordinates (the lat/lon sits behind a wall, or is simply not in the markup). A naive single query often fails: the full marketing name is not in the gazetteer, and a bare district query collapses every listing onto one centroid. The Geocoder fixes that with a cascade and tells you the precision of each hit.

reap.Geocoder(min_interval=1.1, cache=".reap_geocode.json", min_importance=0.0)

A cascading geocoder over OpenStreetMap Nominatim, cached and rate limited.

Geocoder.geocode(name=None, area=None, city=None, country=None) -> dict | None

Tries the raw name, then the name with generic suffixes stripped (Serviced Apartments, Aparthotel, by <brand>, ...), then the district, then the city. Stops at the first hit and returns {lat, lon, precision, importance, display_name, query}. precision is "name", "district", or "city", so you keep the exact ones and drop coarse centroids.

from curl_reap import Geocoder

geo = Geocoder()
hit = geo.geocode(name="Chelsea Cloisters Serviced Apartments",
                  area="Kensington", city="London", country="United Kingdom")
# {'lat': 51.49.., 'lon': -0.17.., 'precision': 'name', ...}

listing = {"name": "Chelsea Cloisters Serviced Apartments"}   # the scraped record being enriched
if hit and hit["precision"] in ("name", "district"):
    listing["lat"], listing["lon"] = hit["lat"], hit["lon"]   # keep accurate ones only
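
The cascade itself is easy to picture as a list of progressively coarser queries, each tagged with the precision it would yield. The sketch below is an offline illustration under stated assumptions: the suffix regex, the way city and country are appended to each query, the helper names, and the stand-in gazetteer with its coordinates are all made up, and no Nominatim call is made.

```python
import re

# Hypothetical suffix list, modeled on the examples in the text.
_GENERIC = re.compile(
    r"\s+(?:serviced apartments|aparthotel|by\s+\w[\w\s&'-]*)$", re.IGNORECASE
)

def cascade_queries(name=None, area=None, city=None, country=None):
    """Return (query, precision) pairs, most specific first."""
    def join(*parts):
        return ", ".join(p for p in parts if p)

    queries = []
    if name:
        queries.append((join(name, city, country), "name"))
        stripped = _GENERIC.sub("", name).strip()
        if stripped and stripped != name:
            queries.append((join(stripped, city, country), "name"))
    if area:
        queries.append((join(area, city, country), "district"))
    if city:
        queries.append((join(city, country), "city"))
    return queries

def geocode_cascade(lookup, **fields):
    """Try each query with lookup(query) and stop at the first hit."""
    for query, precision in cascade_queries(**fields):
        hit = lookup(query)
        if hit is not None:
            return {**hit, "precision": precision, "query": query}
    return None

# Stand-in gazetteer so the sketch runs offline; coordinates are illustrative only.
fake_gazetteer = {
    "Chelsea Cloisters, London, United Kingdom": {"lat": 51.49, "lon": -0.17},
    "London, United Kingdom": {"lat": 51.51, "lon": -0.13},
}

exact = geocode_cascade(fake_gazetteer.get,
                        name="Chelsea Cloisters Serviced Apartments",
                        area="Kensington", city="London", country="United Kingdom")
coarse = geocode_cascade(fake_gazetteer.get,
                         name="Unknown Place Aparthotel",
                         area="Soho", city="London", country="United Kingdom")
print(exact["precision"], exact["query"])
print(coarse["precision"], coarse["query"])
```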

Be a good citizen: Nominatim asks for a real User-Agent and a max of one request per second. The Geocoder sets both by default and caches every lookup to disk, so repeat runs are instant.
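
The combination of a minimum interval and a cache can be sketched as follows. This is an illustration of the idea only: the class and function names are hypothetical, the cache is a plain dict (writing it to disk is omitted), and a fake clock replaces real sleeping so the demo runs instantly.

```python
import time

class MinInterval:
    """Block so that successive calls are at least `seconds` apart."""

    def __init__(self, seconds, clock=time.monotonic, sleep=time.sleep):
        self.seconds = seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.seconds - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

def cached_lookup(query, cache, limiter, fetch):
    """Return the cached result, calling fetch (rate limited) only on a miss."""
    if query not in cache:
        limiter.wait()
        cache[query] = fetch(query)
    return cache[query]

class FakeClock:
    """Test double: records requested sleeps and advances time without waiting."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds

clock = FakeClock()
limiter = MinInterval(1.1, clock=clock.time, sleep=clock.sleep)
cache, calls = {}, []

def fake_fetch(query):
    calls.append(query)          # record every real (uncached) lookup
    return {"query": query}

for query in ("London", "Paris", "London"):
    cached_lookup(query, cache, limiter, fake_fetch)

print(calls)        # ['London', 'Paris']
print(clock.slept)  # [1.1]
```

The repeated "London" lookup is served from the cache, so it neither waits nor hits the (fake) service.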

Crawl engine

A small concurrent engine. A Spider yields items (dicts) and more Request objects; the engine handles concurrency, dedup, retries, AutoThrottle, and pipelines.

class Spider: name, start_urls, def parse(self, page)

Set start_urls and implement parse. Override start() if you need custom seed requests.

Request(url, callback=None, method="GET", meta=None, **kw)

A pending fetch plus its parse callback. meta is carried onto the Response. Extra kwargs go to the transport.

reap.run(spider, concurrency=8, retries=2, throttle=True, delay=0.0, impersonate="chrome124", pipelines=None, dedup=True, on_item=None, max_pages=None) -> list

Run a Spider (class or instance) to completion and return the scraped items. reap.Reaper(...) is the same with a .run() method and a .stats dict (requests, items, errors, dropped).

import curl_reap as reap

# MySpider stands for any reap.Spider subclass whose items have a "title" key
reaper = reap.Reaper(MySpider(), concurrency=12, throttle=True,
                     on_item=lambda it: print("got", it["title"]))
items = reaper.run()
print(reaper.stats)   # {'requests': 40, 'items': 380, 'errors': 0, 'dropped': 4}
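
For intuition about what an engine like this does, here is a minimal sketch of a concurrent crawl loop with URL dedup and a page cap. It uses a thread pool and an in-memory stand-in site; whether curl_reap uses threads or asyncio internally is not stated here, and the names crawl, fetch_and_parse, and FAKE_SITE are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Stand-in site: URL -> (items on the page, links found on the page).
FAKE_SITE = {
    "/": (["a"], ["/p2", "/p3"]),
    "/p2": (["b"], ["/", "/p3"]),   # links back to pages already seen
    "/p3": (["c", "d"], []),
}

def fetch_and_parse(url):
    """Hypothetical worker: returns (items, links) for one page."""
    return FAKE_SITE[url]

def crawl(start_urls, concurrency=8, max_pages=None):
    frontier = list(dict.fromkeys(start_urls))   # ordered, without duplicates
    seen = set(frontier)                         # every URL is scheduled at most once
    items, pending, fetched = [], set(), 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        while frontier or pending:
            # Fill the pool up to the concurrency limit.
            while frontier and len(pending) < concurrency:
                if max_pages is not None and fetched >= max_pages:
                    frontier.clear()             # page cap reached: schedule nothing more
                    break
                pending.add(pool.submit(fetch_and_parse, frontier.pop(0)))
                fetched += 1
            if not pending:
                break
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                page_items, links = future.result()
                items.extend(page_items)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return items, seen

items, seen = crawl(["/"], concurrency=2)
print(sorted(items), sorted(seen))
```

Retries, throttling, and pipelines are left out to keep the loop readable; item order depends on which fetch finishes first, hence the sort.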

AutoThrottle

With throttle=True, the engine watches response latency and adapts a per-request delay so the crawl stays polite and is less likely to trigger IP bans, the way Scrapy's AutoThrottle does. The target delay is roughly the average latency divided by your concurrency. Set delay= for a fixed floor, or throttle=False to go full speed.
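
A single update step of this kind of throttle can be sketched as below. It is modeled on the algorithm Scrapy documents for AutoThrottle (target = latency / concurrency, new delay = average of old delay and target, error responses may not lower the delay, result clamped); the function name and the ceiling parameter are assumptions, and curl_reap's own update rule may differ.

```python
def next_delay(current, latency, concurrency, floor=0.0, ceiling=60.0, ok=True):
    """One AutoThrottle-style update; all values are in seconds."""
    target = latency / concurrency
    new = (current + target) / 2.0      # move halfway toward the target
    if not ok and new < current:
        new = current                   # error responses never speed the crawl up
    return min(max(new, floor), ceiling)

delay = 1.0
history = []
for latency in (0.8, 0.8, 0.8, 4.0):    # the last response is much slower
    delay = next_delay(delay, latency, concurrency=8)
    history.append(round(delay, 5))
print(history)
```

The delay decays toward latency / concurrency while the server answers quickly and backs off again as soon as latency rises.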

Pipelines

Each scraped item flows through a chain. A pipeline that returns None drops the item.

DedupPipeline(key=None) drops duplicates (by a field, or the whole item). Added automatically unless you pass dedup=False.
JsonLinesPipeline(path) streams items to a .jsonl file as they arrive.
CsvPipeline(path) writes a CSV on close.
Subclass Pipeline and implement open / process / close for your own.

import curl_reap as reap
from curl_reap import Pipeline, DedupPipeline, JsonLinesPipeline

class PriceToFloat(Pipeline):
    def process(self, item):
        item["price"] = float(item["price"].strip("$"))
        return item   # or return None to drop it

reap.run(MySpider, pipelines=[
    DedupPipeline(key="url"),
    PriceToFloat(),
    JsonLinesPipeline("out.jsonl"),
])
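
The chain semantics (each pipeline's output feeds the next, and None drops the item) can be sketched without the library. The classes below are self-contained stand-ins that only mirror the open / process / close hooks named above; they are not curl_reap's implementations.

```python
class Pipeline:
    """Stand-in base class with the three hooks named in the text."""

    def open(self):
        pass

    def process(self, item):
        return item

    def close(self):
        pass

class Dedup(Pipeline):
    """Drop items whose key field was already seen."""

    def __init__(self, key):
        self.key = key
        self.seen = set()

    def process(self, item):
        if item[self.key] in self.seen:
            return None
        self.seen.add(item[self.key])
        return item

class DropEmptyTitle(Pipeline):
    """Drop items with a missing or empty title."""

    def process(self, item):
        return item if item.get("title") else None

def run_chain(items, pipelines):
    """Pass every item through the chain; return (kept, dropped_count)."""
    kept, dropped = [], 0
    for pipeline in pipelines:
        pipeline.open()
    try:
        for item in items:
            for pipeline in pipelines:
                item = pipeline.process(item)
                if item is None:        # a pipeline dropped it: stop the chain
                    dropped += 1
                    break
            else:                       # no break: the item survived every stage
                kept.append(item)
    finally:
        for pipeline in pipelines:
            pipeline.close()
    return kept, dropped

scraped = [
    {"url": "/a", "title": "A"},
    {"url": "/a", "title": "A again"},   # duplicate url
    {"url": "/b", "title": ""},          # empty title
    {"url": "/c", "title": "C"},
]
kept, dropped = run_chain(scraped, [Dedup("url"), DropEmptyTitle()])
print([item["url"] for item in kept], dropped)   # ['/a', '/c'] 2
```

Order matters: putting the dedup stage first means later, more expensive stages never see duplicates.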

API reference

Symbol	What it does
`reap.get / post / fetch`	One-shot impersonating request, returns Response
`reap.Session`	Reusable client (cookies, pooling, retries)
`reap.Response`	Fetched page; css/xpath pass through, plus json()
`reap.Selector`	css, css_first, xpath, re, find_by_text, find_similar, save, text, attr, html
`reap.SelectorList`	get, getall, text, attr, css
`reap.Spider`	start_urls + parse; the crawl unit
`reap.Request`	url + callback + meta
`reap.run / reap.Reaper`	Run a spider with concurrency, throttle, pipelines
`reap.Pipeline` and friends	Dedup, JsonLines, Csv, or custom
`reap.signature / similarity / save / relocate`	Self-healing selector internals

Why not just use X

curl_cffi is the transport only: brilliant TLS impersonation, no parser, no crawl engine. curl_reap uses it underneath and adds the rest.

Scrapy is a heavyweight framework with great orchestration, but its default downloader presents a non-browser TLS fingerprint that modern bot walls recognize and block. curl_reap brings the impersonation into the engine.

Scrapling pioneered self-healing selectors (the idea curl_reap borrows) and ships a stealth browser. curl_reap keeps the adaptive parsing and the lightweight footprint, without bundling a browser.

Legal and acceptable use

⚠ Read this before you scrape. curl_reap presents the same TLS fingerprint as a real browser, which is all a normal browser does at that layer. It does not solve CAPTCHAs, bypass logins or paywalls, or defeat anti-bot services (Cloudflare, DataDome, PerimeterX, Akamai Bot Manager). If a site has deployed one of those and is actively blocking you, that block is the line the maintainers expect you to respect.

This is not legal advice. Web scraping legality depends on jurisdiction, what you scrape, and how. You are responsible for your use:

Check robots.txt and read the target site's Terms of Service.
Do not circumvent technical access controls. Looking like a browser is one thing; defeating a bot-detection challenge a site deployed to block you raises CFAA / DMCA exposure in the US and analogous statutes elsewhere.
Handle personal data lawfully (GDPR, CCPA, or your local equivalent apply to you as the data controller).
Respect copyright: extract facts and data, do not mirror copyrighted text or media wholesale.
Throttle (throttle=True) and identify yourself with a descriptive User-Agent so operators can reach you.
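
The robots.txt check in the first item can be done with the standard library before any request is sent. The sketch below parses an inline, made-up robots.txt so it runs offline; the user agent string and URLs are placeholders. Against a live site you would call set_url(...) and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "example-research-bot"   # placeholder; use your own descriptive User-Agent
print(parser.can_fetch(agent, "https://example.com/quotes/"))         # True
print(parser.can_fetch(agent, "https://example.com/private/report"))  # False
print(parser.crawl_delay(agent))                                      # 2
```

A Crawl-delay value, when present, is a reasonable candidate for the delay= floor.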

Provided under the MIT License, "as is", with no warranty and no liability. Full notice and details: LEGAL.md.