curl_reap

Reap the web.

curl_reap is one small Python library that puts three scraping superpowers behind a single friendly API: browser-grade TLS impersonation, self-healing selectors that survive markup changes, and a concurrent crawl engine.

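| Feature | curl_cffi | Scrapy | Scrapling | curl_reap |
| --- | --- | --- | --- | --- |
| Real browser TLS / JA3 | yes | no | partial | yes |
| Parser built in | no | yes | yes | yes |
| Self-healing selectors | no | no | yes | yes |
| Concurrent crawl engine | no | yes | no | yes |
| AutoThrottle, retries, pipelines | no | yes | no | yes |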

Installation

Requires Python 3.9 or newer. It pulls in curl_cffi, lxml, and cssselect.

pip install curl_reap

Quick start

Fetch and parse

A one-shot fetch parses like parsel, but the request carries a genuine Chrome fingerprint at the TLS layer.

import curl_reap as reap

page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())
print(page.css_first("small.author::text"))

Crawl with concurrency

import curl_reap as reap

class Quotes(reap.Spider):
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, page):
        for q in page.css("div.quote"):
            yield {"text": q.css_first("span.text::text"),
                   "author": q.css_first("small.author::text")}
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            yield reap.Request("https://quotes.toscrape.com" + nxt, self.parse)

items = reap.run(Quotes, concurrency=8, throttle=True)
print(len(items), "items reaped")
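
The spider above builds the next-page URL by string concatenation, which works because the `href` on that site is root-relative (`/page/2/`). For sites that mix relative and absolute links, the standard library's `urllib.parse.urljoin` is a safer choice. A minimal sketch follows; the helper name `absolutize` is hypothetical and not part of curl_reap.

```python
from urllib.parse import urljoin

def absolutize(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)

# Root-relative, document-relative, and already-absolute links all resolve.
print(absolutize("https://quotes.toscrape.com/page/2/", "/page/3/"))
print(absolutize("https://quotes.toscrape.com/tag/love/", "page/2/"))
print(absolutize("https://quotes.toscrape.com/", "https://example.com/x"))
```

Inside `parse`, the same idea would read `urljoin(page.url, nxt)`, since a Response exposes `.url`.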

Transport and impersonation

Every request goes out with a real browser TLS/JA3 and HTTP/2 fingerprint (powered by curl_cffi). This is what gets you past fingerprint-based bot detection that blocks stock Python clients.

reap.get(url, impersonate="chrome124", headers=None, timeout=30, retries=2, **kw) -> Response

One-shot GET on a shared impersonating session. reap.post and reap.fetch work the same way. Any extra keyword is passed straight to curl_cffi (params, data, json, proxies, cookies, and so on).

reap.Session(impersonate="chrome124", headers=None, timeout=30, retries=2, proxies=None, **kw)

A reusable client that keeps cookies and connection pooling. Methods: .get(url, **kw), .post(url, **kw), .request(method, url, **kw), .close().

from curl_reap import Session

s = Session(impersonate="safari17_0", headers={"Accept-Language": "en"}, retries=3)
r = s.get("https://httpbin.org/headers")
print(r.status, r.ok, r.url)
print(r.json())
s.close()

The Response object

A Response behaves like a page: .css / .xpath / .find_by_text pass through to a parser. Attributes: .status, .ok, .url, .headers, .text, .content, .meta. Methods: .selector(), .json().

Available impersonate targets follow curl_cffi: chrome99 through chrome124, edge99 and edge101, Safari 15 and 17 variants (for example safari17_0), and more. Pick a recent one (chrome124) for the broadest coverage.

Selectors and parsing

The parser is built on lxml with parsel-style ergonomics. CSS supports the ::text and ::attr(name) pseudo elements.

Selector.css(query, auto_match=False, identifier=None, storage=None) -> SelectorList

Select by CSS. Returns a SelectorList of child Selectors, or of strings when the query ends in ::text or ::attr(...).

page.css("h1::text").get()              # first match, or None
page.css("a.item::attr(href)").getall() # every href as a list
page.css("div.card")                    # SelectorList of element Selectors
page.css_first("span.price::text", default="n/a")

# nested selection
for card in page.css("div.card"):
    title = card.css_first(".title::text")
    link  = card.css_first("a::attr(href)")
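
One way to picture the `::text` and `::attr(name)` suffixes: the pseudo-element is peeled off the end of the query, the remaining CSS selects elements, and the suffix decides whether text or an attribute is extracted. The sketch below shows only the peeling step. `split_pseudo` is a hypothetical helper, and this is not a description of how curl_reap's parser is actually written.

```python
import re

# Matches a trailing ::text or ::attr(name) pseudo-element.
_PSEUDO = re.compile(r"::(text|attr\(\s*([^)\s]+)\s*\))\s*$")

def split_pseudo(query):
    """Split a query into (base_css, kind, attr_name).

    kind is "text", "attr", or None for a plain element query.
    """
    m = _PSEUDO.search(query)
    if not m:
        return query.strip(), None, None
    base = query[:m.start()].strip()
    if m.group(1) == "text":
        return base, "text", None
    return base, "attr", m.group(2)

print(split_pseudo("h1::text"))
print(split_pseudo("a.item::attr(href)"))
print(split_pseudo("div.card"))
```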

XPath, regex, and values

page.xpath("//div[@class='quote']/span/text()").getall()
page.re(r"price:\s*([\d.]+)")

el = page.css_first("a.author")
el.text          # text content, stripped
el.attr("href")  # one attribute
el.attrib        # dict of all attributes
el.html          # outer HTML of the element

SelectorList helpers

.get(default=None), .getall(), .text(), .attr(name), and .css(query) (mapped across every element in the list).

Self-healing selectors

The biggest maintenance cost in scraping is selectors breaking when a site renames a class or reshuffles its DOM. curl_reap can save a structural signature of an element once, then relocate it later by scoring every node against that signature.

Selector.save(identifier, storage=None)

Persist this element's signature (tag, classes, id, attributes, text, DOM path, position) to a small JSON store. Default store: .reap_selectors.json.

Selector.css(query, auto_match=True, identifier="...", storage=None)

If the normal query finds nothing and auto_match=True, curl_reap relocates the element from the saved signature for that identifier.

import curl_reap as reap

# 1. remember the button once, from today's markup
page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button")

# 2. weeks later the class is renamed to "purchase-cta".
#    the old selector misses, but auto_match relocates it:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href"))   # found anyway

Under the hood

Exposed for advanced use: reap.signature(element), reap.similarity(sig_a, sig_b) (returns 0..1), reap.save(id, element, storage), and reap.relocate(id, tree, storage, threshold=0.6). Matching weights tag, class set (Jaccard), id, attributes, text, and DOM path.
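
To make the scoring idea concrete, here is a self-contained sketch of a weighted signature comparison. It is an illustration under assumptions, not curl_reap's implementation: the signature is a plain dict, the weights are invented for the example, the DOM path is treated as a bag of tag names, and the names `sketch_similarity` and `sketch_relocate` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Illustrative weights only; they sum to 1.0 so the score stays in 0..1.
WEIGHTS = {"tag": 0.25, "classes": 0.20, "id": 0.10,
           "attrs": 0.15, "text": 0.15, "path": 0.15}

def sketch_similarity(sig_a, sig_b):
    """Score two signature dicts between 0 and 1."""
    parts = {
        "tag": 1.0 if sig_a["tag"] == sig_b["tag"] else 0.0,
        "classes": jaccard(sig_a["classes"], sig_b["classes"]),
        # Two elements with no id at all also count as equal here.
        "id": 1.0 if sig_a["id"] == sig_b["id"] else 0.0,
        "attrs": jaccard(sig_a["attrs"].items(), sig_b["attrs"].items()),
        "text": jaccard(sig_a["text"].lower().split(),
                        sig_b["text"].lower().split()),
        "path": jaccard(sig_a["path"], sig_b["path"]),
    }
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)

def sketch_relocate(saved, candidates, threshold=0.6):
    """Return the best-scoring candidate, or None if it is below threshold."""
    best = max(candidates, key=lambda c: sketch_similarity(saved, c), default=None)
    if best is None or sketch_similarity(saved, best) < threshold:
        return None
    return best

saved = {"tag": "a", "classes": {"buy-btn"}, "id": None,
         "attrs": {"href": "/cart/add/42"}, "text": "Buy now",
         "path": ["html", "body", "div", "a"]}
renamed = {"tag": "a", "classes": {"purchase-cta"}, "id": None,
           "attrs": {"href": "/cart/add/99"}, "text": "Buy now",
           "path": ["html", "body", "div", "a"]}
nav = {"tag": "a", "classes": {"nav-link"}, "id": None,
       "attrs": {"href": "/"}, "text": "Home",
       "path": ["html", "body", "nav", "a"]}

match = sketch_relocate(saved, [nav, renamed])
print(match["classes"] if match else None)
```

With these weights, the renamed button still clears the 0.6 threshold on tag, text, and path alone, while the navigation link does not.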

Finders

Selector.find_by_text(text, partial=True, first=False) -> SelectorList

Find elements whose text contains (or equals, if partial=False) the given string.

Selector.find_similar(sample, threshold=0.6, limit=None) -> SelectorList

Given one element, return others that are structurally similar. Great for grabbing every repeated card or row without writing a brittle class selector.

signin = page.find_by_text("Sign in", first=True).get()

first_card = page.css_first("div.card")
all_cards  = page.find_similar(first_card)   # every card like it
print(len(all_cards))

Geocoding

Many scraped listings carry a name and an area but no coordinates (the lat/lon lives behind a wall, or simply is not in the markup). A naive single query fails often: the full marketing name is not in the gazetteer, and a bare district query collapses every listing onto one centroid. The Geocoder fixes that with a cascade and tells you the precision.

reap.Geocoder(min_interval=1.1, cache=".reap_geocode.json", min_importance=0.0)

A cascading geocoder over OpenStreetMap Nominatim, cached and rate limited.

Geocoder.geocode(name=None, area=None, city=None, country=None) -> dict | None

Tries the raw name, then the name with generic suffixes stripped (Serviced Apartments, Aparthotel, by <brand>, ...), then the district, then the city. Stops at the first hit and returns {lat, lon, precision, importance, display_name, query}. precision is "name", "district", or "city", so you keep the exact ones and drop coarse centroids.

from curl_reap import Geocoder

geo = Geocoder()
hit = geo.geocode(name="Chelsea Cloisters Serviced Apartments",
                  area="Kensington", city="London", country="United Kingdom")
# {'lat': 51.49.., 'lon': -0.17.., 'precision': 'name', ...}

if hit and hit["precision"] in ("name", "district"):
    listing["lat"], listing["lon"] = hit["lat"], hit["lon"]   # keep accurate ones only

Be a good citizen: Nominatim asks for a real User-Agent and a max of one request per second. The Geocoder sets both by default and caches every lookup to disk, so repeat runs are instant.
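
The cascade itself can be sketched without any network access. The example below is an assumption-laden illustration rather than the Geocoder's actual code: the suffix patterns, the way city and country are appended to each query, and the names `strip_suffixes`, `cascade_queries`, and `sketch_geocode` are all hypothetical. The lookup function is passed in, so a dict stands in for Nominatim, and the coordinates are placeholders.

```python
import re

# Hypothetical suffix patterns; the library's real list is not given here.
_SUFFIXES = [r"\s+by\s+\S.*$", r"\s+serviced apartments$", r"\s+aparthotel$"]

def strip_suffixes(name):
    """Remove generic marketing suffixes from a listing name."""
    out = name.strip()
    for pat in _SUFFIXES:
        out = re.sub(pat, "", out, flags=re.IGNORECASE).strip()
    return out

def cascade_queries(name=None, area=None, city=None, country=None):
    """Yield (precision, query) pairs from most to least specific."""
    def join(*parts):
        return ", ".join(p for p in parts if p)

    candidates = []
    if name:
        candidates.append(("name", join(name, city, country)))
        candidates.append(("name", join(strip_suffixes(name), city, country)))
    if area:
        candidates.append(("district", join(area, city, country)))
    if city:
        candidates.append(("city", join(city, country)))

    seen = set()
    for precision, query in candidates:
        if query not in seen:          # skip repeats, e.g. nothing was stripped
            seen.add(query)
            yield precision, query

def sketch_geocode(lookup, **fields):
    """Return the first hit from lookup(query), tagged with its precision."""
    for precision, query in cascade_queries(**fields):
        hit = lookup(query)
        if hit:
            return {**hit, "precision": precision, "query": query}
    return None

# A dict stands in for the gazetteer; coordinates are placeholders.
gazetteer = {"Chelsea Cloisters, London, United Kingdom": {"lat": 51.49, "lon": -0.17}}
hit = sketch_geocode(gazetteer.get,
                     name="Chelsea Cloisters Serviced Apartments",
                     area="Kensington", city="London", country="United Kingdom")
print(hit["precision"], hit["query"])
```

Here the raw name misses, the stripped name hits, and the result is still tagged with "name" precision because the cascade never had to fall back to the district.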

Crawl engine

A small concurrent engine. A Spider yields items (dicts) and more Request objects; the engine handles concurrency, dedup, retries, AutoThrottle, and pipelines.

class Spider: name, start_urls, def parse(self, page)

Set start_urls and implement parse. Override start() if you need custom seed requests.

Request(url, callback=None, method="GET", meta=None, **kw)

A pending fetch plus its parse callback. meta is carried onto the Response. Extra kwargs go to the transport.

reap.run(spider, concurrency=8, retries=2, throttle=True, delay=0.0, impersonate="chrome124", pipelines=None, dedup=True, on_item=None, max_pages=None) -> list

Run a Spider (class or instance) to completion and return the scraped items. reap.Reaper(...) is the same with a .run() method and a .stats dict (requests, items, errors, dropped).

import curl_reap as reap

reaper = reap.Reaper(MySpider(), concurrency=12, throttle=True,
                     on_item=lambda it: print("got", it["title"]))
items = reaper.run()
print(reaper.stats)   # {'requests': 40, 'items': 380, 'errors': 0, 'dropped': 4}
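
For readers curious how such an engine fits together, the loop below is a deliberately small sketch of concurrency plus request dedup using only the standard library. It is not the Reaper's implementation: retries, throttling, and pipelines are left out, URLs are yielded as plain strings instead of Request objects, and `crawl`, `fake_fetch`, and the in-memory `SITE` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def crawl(start_urls, fetch, parse, concurrency=8, max_pages=None):
    """Fetch pages concurrently; parse(url, page) yields dicts (items) or str (URLs)."""
    seen = set(start_urls)            # URLs already scheduled (dedup)
    frontier = list(start_urls)       # URLs waiting for a worker
    items, fetched = [], 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        pending = {}                  # future -> url
        while frontier or pending:
            # Fill free worker slots from the frontier.
            while frontier and len(pending) < concurrency:
                if max_pages is not None and fetched >= max_pages:
                    frontier.clear()  # page budget spent: stop scheduling
                    break
                url = frontier.pop(0)
                pending[pool.submit(fetch, url)] = url
                fetched += 1
            if not pending:
                break
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for fut in done:
                url = pending.pop(fut)
                for out in parse(url, fut.result()):
                    if isinstance(out, dict):
                        items.append(out)          # scraped item
                    elif out not in seen:
                        seen.add(out)              # new URL: schedule once
                        frontier.append(out)
    return items

# An in-memory "site" so the sketch runs without network access.
SITE = {
    "p1": {"quotes": ["a", "b"], "links": ["p2"]},
    "p2": {"quotes": ["c"], "links": ["p1", "p3"]},   # p1 is a repeat
    "p3": {"quotes": ["d"], "links": []},
}

def fake_fetch(url):
    return SITE[url]

def parse(url, page):
    for q in page["quotes"]:
        yield {"text": q, "url": url}
    yield from page["links"]

items = crawl(["p1"], fake_fetch, parse, concurrency=2)
print(sorted(i["text"] for i in items))
```

The `seen` set is what keeps the back-link from p2 to p1 from causing a second fetch.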

AutoThrottle

With throttle=True, the engine watches response latency and adapts a per-request delay so the crawl stays polite and avoids IP bans, the way Scrapy's AutoThrottle does. Target delay is roughly average latency divided by your concurrency. Set delay= for a fixed floor or throttle=False to go full speed.
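
The "average latency divided by concurrency" rule can be sketched in a few lines. This is an illustration under assumptions, not the engine's code: the class name `Throttle` is hypothetical, and smoothing the latency with an exponential moving average is a choice made for this example.

```python
class Throttle:
    """Adaptive delay: smoothed latency / concurrency, never below a floor."""

    def __init__(self, concurrency, floor=0.0, smoothing=0.5):
        self.concurrency = concurrency
        self.floor = floor            # plays the role of a fixed minimum delay
        self.smoothing = smoothing    # weight given to the newest latency sample
        self.avg_latency = None

    def observe(self, latency):
        """Record one response latency (seconds) and return the new delay."""
        if self.avg_latency is None:
            self.avg_latency = latency
        else:
            a = self.smoothing
            self.avg_latency = a * latency + (1 - a) * self.avg_latency
        return self.delay

    @property
    def delay(self):
        if self.avg_latency is None:
            return self.floor
        return max(self.floor, self.avg_latency / self.concurrency)

t = Throttle(concurrency=8)
for latency in (0.8, 1.6, 0.4):
    print(round(t.observe(latency), 3))
```

When the server slows down, the delay grows with it; when responses speed up again, the delay decays back toward the floor.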

Pipelines

Each scraped item flows through a chain. A pipeline that returns None drops the item.

  • DedupPipeline(key=None) drops duplicates (by a field, or the whole item). Added automatically unless you pass dedup=False.
  • JsonLinesPipeline(path) streams items to a .jsonl file as they arrive.
  • CsvPipeline(path) writes a CSV on close.
  • Subclass Pipeline and implement open / process / close for your own.

import curl_reap as reap
from curl_reap import Pipeline, DedupPipeline, JsonLinesPipeline

class PriceToFloat(Pipeline):
    def process(self, item):
        item["price"] = float(item["price"].strip("$"))
        return item   # or return None to drop it

reap.run(MySpider, pipelines=[
    DedupPipeline(key="url"),
    PriceToFloat(),
    JsonLinesPipeline("out.jsonl"),
])
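
The "return None to drop" contract is easy to see in isolation. The sketch below models each stage as a plain callable rather than a Pipeline subclass, so it runs with the standard library alone. `run_chain`, `make_dedup`, and `price_to_float` are hypothetical names for this illustration.

```python
def run_chain(item, stages):
    """Pass item through each stage; stop as soon as one returns None."""
    for stage in stages:
        item = stage(item)
        if item is None:
            return None               # dropped: later stages never see it
    return item

def make_dedup(key):
    """Build a stage that drops items whose item[key] was already seen."""
    seen = set()
    def stage(item):
        if item[key] in seen:
            return None
        seen.add(item[key])
        return item
    return stage

def price_to_float(item):
    """Convert a price string such as '$9.50' to a float."""
    return {**item, "price": float(item["price"].strip("$"))}

stages = [make_dedup("url"), price_to_float]
raw = [
    {"url": "/a", "price": "$9.50"},
    {"url": "/b", "price": "$12"},
    {"url": "/a", "price": "$9.50"},   # duplicate of the first item
]
kept = [out for out in (run_chain(it, stages) for it in raw) if out is not None]
print(kept)
```

Putting the dedup stage first means duplicates are discarded before any conversion work is done on them.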

API reference

| Symbol | What it does |
| --- | --- |
| reap.get / post / fetch | One-shot impersonating request, returns Response |
| reap.Session | Reusable client (cookies, pooling, retries) |
| reap.Response | Fetched page; css/xpath pass through, plus json() |
| reap.Selector | css, css_first, xpath, re, find_by_text, find_similar, save, text, attr, html |
| reap.SelectorList | get, getall, text, attr, css |
| reap.Spider | start_urls + parse; the crawl unit |
| reap.Request | url + callback + meta |
| reap.run / reap.Reaper | Run a spider with concurrency, throttle, pipelines |
| reap.Pipeline and friends | Dedup, JsonLines, Csv, or custom |
| reap.signature / similarity / save / relocate | Self-healing selector internals |

Why not just use X

curl_cffi is the transport only: brilliant TLS impersonation, no parser, no crawl engine. curl_reap uses it underneath and adds the rest.

Scrapy is a heavyweight framework with great orchestration, but its default downloader has a recognizable, non-browser TLS fingerprint that many modern bot walls block. curl_reap brings the impersonation into the engine.

Scrapling pioneered self-healing selectors (the idea curl_reap borrows) and ships a stealth browser. curl_reap keeps the adaptive parsing and the lightweight footprint, without bundling a browser.

Legal and acceptable use

This is not legal advice. Web scraping legality depends on jurisdiction, what you scrape, and how. You are responsible for your use:

  • Check robots.txt and read the target site's Terms of Service.
  • Do not circumvent technical access controls. Looking like a browser is one thing; defeating a bot-detection challenge a site deployed to block you raises CFAA / DMCA exposure in the US and analogous statutes elsewhere.
  • Handle personal data lawfully (GDPR, CCPA, or your local equivalent apply to you as the data controller).
  • Respect copyright: extract facts and data, do not mirror copyrighted text or media wholesale.
  • Throttle (throttle=True) and identify yourself with a descriptive User-Agent so operators can reach you.

Provided under the MIT License, "as is", with no warranty and no liability. Full notice and details: LEGAL.md.