Reap the web.
curl_reap is one small Python library that puts three scraping superpowers behind a single friendly API: browser-grade TLS impersonation, self-healing selectors that survive markup changes, and a concurrent crawl engine.
| Feature | curl_cffi | Scrapy | Scrapling | curl_reap |
|---|---|---|---|---|
| Real browser TLS / JA3 | yes | no | partial | yes |
| Parser built in | no | yes | yes | yes |
| Self-healing selectors | no | no | yes | yes |
| Concurrent crawl engine | no | yes | no | yes |
| AutoThrottle, retries, pipelines | no | yes | no | yes |
Installation
Requires Python 3.9 or newer. It pulls in curl_cffi, lxml, and cssselect.
pip install curl_reap

Quick start
Fetch and parse
A one-shot fetch parses like parsel, but the request carries a genuine Chrome fingerprint at the TLS layer.
import curl_reap as reap
page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())
print(page.css_first("small.author::text"))

Crawl with concurrency
import curl_reap as reap

class Quotes(reap.Spider):
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, page):
        for q in page.css("div.quote"):
            yield {"text": q.css_first("span.text::text"),
                   "author": q.css_first("small.author::text")}
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            yield reap.Request("https://quotes.toscrape.com" + nxt, self.parse)

items = reap.run(Quotes, concurrency=8, throttle=True)
print(len(items), "items reaped")

Transport and impersonation
Every request goes out with a real browser TLS/JA3 and HTTP/2 fingerprint (powered by curl_cffi). This is what gets you past fingerprint-based bot detection that blocks stock Python clients.
reap.get: a one-shot GET on a shared impersonating session. reap.post and reap.fetch work the same way. Any extra keyword argument is passed straight to curl_cffi (params, data, json, proxies, cookies, and so on).
Session: a reusable client that keeps cookies and pools connections. Methods: .get(url, **kw), .post(url, **kw), .request(method, url, **kw), .close().
from curl_reap import Session
s = Session(impersonate="safari17_0", headers={"Accept-Language": "en"}, retries=3)
r = s.get("https://httpbin.org/headers")
print(r.status, r.ok, r.url)
print(r.json())
s.close()

The Response object
A Response behaves like a page: .css / .xpath / .find_by_text pass through to a parser. Attributes: .status, .ok, .url, .headers, .text, .content, .meta. Methods: .selector(), .json().
Selectors and parsing
The parser is built on lxml with parsel-style ergonomics. CSS supports the ::text and ::attr(name) pseudo-elements.
page.css(query) selects by CSS. It returns a SelectorList of child Selectors, or of strings when the query ends in ::text or ::attr(...).
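To make the pseudo-element rule concrete, here is a minimal sketch of how a query could be split into its plain CSS part and its ::text or ::attr(name) suffix. The helper name split_pseudo and the regular expression are assumptions made for illustration; this README does not show how curl_reap parses queries internally.

```python
import re

# Hypothetical pattern: a trailing ::text or ::attr(name) after the CSS part.
_PSEUDO = re.compile(r"^(?P<css>.*?)::(?P<kind>text|attr)(?:\((?P<name>[^)]+)\))?\s*$")


def split_pseudo(query):
    """Return (css, kind, attr_name); kind is 'text', 'attr', or None."""
    query = query.strip()
    match = _PSEUDO.match(query)
    if not match:
        # No recognised pseudo-element: the query selects elements.
        return query, None, None
    kind, name = match.group("kind"), match.group("name")
    if kind == "attr" and not name:
        raise ValueError("::attr needs an attribute name, e.g. ::attr(href)")
    if kind == "text" and name:
        raise ValueError("::text takes no argument")
    return match.group("css").strip(), kind, name


element_query = split_pseudo("div.card")
text_query = split_pseudo("h1::text")
attr_query = split_pseudo("a.item::attr(href)")
```

A query without a suffix yields elements, while the other two yield strings, which mirrors the Selector-versus-string distinction described above.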
page.css("h1::text").get() # first match, or None
page.css("a.item::attr(href)").getall() # every href as a list
page.css("div.card") # SelectorList of element Selectors
page.css_first("span.price::text", default="n/a")
# nested selection
for card in page.css("div.card"):
    title = card.css_first(".title::text")
    link = card.css_first("a::attr(href)")

XPath, regex, and values
page.xpath("//div[@class='quote']/span/text()").getall()
page.re(r"price:\s*([\d.]+)")
el = page.css_first("a.author")
el.text # text content, stripped
el.attr("href") # one attribute
el.attrib # dict of all attributes
el.html # outer HTML of the element

SelectorList helpers
.get(default=None), .getall(), .text(), .attr(name), and .css(query) (mapped across every element in the list).
Self-healing selectors
The biggest maintenance cost in scraping is selectors breaking when a site renames a class or reshuffles its DOM. curl_reap can save a structural signature of an element once, then relocate it later by scoring every node against that signature.
Calling .save() on an element with an identifier persists that element's signature (tag, classes, id, attributes, text, DOM path, position) to a small JSON store. Default store: .reap_selectors.json.
If the normal query finds nothing and auto_match=True, curl_reap relocates the element from the saved signature for that identifier.
import curl_reap as reap
# 1. remember the button once, from today's markup
page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button")
# 2. weeks later the class is renamed to "purchase-cta".
# the old selector misses, but auto_match relocates it:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href")) # found anyway

Under the hood
Exposed for advanced use: reap.signature(element), reap.similarity(sig_a, sig_b) (returns 0..1), reap.save(id, element, storage), and reap.relocate(id, tree, storage, threshold=0.6). Matching weights tag, class set (Jaccard), id, attributes, text, and DOM path.
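The scoring idea can be sketched in plain Python. This is an illustration under stated assumptions, not curl_reap's implementation: the signature layout (a dict), the weights, and the names score_signatures and best_match are all hypothetical. Only the ingredients (tag, Jaccard over the class set, id, attributes, text, DOM path) and the 0.6 threshold come from the description above.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical weights chosen for the example; they sum to 1.0.
WEIGHTS = {"tag": 0.20, "classes": 0.25, "id": 0.15,
           "attrs": 0.15, "text": 0.15, "path": 0.10}


def score_signatures(sig_a, sig_b):
    """Weighted similarity in the range 0..1 between two signature dicts."""
    parts = {
        "tag": 1.0 if sig_a["tag"] == sig_b["tag"] else 0.0,
        "classes": jaccard(sig_a["classes"], sig_b["classes"]),
        "id": 1.0 if sig_a["id"] == sig_b["id"] else 0.0,
        "attrs": jaccard(sig_a["attrs"].items(), sig_b["attrs"].items()),
        "text": 1.0 if sig_a["text"] == sig_b["text"] else 0.0,
        # Simplification: the path is compared as a set of tag names, ignoring order.
        "path": jaccard(sig_a["path"], sig_b["path"]),
    }
    return sum(WEIGHTS[key] * parts[key] for key in WEIGHTS)


def best_match(saved, candidates, threshold=0.6):
    """Return the highest-scoring candidate, or None if it scores below threshold."""
    best = max(candidates, key=lambda c: score_signatures(saved, c), default=None)
    if best is None or score_signatures(saved, best) < threshold:
        return None
    return best


saved = {"tag": "a", "classes": ["buy-btn"], "id": None,
         "attrs": {"href": "/cart/add"}, "text": "Buy now",
         "path": ["html", "body", "div", "a"]}
# Same element after the class was renamed.
renamed = dict(saved, classes=["purchase-cta"])
# An unrelated node on the same page.
price = {"tag": "span", "classes": ["price"], "id": None,
         "attrs": {}, "text": "$10", "path": ["html", "body", "div", "span"]}

found = best_match(saved, [price, renamed])
```

With these weights the renamed button loses only the class component, so it still clears the threshold, while the unrelated node does not.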
Finders
find_by_text: find elements whose text contains (or equals, if partial=False) the given string.
find_similar: given one element, return others that are structurally similar. Great for grabbing every repeated card or row without writing a brittle class selector.
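As a rough illustration of both finders, the sketch below uses only the standard library's xml.etree.ElementTree. The functions by_text and similar_to are hypothetical stand-ins, and "structurally similar" is assumed here to mean "same tag and same sequence of child tags"; curl_reap's own criteria are not specified in this README.

```python
import xml.etree.ElementTree as ET


def by_text(root, needle, partial=True):
    """Elements whose own text contains (or, with partial=False, equals) needle."""
    hits = []
    for el in root.iter():
        text = (el.text or "").strip()
        if not text:
            continue
        if (needle in text) if partial else (text == needle):
            hits.append(el)
    return hits


def shape(el):
    """Assumed structural fingerprint: the tag plus the tags of direct children."""
    return (el.tag, tuple(child.tag for child in el))


def similar_to(root, ref):
    """Every other element in the tree with the same shape as ref."""
    return [el for el in root.iter() if el is not ref and shape(el) == shape(ref)]


doc = ET.fromstring(
    "<ul>"
    "<li class='card'><h2>A</h2><a href='/a'>View</a></li>"
    "<li class='card hot'><h2>B</h2><a href='/b'>View</a></li>"
    "<li class='ad'><p>Sponsored</p></li>"
    "</ul>"
)

view_links = by_text(doc, "View")
exact_miss = by_text(doc, "Sponsor", partial=False)
first_card = doc.find("li")
other_cards = similar_to(doc, first_card)
```

Note that the second card is matched even though its class attribute differs, which is the point of matching on structure rather than on a class selector.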
signin = page.find_by_text("Sign in", first=True).get()
first_card = page.css_first("div.card")
all_cards = page.find_similar(first_card) # every card like it
print(len(all_cards))

Geocoding
Many scraped listings carry a name and an area but no coordinates (the lat/lon lives behind a wall, or simply is not in the markup). A naive single query fails often: the full marketing name is not in the gazetteer, and a bare district query collapses every listing onto one centroid. The Geocoder fixes that with a cascade and tells you the precision.
Geocoder is a cascading geocoder over OpenStreetMap Nominatim, cached and rate limited.
Its geocode method tries the raw name, then the name with generic suffixes stripped (Serviced Apartments, Aparthotel, by <brand>, ...), then the district, then the city. It stops at the first hit and returns {lat, lon, precision, importance, display_name, query}. precision is "name", "district", or "city", so you keep the exact ones and drop coarse centroids.
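The cascade itself is easy to sketch offline. In the example below the lookup function is injected, so a plain dict stands in for Nominatim; the suffix pattern, the function names, and the gazetteer entries are assumptions for illustration and not curl_reap's actual code or data.

```python
import re

# Assumed suffix patterns; the list curl_reap really uses is not shown in this README.
_GENERIC = re.compile(r"\s+(?:serviced apartments|aparthotel|by\s+\S.*)$", re.IGNORECASE)


def strip_generic(name):
    """Remove trailing generic suffixes until none is left."""
    previous = None
    while previous != name:
        previous = name
        name = _GENERIC.sub("", name).strip()
    return name


def candidate_queries(name, area, city, country):
    """Return (query, precision) pairs, most specific first, without duplicates."""
    tail = ", ".join(part for part in (city, country) if part)
    candidates = []

    def add(head, precision):
        query = ", ".join(part for part in (head, tail) if part)
        if query and all(query != q for q, _ in candidates):
            candidates.append((query, precision))

    if name:
        add(name, "name")
        stripped = strip_generic(name)
        if stripped:
            add(stripped, "name")
    if area:
        add(area, "district")
    add(None, "city")
    return candidates


def geocode_cascade(name, area, city, country, lookup):
    """Try each candidate in order; stop at the first hit and tag its precision."""
    for query, precision in candidate_queries(name, area, city, country):
        hit = lookup(query)
        if hit:
            return {**hit, "precision": precision, "query": query}
    return None


# Made-up gazetteer standing in for a real geocoding service.
GAZETTEER = {
    "Chelsea Cloisters, London, United Kingdom": {"lat": 51.49, "lon": -0.17},
    "London, United Kingdom": {"lat": 51.51, "lon": -0.13},
}

exact = geocode_cascade("Chelsea Cloisters Serviced Apartments", "Kensington",
                        "London", "United Kingdom", GAZETTEER.get)
coarse = geocode_cascade("Unknown Place Aparthotel", "Nowhere",
                         "London", "United Kingdom", GAZETTEER.get)
```

Returning the precision alongside the coordinates is what lets the caller discard the city-level fallback instead of stacking every unknown listing on one centroid.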
from curl_reap import Geocoder
geo = Geocoder()
listing = {"name": "Chelsea Cloisters Serviced Apartments"}  # the scraped record to enrich
hit = geo.geocode(name="Chelsea Cloisters Serviced Apartments",
                  area="Kensington", city="London", country="United Kingdom")
# {'lat': 51.49.., 'lon': -0.17.., 'precision': 'name', ...}
if hit and hit["precision"] in ("name", "district"):
    listing["lat"], listing["lon"] = hit["lat"], hit["lon"] # keep accurate ones only

Crawl engine
A small concurrent engine. A Spider yields items (dicts) and more Request objects; the engine handles concurrency, dedup, retries, AutoThrottle, and pipelines.
Spider: set start_urls and implement parse. Override start() if you need custom seed requests.
Request: a pending fetch plus its parse callback. meta is carried onto the Response. Extra kwargs go to the transport.
reap.run: run a Spider (class or instance) to completion and return the scraped items. reap.Reaper(...) is the same with a .run() method and a .stats dict (requests, items, errors, dropped).
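For readers who want a mental model of the engine, here is a minimal sketch of the core loop: a thread pool fetches pages concurrently, callbacks yield items or follow-up requests, and a seen-set deduplicates URLs. It is an assumption-laden illustration, not curl_reap's engine: Req, crawl, and fake_fetch are hypothetical names, the fetch function is injected so the example runs offline, and retries, throttling, and pipelines are left out.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from dataclasses import dataclass
from typing import Callable


@dataclass
class Req:
    """Hypothetical stand-in for a request: a URL plus its parse callback."""
    url: str
    callback: Callable


def crawl(start_urls, parse, fetch, concurrency=4):
    seen = set()
    items = []
    stats = {"requests": 0, "items": 0, "errors": 0}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        pending = {}

        def schedule(req):
            # URL-level dedup: never fetch the same URL twice.
            if req.url not in seen:
                seen.add(req.url)
                pending[pool.submit(fetch, req.url)] = req

        for url in start_urls:
            schedule(Req(url, parse))

        while pending:
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                req = pending.pop(future)
                stats["requests"] += 1
                try:
                    page = future.result()
                except Exception:
                    stats["errors"] += 1
                    continue
                # Callbacks run on the main thread; only fetching is concurrent.
                for out in req.callback(page):
                    if isinstance(out, Req):
                        schedule(out)
                    else:
                        items.append(out)
                        stats["items"] += 1
    return items, stats


# A two-page fake site whose second page links back to the first.
SITE = {
    "/page/1": {"quotes": ["a", "b"], "next": "/page/2"},
    "/page/2": {"quotes": ["c"], "next": "/page/1"},
}


def fake_fetch(url):
    return SITE[url]


def parse(page):
    for quote in page["quotes"]:
        yield {"text": quote}
    if page["next"]:
        yield Req(page["next"], parse)


items, stats = crawl(["/page/1"], parse, fake_fetch)
```

The loop ends when no fetches are pending, and the back-link from the second page is dropped by the seen-set, so the crawl terminates after two requests.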
import curl_reap as reap
reaper = reap.Reaper(MySpider(), concurrency=12, throttle=True,
                     on_item=lambda it: print("got", it["title"]))
items = reaper.run()
print(reaper.stats) # {'requests': 40, 'items': 380, 'errors': 0, 'dropped': 4}

AutoThrottle
With throttle=True, the engine watches response latency and adapts a per-request delay so the crawl stays polite and avoids IP bans, the way Scrapy's AutoThrottle does. Target delay is roughly average latency divided by your concurrency. Set delay= for a fixed floor or throttle=False to go full speed.
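The delay rule stated above (average latency divided by concurrency, never below a fixed floor) can be sketched as follows. The class name AdaptiveDelay and the sliding window of recent latencies are assumptions for illustration; the README does not say how curl_reap averages latency.

```python
from collections import deque


class AdaptiveDelay:
    """Hypothetical throttle: delay = max(floor, mean recent latency / concurrency)."""

    def __init__(self, concurrency, floor=0.0, window=20):
        self.concurrency = concurrency
        self.floor = floor
        # Assumed design choice: only the last `window` latencies are averaged.
        self.latencies = deque(maxlen=window)

    def record(self, latency_seconds):
        self.latencies.append(latency_seconds)

    @property
    def delay(self):
        if not self.latencies:
            return self.floor
        average = sum(self.latencies) / len(self.latencies)
        return max(self.floor, average / self.concurrency)


throttle = AdaptiveDelay(concurrency=8)
throttle.record(0.5)
throttle.record(1.5)
adaptive = throttle.delay  # mean latency 1.0 s spread over 8 slots

floored = AdaptiveDelay(concurrency=8, floor=0.5)
floored.record(1.0)
```

When the server slows down, the recorded latencies rise and the delay grows with them; the floor keeps the crawl from speeding up past a fixed minimum when responses are fast.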
Pipelines
Each scraped item flows through a chain. A pipeline that returns None drops the item.
- DedupPipeline(key=None) drops duplicates (by a field, or the whole item). Added automatically unless you pass dedup=False.
- JsonLinesPipeline(path) streams items to a .jsonl file as they arrive.
- CsvPipeline(path) writes a CSV on close.
- Subclass Pipeline and implement open / process / close for your own.
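The chain semantics (each stage receives the previous stage's output, and a None return drops the item) can be sketched without the library. BasePipeline, DedupByKey, and run_chain below are hypothetical stand-ins written for this example; they only mirror the open / process / close contract described in the list above.

```python
import json


class BasePipeline:
    """Stand-in base class with the open / process / close hooks."""

    def open(self):
        pass

    def process(self, item):
        return item

    def close(self):
        pass


class DedupByKey(BasePipeline):
    """Drop items already seen, keyed by one field or by the whole item."""

    def __init__(self, key=None):
        self.key = key
        self.seen = set()

    def process(self, item):
        # Whole-item fingerprint assumes JSON-serialisable values.
        mark = item[self.key] if self.key else json.dumps(item, sort_keys=True)
        if mark in self.seen:
            return None
        self.seen.add(mark)
        return item


class PriceToFloat(BasePipeline):
    def process(self, item):
        item["price"] = float(item["price"].strip("$"))
        return item


def run_chain(items, pipelines):
    """Pass every item through each stage in order; None from any stage drops it."""
    for stage in pipelines:
        stage.open()
    kept = []
    try:
        for item in items:
            for stage in pipelines:
                item = stage.process(item)
                if item is None:
                    break
            else:
                kept.append(item)
    finally:
        for stage in pipelines:
            stage.close()
    return kept


scraped = [
    {"url": "/a", "price": "$1.50"},
    {"url": "/a", "price": "$1.50"},
    {"url": "/b", "price": "$2"},
]
kept = run_chain(scraped, [DedupByKey(key="url"), PriceToFloat()])
```

Order matters: placing the dedup stage first means later stages never spend work on items that are going to be dropped.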
import curl_reap as reap
from curl_reap import Pipeline, DedupPipeline, JsonLinesPipeline

class PriceToFloat(Pipeline):
    def process(self, item):
        item["price"] = float(item["price"].strip("$"))
        return item # or return None to drop it

# MySpider is your own reap.Spider subclass
reap.run(MySpider, pipelines=[
    DedupPipeline(key="url"),
    PriceToFloat(),
    JsonLinesPipeline("out.jsonl"),
])

API reference
| Symbol | What it does |
|---|---|
| reap.get / post / fetch | One-shot impersonating request, returns Response |
| reap.Session | Reusable client (cookies, pooling, retries) |
| reap.Response | Fetched page; css/xpath pass through, plus json() |
| reap.Selector | css, css_first, xpath, re, find_by_text, find_similar, save, text, attr, html |
| reap.SelectorList | get, getall, text, attr, css |
| reap.Spider | start_urls + parse; the crawl unit |
| reap.Request | url + callback + meta |
| reap.run / reap.Reaper | Run a spider with concurrency, throttle, pipelines |
| reap.Pipeline and friends | Dedup, JsonLines, Csv, or custom |
| reap.signature / similarity / save / relocate | Self-healing selector internals |
Why not just use X
curl_cffi is the transport only: brilliant TLS impersonation, no parser, no crawl engine. curl_reap uses it underneath and adds the rest.
Scrapy is a heavyweight framework with great orchestration, but its default downloader has a recognizable non-browser TLS fingerprint that modern bot walls block. curl_reap brings the impersonation into the engine.
Scrapling pioneered self-healing selectors (the idea curl_reap borrows) and ships a stealth browser. curl_reap keeps the adaptive parsing and the lightweight footprint, without bundling a browser.
Legal and acceptable use
This is not legal advice. Web scraping legality depends on jurisdiction, what you scrape, and how. You are responsible for your use:
- Check robots.txt and read the target site's Terms of Service.
- Do not circumvent technical access controls. Looking like a browser is one thing; defeating a bot-detection challenge a site deployed to block you raises CFAA / DMCA exposure in the US and analogous statutes elsewhere.
- Handle personal data lawfully (GDPR, CCPA, or your local equivalent apply to you as the data controller).
- Respect copyright: extract facts and data, do not mirror copyrighted text or media wholesale.
- Throttle (throttle=True) and identify yourself with a descriptive User-Agent so operators can reach you.
Provided under the MIT License, "as is", with no warranty and no liability. Full notice and details: LEGAL.md.