Introducing kreuzcrawl v0.3.0

rust dev.to

kreuzcrawl began as a Rust core with bindings for ten languages. v0.3.0 ships fourteen, adds a tiered WAF-aware dispatch engine, cuts peak streaming memory from ~2.5 GB to ~20 MB, and enables SSRF defense across every outbound call path by default. It is the first release we consider API-stable.

This post covers what changed, why each decision was made, and what the harder engineering problems looked like from the inside.

At a glance

| Area | v0.2.0 | v0.3.0 |
| --- | --- | --- |
| Language bindings | 10 | 14 (+Dart, Kotlin/Android, Swift, Zig) |
| Peak streaming memory | ~2.5 GB | ~20 MB |
| SSRF protection | opt-in | on by default |
| Dispatch model | static HTTP / bypass / browser | tiered, signal-driven escalation |
| WAF fingerprints | | 35 across 8 vendors |
| Fingerprint hot-reload | | lock-free (ArcSwap), 500 ms debounce |
| MCP tools | partial | 1:1 with CLI, safety-annotated |
| CLI subcommands | scrape, crawl | + batch-scrape, batch-crawl, download, citations |
| Robots / sitemap parsers | engine-internal | public modules |
| API stability | preview | stable |

Four new language bindings

v0.2.0 shipped Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, and WebAssembly.
v0.3.0 adds Dart, Kotlin/Android, Swift, and Zig — bringing the total to fourteen.

None of the per-language glue is written by hand. Every binding is generated from the Rust core by alef, our polyglot binding generator.
The Dart and Kotlin/Android packages bind through the C FFI layer (kreuzcrawl-ffi) via dart:ffi and JNI respectively. Swift binds through clang. Zig uses @cImport against the same C header.

The generation pipeline also hardened in this release: the Docker publish matrix now builds each architecture natively rather than via QEMU emulation, the Dart build no longer requires the Flutter SDK for pub.dev publishes, Swift artifactbundle checksums are injected automatically, and the Elixir/PHP/Ruby releases preserve their lock files through the source-publish step.

=== "Python"

```sh
pip install kreuzcrawl
```

=== "Node.js"

```sh
npm install @xberg/kreuzcrawl
```

=== "Rust"

```sh
cargo add kreuzcrawl
```

=== "Go"

```sh
go get github.com/xberg-io/kreuzcrawl/packages/go
```

=== "Java"

```xml
<dependency>
  <groupId>io.xberg.kreuzcrawl</groupId>
  <artifactId>kreuzcrawl</artifactId>
  <version>0.3.0</version>
</dependency>
```

=== "Kotlin (Android)"

```groovy
implementation("io.xberg.kreuzcrawl.android:kreuzcrawl-android:0.3.0")
```

=== "C#"

```sh
dotnet add package Kreuzcrawl
```

=== "Ruby"

```sh
gem install kreuzcrawl
```

=== "PHP"

```sh
composer require xberg-io/kreuzcrawl
```

=== "Elixir"

```elixir
{:kreuzcrawl, "~> 0.3"}
```

=== "Dart"

```sh
dart pub add kreuzcrawl
```

=== "Swift"

```swift
// Package.swift
.package(url: "https://github.com/xberg-io/kreuzcrawl", from: "0.3.0")
```

=== "Zig"

```sh
zig fetch --save https://github.com/xberg-io/kreuzcrawl/archive/v0.3.0.tar.gz
```

=== "WebAssembly"

```sh
npm install @xberg/kreuzcrawl-wasm
```

Memory-bounded streaming

crawl_stream() and batch_crawl_stream() previously accumulated every page result in memory before the caller received any of them. On a large crawl — tens of thousands of pages, each carrying extracted text, metadata, links, and images — the peak working set reached approximately 2.5 GB.

The fix is a change in ownership: each page result is moved into CrawlEvent::Page and emitted immediately. The caller receives it, processes it, and drops it. The engine never holds more than the current in-flight pages, bounded by the concurrency setting.

```rust
// The event type (unchanged externally; behavior changed internally)
pub enum CrawlEvent {
    Page { result: Box<CrawlPageResult> }, // (1)
    Error { url: String, error: String },
    Complete { pages_crawled: usize },
}
```

1. CrawlPageResult is boxed, moved into the variant, and dropped when the caller's loop moves past it. The engine holds no reference after the send.

```python
# Python: pages are processed and released one at a time
from kreuzcrawl import crawl_stream

async for event in crawl_stream(engine, "https://example.com"):
    if event.type == "page":
        process(event)  # event is dropped after this scope
```

Peak working set on a 10,000-page crawl with default concurrency (16): ~20 MB.

The non-streaming crawl() is unchanged — it accumulates by contract, because callers need the complete CrawlResult. The two code paths are kept separate. Merging them would push the accumulation pattern onto callers, which is the same problem moved one level up.
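The ownership hand-off described above can be sketched with nothing but the standard library. This is a minimal illustration, not the engine's implementation: `crawl_stream_sketch`, `consume`, and the trimmed-down `CrawlPageResult` and `CrawlEvent` are hypothetical stand-ins, and a bounded `std::sync::mpsc` channel plays the role of the concurrency limit.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

// Hypothetical, trimmed-down stand-ins for the crate's types.
pub struct CrawlPageResult {
    pub url: String,
    pub text: String,
}

pub enum CrawlEvent {
    Page { result: Box<CrawlPageResult> },
    Complete { pages_crawled: usize },
}

/// Spawns a producer that emits one event per page. The bounded channel
/// blocks the producer once `capacity` events are waiting, so the number of
/// queued results never grows with the size of the crawl.
pub fn crawl_stream_sketch(pages: usize, capacity: usize) -> Receiver<CrawlEvent> {
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        for i in 0..pages {
            let result = Box::new(CrawlPageResult {
                url: format!("https://example.com/{i}"),
                text: "x".repeat(1024),
            });
            // Ownership moves into the event; the producer keeps nothing.
            if tx.send(CrawlEvent::Page { result }).is_err() {
                return; // receiver hung up
            }
        }
        let _ = tx.send(CrawlEvent::Complete { pages_crawled: pages });
    });
    rx
}

/// Consumes the stream one event at a time.
/// Returns (pages seen, total reported by the Complete event).
pub fn consume(rx: Receiver<CrawlEvent>) -> (usize, usize) {
    let mut seen = 0;
    let mut reported = 0;
    for event in rx {
        match event {
            CrawlEvent::Page { result } => {
                seen += 1;
                let _len = result.url.len() + result.text.len(); // "process" the page
            } // `result` is dropped here, before the next event is received
            CrawlEvent::Complete { pages_crawled } => reported = pages_crawled,
        }
    }
    (seen, reported)
}
```

Calling `consume(crawl_stream_sketch(10_000, 16))` walks every page while only a handful of results exist at any moment, which is the property the streaming path relies on.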

!!! tip "Choosing between crawl() and crawl_stream()"
    Use crawl() when you need the full result set in memory. Use crawl_stream() for
    large crawls, progress tracking, or when you process results one at a time. The memory
    difference is significant at scale.

SSRF defense on by default

Web crawlers take URLs as input and make HTTP requests — the exact primitive an attacker needs to reach internal services. Every path that accepts a URL now validates it against an SsrfPolicy before making the request: scrape(), crawl(), batch_crawl(), sitemap fetches, robots.txt fetches, asset downloads, and link enqueue.

What is refused

| Category | Ranges |
| --- | --- |
| Loopback | 127.0.0.0/8, ::1/128 |
| Private (RFC 1918) | 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 |
| Link-local / cloud metadata | 169.254.0.0/16 (incl. 169.254.169.254), fe80::/10 |
| Unspecified | 0.0.0.0/8 |
| Multicast | 224.0.0.0/4, ff00::/8 |
| IPv6 unique-local | fc00::/7 |
| Non-http(s) schemes | file://, ftp://, gopher://, … |
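To make the address ranges concrete, here is a minimal sketch of a deny check built only on `std::net`. The function names are hypothetical and this is not the crate's `is_ip_permitted`; it simply encodes the ranges listed above. Treating IPv4-mapped IPv6 addresses by their IPv4 form is an added assumption, not something the table states.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Returns true when `ip` falls in one of the refused ranges listed above.
pub fn is_denied(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_denied_v4(v4),
        IpAddr::V6(v6) => is_denied_v6(v6),
    }
}

fn is_denied_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()           // 127.0.0.0/8
        || ip.is_private()     // 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
        || ip.is_link_local()  // 169.254.0.0/16, incl. 169.254.169.254
        || ip.octets()[0] == 0 // 0.0.0.0/8
        || ip.is_multicast()   // 224.0.0.0/4
}

fn is_denied_v6(ip: Ipv6Addr) -> bool {
    // Assumption: IPv4-mapped addresses (::ffff:a.b.c.d) are judged by their IPv4 form.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_denied_v4(v4);
    }
    let first = ip.segments()[0];
    ip.is_loopback()                  // ::1/128
        || ip.is_multicast()          // ff00::/8
        || (first & 0xfe00) == 0xfc00 // fc00::/7 unique-local
        || (first & 0xffc0) == 0xfe80 // fe80::/10 link-local
}
```

The prefix checks for `fc00::/7` and `fe80::/10` are written as bit masks so the sketch does not depend on newer standard-library helpers.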

DNS rebinding mitigation

Checking the hostname at validation time is insufficient. An attacker can register
evil.example.com, serve a public IP at validation, then update DNS to point to
192.168.1.1 once the check passes.

The policy resolves every hostname via DNS and validates all returned IP addresses. If any resolved IP is in the deny list, the request is refused — regardless of what the others resolve to.

```rust
// From kreuzcrawl/src/net/ssrf.rs
let addresses: Vec<IpAddr> = tokio::net::lookup_host(&lookup_addr).await?
    .map(|addr| addr.ip())
    .collect();

for ip in &addresses {
    if !is_ip_permitted(*ip, policy) {
        return Err(SsrfError::DeniedByPolicy {
            reason: classify_private_ip(*ip),
        });
    }
}
```

Redirect-chain re-validation

Each 30x Location header is re-resolved and re-validated before the next hop is taken. This closes the redirect-chain attack: a public URL that redirects to http://169.254.169.254/latest/meta-data/ is refused at the second hop. Redirect following is bounded by SsrfPolicy::max_redirects (default: 5).
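The per-hop check can be sketched as a bounded loop. This is an illustration under stated assumptions rather than the crate's code: `follow_redirects`, `Hop`, and `FollowError` are hypothetical names, and the `validate` and `fetch` closures stand in for the SSRF policy check and the HTTP client.

```rust
/// What a single (simulated) fetch returned.
pub enum Hop {
    Redirect(String), // 30x with a Location header
    Final(u16),       // any terminal status code
}

#[derive(Debug, PartialEq)]
pub enum FollowError {
    Denied { url: String, hop: usize },
    TooManyRedirects,
}

/// Follows redirects, calling `validate` on every URL before it is fetched.
/// At most `max_redirects` redirects are followed.
pub fn follow_redirects(
    start: &str,
    max_redirects: usize,
    validate: impl Fn(&str) -> bool,
    fetch: impl Fn(&str) -> Hop,
) -> Result<(String, u16), FollowError> {
    let mut url = start.to_string();
    let mut hop = 0; // number of redirects followed so far
    loop {
        // Re-validate on every hop, not only for the first URL.
        if !validate(&url) {
            return Err(FollowError::Denied { url, hop });
        }
        match fetch(&url) {
            Hop::Final(status) => return Ok((url, status)),
            Hop::Redirect(next) => {
                if hop == max_redirects {
                    return Err(FollowError::TooManyRedirects);
                }
                hop += 1;
                url = next;
            }
        }
    }
}
```

Because validation sits at the top of the loop, a public URL that redirects to the metadata endpoint is rejected before the second request is ever issued.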

Opting out

```sh
# Environment variable: applies to every crawler in the process
export KREUZCRAWL_ALLOW_PRIVATE_NETWORK=1
```

```rust
// Per-config builder: applies to a single CrawlConfig
CrawlConfig::builder()
    .allow_private_networks(true)
    .ssrf_allowlist_host(HostMatcher::Cidr("10.0.0.0/8".into()))
    .build()
```

!!! warning "Wasm targets"
    On wasm32, SSRF checking is disabled: the browser's fetch API and same-origin
    policy are the enforcing boundary, and tokio::net::lookup_host is unavailable in
    that context.

WAF-aware tiered dispatch

Before v0.3.0, the dispatch decision was static: HTTP, or bypass-vendor, or browser — chosen at config time and fixed for the duration of the crawl. This had an obvious cost problem: routing every request through a bypass provider because 5% of pages are blocked is expensive.

The new engine chains tiers and escalates based on per-attempt signals.

Tiers and escalation strategies

```rust
pub enum Tier {
    Http,    // plain HTTP fetch
    Bypass,  // vendor-managed bypass (Zyte, ScrapingBee, Bright Data, …)
    Browser, // headless Chrome via Chromiumoxide
}

pub enum EscalationStrategy {
    None,              // HTTP only; surface all failures
    BrowserOnly,       // HTTP → Browser on block  ← default
    BypassFirst,       // always use bypass (legacy behaviour)
    BypassOnly,        // HTTP → Bypass on block; no browser
    BypassThenBrowser, // HTTP → Bypass → Browser; maximum resilience
}
```

All dispatch enums are #[non_exhaustive] — new variants can be added without breaking downstream match arms.
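One way to read the strategies is as an ordered chain of tiers per strategy. The sketch below assumes exactly that; `tier_chain` and `escalate` are hypothetical helpers, the enums are local copies, and the single-element chain for `BypassFirst` is an interpretation of "always use bypass", not confirmed behaviour.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Http,
    Bypass,
    Browser,
}

#[derive(Debug, Clone, Copy)]
pub enum EscalationStrategy {
    None,
    BrowserOnly,
    BypassFirst,
    BypassOnly,
    BypassThenBrowser,
}

/// Ordered tiers a request may pass through under each strategy.
pub fn tier_chain(strategy: EscalationStrategy) -> &'static [Tier] {
    match strategy {
        EscalationStrategy::None => &[Tier::Http],
        EscalationStrategy::BrowserOnly => &[Tier::Http, Tier::Browser],
        // Assumption: "always use bypass" means the chain starts and ends at Bypass.
        EscalationStrategy::BypassFirst => &[Tier::Bypass],
        EscalationStrategy::BypassOnly => &[Tier::Http, Tier::Bypass],
        EscalationStrategy::BypassThenBrowser => &[Tier::Http, Tier::Bypass, Tier::Browser],
    }
}

/// Next tier to try after `current` was blocked; `None` when the chain is
/// exhausted or `current` is not part of the strategy's chain.
pub fn escalate(strategy: EscalationStrategy, current: Tier) -> Option<Tier> {
    let chain = tier_chain(strategy);
    let pos = chain.iter().position(|t| *t == current)?;
    chain.get(pos + 1).copied()
}
```

Under this reading, the default `BrowserOnly` strategy turns a blocked HTTP attempt into a browser attempt and then surfaces the failure if the browser is blocked too.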

WAF detection: Aho-Corasick over a TOML corpus

Detecting a WAF challenge page requires inspecting both response headers and body.
A naïve approach — one regex per fingerprint per response — scales as O(fingerprints × body_length). With 35 fingerprints that's expensive per page.

All body-pattern signals across all fingerprints are compiled into a single Aho-Corasick automaton at startup. One scan of the response body returns the set of matching pattern indices; each maps to a fingerprint via a flat Vec<usize>.

```rust
pub struct Rules {
    fingerprints: Vec<Fingerprint>,
    automaton: AhoCorasick,         // single automaton over all patterns
    pattern_to_fp: Vec<usize>,      // AC pattern index → fingerprint index
}
```

The body scan is capped at 100 KB (CHALLENGE_BODY_LIMIT). WAF challenge pages are small and fit inside that window, whereas real content pages overwhelmingly exceed it, so the cap bounds scan cost on large pages without hiding challenge signals.

Header signals are checked first (constant time per fingerprint). If a fingerprint fires on headers alone, the body scan is skipped entirely.
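The control flow (headers first, then a capped body scan, then a pattern-to-fingerprint lookup) can be sketched without the automaton. In the sketch below a naive substring search stands in for Aho-Corasick, so it shows the data layout and short-circuiting but not the single-pass performance. The `Fingerprint` fields, vendor names, and `classify` method are hypothetical.

```rust
const CHALLENGE_BODY_LIMIT: usize = 100 * 1024; // 100 KB cap from the text

pub struct Fingerprint {
    pub vendor: &'static str,
    /// Header whose presence alone identifies the vendor (hypothetical signal).
    pub header: Option<&'static str>,
}

pub struct Rules {
    pub fingerprints: Vec<Fingerprint>,
    pub patterns: Vec<&'static str>, // all body patterns, flattened
    pub pattern_to_fp: Vec<usize>,   // pattern index -> fingerprint index
}

impl Rules {
    /// Returns the vendor of the first fingerprint that fires, if any.
    pub fn classify(&self, headers: &[(&str, &str)], body: &[u8]) -> Option<&'static str> {
        // 1. Header signals first; a hit skips the body scan entirely.
        for fp in &self.fingerprints {
            if let Some(name) = fp.header {
                if headers.iter().any(|(k, _)| k.eq_ignore_ascii_case(name)) {
                    return Some(fp.vendor);
                }
            }
        }
        // 2. Scan only the capped prefix of the body.
        let prefix = &body[..body.len().min(CHALLENGE_BODY_LIMIT)];
        for (idx, pat) in self.patterns.iter().enumerate() {
            let needle = pat.as_bytes();
            // Naive substring search standing in for the Aho-Corasick pass.
            if !needle.is_empty() && prefix.windows(needle.len()).any(|w| w == needle) {
                return Some(self.fingerprints[self.pattern_to_fp[idx]].vendor);
            }
        }
        None
    }
}
```

The flat `pattern_to_fp` vector is what lets a matched pattern index resolve to its fingerprint in one array lookup, whichever search algorithm produced the index.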

Current corpus: 35 fingerprints across Cloudflare (10), DataDome (6), PerimeterX (5), Imperva (5), AWS WAF (4), F5 (2), Akamai (1), and generic corroborating patterns (2).

Hot-reload for live environments

The fingerprint corpus is a TOML file (rules/waf_fingerprints.toml). In Kubernetes deployments, it is managed as a ConfigMap — operators update signatures without restarting the process.

The compiled Rules is wrapped in arc_swap::ArcSwap. TomlClassifier::watch()
starts a filesystem watcher that atomically swaps the rule set when the file changes:

```rust
pub struct TomlClassifier {
    rules: ArcSwap<Rules>,
}

impl TomlClassifier {
    pub fn watch(self: &Arc<Self>, path: impl AsRef<Path>) -> Result<WatchHandle, WatchError> {
        watch::start_watch(Arc::clone(self), path.as_ref())
    }
}
```

Events are debounced 500 ms — this handles both editors that write via tmpfile+rename and the Kubernetes ConfigMap atomic projection mechanism, which produces the same sequence of filesystem events.
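The effect of the debounce window is easiest to see on a list of timestamps. The sketch below assumes a trailing-edge debounce, where every event restarts the timer and a reload fires once the window passes quietly. The post does not say which variant the watcher uses, and `debounce` is a hypothetical pure function rather than the watcher itself.

```rust
/// Collapses bursts of filesystem events into single reloads.
/// `events_ms` are event timestamps in milliseconds, sorted ascending.
/// Returns the times at which a reload fires: `window_ms` after the last
/// event of each burst.
pub fn debounce(events_ms: &[u64], window_ms: u64) -> Vec<u64> {
    let mut reloads = Vec::new();
    let mut pending: Option<u64> = None; // time the pending reload would fire
    for &t in events_ms {
        if let Some(fire_at) = pending {
            if t >= fire_at {
                reloads.push(fire_at); // quiet period elapsed before this event
            }
        }
        pending = Some(t + window_ms); // every event restarts the timer
    }
    if let Some(fire_at) = pending {
        reloads.push(fire_at);
    }
    reloads
}
```

A tmpfile write, rename, and metadata update arriving within a few milliseconds of each other collapse into one reload, so the rule set is compiled once per logical change.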

Per-domain EWMA state

The engine tracks a block rate per domain using an Exponentially Weighted Moving Average. High block rates promote the starting tier: a domain that has been blocking consistently starts at Bypass or Browser rather than always attempting Http first.
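As a rough sketch of the idea, an EWMA folds each observation into a running rate with `rate = alpha * x + (1 - alpha) * rate`, where `x` is 1 for a blocked attempt and 0 otherwise. The `EwmaBlockRate` type, the smoothing factor, and the promotion thresholds below are all hypothetical; the post does not state the values `EwmaDomainState` uses.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Http,
    Bypass,
    Browser,
}

pub struct EwmaBlockRate {
    alpha: f64,                  // weight of the newest observation
    rates: HashMap<String, f64>, // per-domain smoothed block rate in [0, 1]
}

impl EwmaBlockRate {
    pub fn new(alpha: f64) -> Self {
        Self { alpha, rates: HashMap::new() }
    }

    /// Folds one observation into the domain's smoothed block rate.
    pub fn observe(&mut self, domain: &str, blocked: bool) {
        let x = if blocked { 1.0 } else { 0.0 };
        let alpha = self.alpha;
        let rate = self.rates.entry(domain.to_string()).or_insert(0.0);
        *rate = alpha * x + (1.0 - alpha) * *rate;
    }

    /// Current smoothed block rate; unseen domains start at 0.
    pub fn rate(&self, domain: &str) -> f64 {
        self.rates.get(domain).copied().unwrap_or(0.0)
    }

    /// Hypothetical thresholds: the real cut-offs are not given in the post.
    pub fn starting_tier(&self, domain: &str) -> Tier {
        match self.rate(domain) {
            r if r >= 0.8 => Tier::Browser,
            r if r >= 0.5 => Tier::Bypass,
            _ => Tier::Http,
        }
    }
}
```

Because old observations decay geometrically, a domain that stops blocking drifts back toward the cheap HTTP tier without any explicit reset.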

The DomainStatePort trait is injectable:

```rust
#[async_trait]
pub trait DomainStatePort: Send + Sync + fmt::Debug {
    async fn recommend(&self, domain: &str) -> DomainRecommendation;
    async fn observe(&self, domain: &str, observation: &DomainObservation);
}
```

The default implementation (EwmaDomainState) is wired in automatically.
kreuzberg-cloud replaces it with a distributed store for cross-instance domain intelligence.

Configuring dispatch

```rust
use std::sync::Arc;
use kreuzcrawl::{
    CrawlConfig, DispatchProfile, EscalationStrategy,
    SimpleRetryPolicy, TomlClassifier,
};

let config = CrawlConfig::builder()
    .dispatch(
        DispatchProfile::builder()
            .strategy(EscalationStrategy::BypassThenBrowser)
            .retry_policy(Arc::new(SimpleRetryPolicy::new().with_max_retries(3)))
            .waf_classifier(Arc::new(TomlClassifier::builtin()))
            .build(),
    )
    .build();
```

MCP server at CLI parity

The MCP server now exposes tools 1:1 with the CLI — scrape, batch_scrape, batch_crawl, download, and generate_citations. Earlier releases had partial coverage; v0.3.0 closes the gap.

Safety annotations

Each tool declares three safety properties from the MCP spec:

| Property | Value | Meaning |
| --- | --- | --- |
| read_only | true | does not modify external state |
| destructive | false | does not delete or overwrite anything |
| open_world | true | makes network requests to caller-specified URLs |

open_world: true is the meaningful one. MCP hosts can use it to apply additional
sandboxing or prompt for confirmation before an agent makes outbound requests. The SSRF
policy is the enforcement layer: a request to http://169.254.169.254/ returns a
SsrfPolicyViolation error before any network activity occurs.

Transport

The server runs in two modes depending on how it is invoked:

  • stdio — reads JSON-RPC from stdin, writes to stdout. Used by Claude Desktop, Cursor, and tools that spawn the binary as a subprocess.
  • Streamable HTTP at /mcp — used for service deployments. Enabled when the binary is built with --features api,mcp.

```sh
# stdio mode (subprocess)
kreuzcrawl mcp

# HTTP mode
kreuzcrawl serve  # exposes /mcp alongside the REST API
```

Expanded CLI

Four subcommands complete the CLI's 1:1 mapping with the core and MCP surfaces:

| Command | Description |
| --- | --- |
| batch-scrape <urls…> | Scrape multiple URLs concurrently, emit structured JSON |
| batch-crawl <urls…> | Crawl from multiple seed URLs with shared concurrency budget |
| download <url> | Fetch and save assets to disk (PDF, DOCX, images, …) |
| citations <url> | Extract structured citations and references from a page |
| version | Print version and build metadata |

```sh
# Crawl two seeds, output Markdown
kreuzcrawl batch-crawl \
  https://docs.example.com \
  https://blog.example.com \
  --depth 3 \
  --format markdown
```

Standalone robots.txt and sitemap parsers

kreuzcrawl::robots and kreuzcrawl::sitemap are now public modules, usable without
constructing a crawl engine:

```rust
use kreuzcrawl::robots::{parse_robots_txt, is_path_allowed};
use kreuzcrawl::sitemap::{parse_sitemap_xml, parse_sitemap_index};

// Standalone robots.txt check: both functions are infallible
let rules = parse_robots_txt(robots_body, "Googlebot");
let allowed = is_path_allowed("/private/", &rules);

// Standalone sitemap parse: infallible
let urls = parse_sitemap_xml(sitemap_body);

// Sitemap index (points to child sitemaps)
let index = parse_sitemap_index(index_body);
```

This is useful for compliance tooling, link-graph builders, and crawl planners that need to evaluate robots.txt access rules or enumerate URLs from a sitemap without running a full crawl.
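For readers unfamiliar with how robots.txt rules are evaluated, the sketch below shows the longest-match rule from RFC 9309 on plain path prefixes. It is not the `kreuzcrawl::robots` implementation, whose matching details the post does not describe; `Rule` and `path_allowed` are hypothetical, and wildcard patterns are left out.

```rust
/// One rule from a robots.txt group: `allow` is true for Allow, false for Disallow.
pub struct Rule {
    pub allow: bool,
    pub prefix: &'static str,
}

/// Evaluates `path` against plain-prefix rules: the longest matching prefix
/// wins, Allow wins a tie, and a path with no matching rule is allowed.
/// Wildcards (`*`, `$`) are deliberately not handled in this sketch.
pub fn path_allowed(rules: &[Rule], path: &str) -> bool {
    let mut best: Option<&Rule> = None;
    for rule in rules {
        if rule.prefix.is_empty() || !path.starts_with(rule.prefix) {
            continue; // empty rules and non-matching prefixes are skipped
        }
        let better = match best {
            None => true,
            Some(b) => {
                rule.prefix.len() > b.prefix.len()
                    || (rule.prefix.len() == b.prefix.len() && rule.allow)
            }
        };
        if better {
            best = Some(rule);
        }
    }
    best.map_or(true, |r| r.allow)
}
```

With `Disallow: /private/` and `Allow: /private/press/`, a path under `/private/press/` is permitted because the Allow rule is the more specific match.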

Browser pool and executor injection

BrowserPool, BrowserPoolConfig, NativeBrowserExecutor, and
NativeBrowserExecutorConfig are now public. Callers that run many crawls against the
same targets can construct and warm a pool once and reuse it:

```rust
use kreuzcrawl::{BrowserPool, BrowserPoolConfig, CrawlEngineBuilder};

let pool = BrowserPool::new(BrowserPoolConfig::default()); // sync, returns Arc<BrowserPool>
pool.warm().await?; // pre-open Chrome tabs up to pool capacity

let engine = CrawlEngineBuilder::new(config)
    .with_browser_pool(pool)
    .build()
    .await?;
```

Without pool injection, each engine creates and tears down its own Chrome instance. With pool injection, the browser process persists across crawl jobs — useful when you are running many short crawls in a tight loop.

Observability

v0.3.0 adds two OpenTelemetry counters:

| Counter | Description |
| --- | --- |
| crawl_waf_blocks_total | Number of times a WAF fingerprint fired, labeled by vendor |
| crawl_backend_escalations_total | Number of tier escalations, labeled by source and target tier |

These are emitted unconditionally via opentelemetry::global — no feature gate required. Consumers that do not configure an OTel exporter incur no overhead beyond the counter increment.

The WAF subsystem also gained property-based tests, cargo-fuzz targets covering the TOML corpus loader and Aho-Corasick automaton, and Criterion benchmarks measuring classification throughput at scale.

API stability

This is the first release kreuzcrawl declares stable. The commitments:

  • kreuzcrawl crate public surface is stable at MAJOR. Breaking changes will increment the major version.
  • C FFI ABI (kreuzcrawl-ffi) is stable at MAJOR.MINOR. Struct layouts are frozen at MAJOR.MINOR boundaries.
  • Dispatch enums (EscalationStrategy, EscalationReason, Tier, CrawlError, NetworkErrorKind) are #[non_exhaustive]. New variants are non-breaking; callers outside the crate must include wildcard arms.
  • Generated binding packages track the Rust core version. A binding at 0.3.x targets a Rust core at 0.3.x.

Upgrading from v0.2.0

The public API surface is largely additive. Two changes require attention:

CrawlError::WafBlocked is now a struct variant. The previous unit variant becomes CrawlError::WafBlocked { vendor, message }. Match arms that destructure it need updating:

```rust
// Before
CrawlError::WafBlocked => { /* handle */ }

// After
CrawlError::WafBlocked { vendor, message } => {
    eprintln!("blocked by {vendor}: {message}");
}
```

SimpleRetryPolicy retry count is now exact. The previous implementation had an off-by-one: max_retries=3 produced 2 retries. The API also changed: new() now takes no arguments (defaults to 3 retries); use .with_max_retries(n) to override. Update call sites that were compensating for the off-by-one or passing a count to new().
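To pin down what "exact" means here: with `max_retries = 3`, a persistently failing operation should run four times, one initial attempt plus three retries. The loop below is a minimal sketch of that counting, not `SimpleRetryPolicy` itself; `run_with_retries` is a hypothetical helper.

```rust
/// Runs `op` once, then retries up to `max_retries` more times on failure.
/// `op` receives the 1-based attempt number.
/// Returns the final result together with the number of attempts made.
pub fn run_with_retries<T, E>(
    max_retries: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> (Result<T, E>, u32) {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op(attempt) {
            Ok(value) => return (Ok(value), attempt),
            // `attempt - 1` retries have been spent so far.
            Err(err) if attempt - 1 >= max_retries => return (Err(err), attempt),
            Err(_) => continue,
        }
    }
}
```

An off-by-one like the one fixed in this release typically comes from comparing the attempt count, rather than the retry count, against the limit.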

What's next

v0.3.0 stabilises the core surface. The areas we are actively working on:

  • HostMatcher in language bindings. The allowlist field on SsrfPolicy is currently #[alef(skip)] — the untagged-enum FFI representation is not yet finalized. Expect it in a 0.3.x patch once the tagged-enum form is decided.
  • Proxy rotation provider. The ProxyProvider trait and StaticProxyProvider are public in this release; per-request proxy selection and rotation are landing in 0.3.x.
  • Expanded WAF corpus. The 35-fingerprint TOML corpus is externally reloadable and accepting contributions. Open a PR with a fixture response and a fingerprint block.

Resources

Source: dev.to
