EngTools Blog

Server-Side Security Risks When Fetching User-Supplied URLs

March 30, 2026 · 11 min read

Building a tool that fetches URLs sounds simple. The user types in a URL, your server downloads it, processes the content, and returns results. Straightforward, right?

Not even close.

The moment a server agrees to fetch a URL on behalf of an anonymous internet user, it opens a door that attackers are very familiar with. We recently shipped the Sitemap URL Extractor on EngTools — a tool that downloads and parses XML sitemaps from any website you point it at. While building it, we worked through several non-obvious security threats that any developer building a similar “fetch-for-me” service needs to understand.

This post explains each threat in plain terms, what could go wrong, and how we addressed them.


What’s the Attack Surface?

When you build a service that fetches URLs, your server becomes a proxy. The attacker is no longer limited to what they can do from their own laptop — they can now issue HTTP requests from your server’s IP address, against targets they couldn’t reach directly.

This is the entire class of vulnerabilities we’re dealing with.


Threat #1: Server-Side Request Forgery (SSRF)

What is it?

SSRF (Server-Side Request Forgery) is the most fundamental threat in this category. The idea is simple: an attacker provides a URL that points not to a legitimate public website, but instead to an internal network resource that only your server can access.

Consider a scenario: your server is hosted on AWS. Amazon automatically exposes a metadata endpoint at http://169.254.169.254/latest/meta-data/ — accessible only from within the same instance. This endpoint contains sensitive credentials, IAM tokens, and configuration data that can be used to compromise your entire cloud infrastructure.

If a user submits that URL to your fetch tool, and your server naively fetches it, the response gets sent straight back to the attacker. They’ve just stolen your cloud credentials without ever needing to breach your network.

The same attack applies to:

  • Localhost services (http://localhost:6379 might expose an unprotected Redis database)
  • Internal admin panels (http://10.0.0.1/admin)
  • Other machines on the same private network

How we mitigate it

Before our server fetches any URL, we run it through a multi-stage safety validator:

1. Protocol Scheme Validation: Most runtime fetch clients support multiple protocols, not just HTTP. If we don’t enforce http:// or https://, an attacker could submit file:///etc/passwd or use gopher:// and dict:// to interact with internal TCP services. We strictly allowlist only http and https.

2. IP and Hostname Blocklisting: We reject any hostname that maps to a private address:

  • Localhost (127.0.0.1, ::1, localhost)
  • Private subnet ranges (10.x.x.x, 192.168.x.x, 172.16-31.x.x)
  • Link-local addresses (169.254.x.x) — this includes the AWS metadata endpoint
  • Internal DNS suffixes (.local, .internal)
  • IPv6 private ranges (unique local addresses, fc00::/7)

3. URL Confusion / Escaping Defense: Naive regex blocklists often fail because IP addresses can be represented in multiple ways. http://0177.0.0.1/ (octal) or http://[::ffff:169.254.169.254]/ (IPv4-mapped IPv6) can breeze past simple string matching. We rely on the core URL parser to normalize hostnames before evaluating strings, though as you’ll see below, string-based IP blocking is fundamentally imperfect until combined with socket-level pinning.

If the URL fails these checks, we return an error immediately and the fetch never happens.


Threat #2: DNS Rebinding

What is it?

DNS Rebinding is a more sophisticated cousin of SSRF, and it’s designed specifically to bypass the kind of URL validation described above. Here’s why it works.

When you request http://evil.attacker.com, your server checks the domain’s IP address and validates it. If it resolves to 1.2.3.4 (a normal public IP), the check passes. But here’s the trick: the attacker controls their own DNS server. They set the TTL (Time To Live) on their DNS record to zero — meaning the result is never cached.

What happens next:

  1. Your server resolves evil.attacker.com → gets 1.2.3.4 ✅ (passes validation)
  2. The attacker immediately flips their DNS record to point at 169.254.169.254.
  3. Your server opens the actual HTTP connection, implicitly resolving the domain a second time → it now connects to 169.254.169.254 → and fetches AWS credentials!

This is a classic Time-of-Check vs. Time-of-Use (TOCTOU) race condition. Because the safety check and the actual network fetch are two separate operations, the IP address can change between them within a single HTTP request.

How we mitigate it

This is the hardest SSRF vector to defeat, and it requires venturing into Node.js networking internals.

You cannot fix this at the HTTP layer, and inspecting redirects does nothing to stop it (since the IP swap happens underneath the HTTP protocol). The only true mitigation is to ensure the DNS resolution happens exactly once.

The correct sequence is:

  1. Manually resolve the hostname to an IP address yourself.
  2. Validate that specific IP against your private address blocklist.
  3. Make the socket connection directly to that validated IP address, while manually passing the original hostname in the Host HTTP header so the target server handles virtual hosting correctly.

In Node.js, this means passing a custom lookup function to a custom http.Agent, or relying on a library like happy-eyeballs with a hardened resolver.

We are evaluating a custom HTTP Agent with pinned DNS resolution across our fetch utilities to close this TOCTOU window completely. For now, while our URL blocklist stops naive SSRF, DNS Rebinding remains the final boss of this threat model.


Threat #3: Open Redirect SSRF

What is it?

Imagine your server makes a request to a perfectly safe URL: https://legitimate-site.com/sitemap.xml. The server at legitimate-site.com responds with:

HTTP/1.1 302 Found
Location: http://169.254.169.254/latest/meta-data/

If your fetch client blindly follows redirects (which most HTTP clients do by default), your server just fetched credentials from the metadata endpoint — even though the original URL looked totally innocent.

This is an Open Redirect SSRF. The attacker doesn’t need to provide a malicious URL directly. They just need a public website that will redirect your server somewhere private. These redirect chains can be multiple hops deep.

How we mitigate it

We rewrote our fetchWithTimeout utility to manually walk redirect chains rather than using redirect: 'follow'. For every 3xx response, we:

  1. Extract the Location header
  2. Resolve it against the current URL (handling relative redirects)
  3. Run it through our SSRF blocklist (isPrivateUrl())
  4. Only then make the next request

If any hop in the redirect chain tries to send us to a private address, we throw an error and abort. We also cap the number of redirects at 5 to prevent infinite redirect loops.
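The redirect-walking loop can be sketched with Node 18+'s built-in fetch. The isPrivateUrl name comes from the post, but its implementation is assumed here, and this is a simplified illustration rather than our actual fetchWithTimeout:

```javascript
async function fetchFollowingSafeRedirects(startUrl, isPrivateUrl, maxRedirects = 5) {
  let current = startUrl;
  for (let hop = 0; hop <= maxRedirects; hop++) {
    // Every hop — including the first — goes through the SSRF blocklist.
    if (isPrivateUrl(current)) throw new Error(`blocked private URL: ${current}`);

    // 'manual' hands us the raw 3xx instead of silently following it.
    const res = await fetch(current, { redirect: 'manual' });

    if (res.status >= 300 && res.status < 400) {
      const location = res.headers.get('location');
      if (!location) throw new Error('redirect without Location header');
      // Resolve relative redirects against the current URL.
      current = new URL(location, current).toString();
      continue;
    }
    return res; // terminal (non-3xx) response
  }
  throw new Error(`too many redirects (> ${maxRedirects})`);
}
```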


Threat #4: XML External Entity (XXE) Injection

What is it?

Once you’ve securely fetched a file, you still have to process the response. For a Sitemap Extractor, this means parsing XML.

Standard XML specification includes a feature called External Entities, which allows an XML document to define variables that are loaded from external URIs (including local files) when parsed. If you use a standard XML parser with default settings (like libxml) and feed it an untrusted sitemap containing <!ENTITY xxe SYSTEM "file:///etc/shadow">, the parser will read your server’s local password file and embed it into the output.

When your tool returns the “extracted URLs” back to the attacker, it’ll dump your server’s sensitive local files right onto their screen.

How we mitigate it

We completely avoid DOM-based XML parsers for untrusted input.

Instead of parsing the document into a full XML DOM (which carries all the historical baggage and vulnerabilities of the XML spec), our extractor runs a tightly constrained regular expression (/<loc>(.*?)<\/loc>/gi) over the raw text buffer.

By treating the response strictly as a dumb text stream rather than a structured XML document, we eliminate the entire class of XXE attacks. The regex only ever extracts text sandwiched exactly between <loc> tags, ignoring <!DOCTYPE> and <!ENTITY> declarations entirely.
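A minimal sketch of this extraction approach (the function name is illustrative; the regex is the one from the post):

```javascript
// Pull every <loc>…</loc> value out of a sitemap treated as plain text.
// DOCTYPE and ENTITY declarations are inert: they never match the pattern.
function extractLocs(xmlText) {
  const urls = [];
  const locRe = /<loc>(.*?)<\/loc>/gi;
  let m;
  while ((m = locRe.exec(xmlText)) !== null) {
    urls.push(m[1].trim());
  }
  return urls;
}
```

Feeding it a sitemap that declares an external entity simply yields the `<loc>` values; the entity payload is never resolved because nothing ever interprets it.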


Threat #5: Memory Exhaustion via Binary Files

What is it?

This is less of a security attack and more of a denial of service and resource abuse vector. Our tool accepts a URL and downloads the file to process it. What if someone submits the URL of a 4GB video file, or a binary database dump?

Our server would dutifully begin streaming the response body into memory, consuming more and more RAM until either the process crashes or the machine runs out of memory. If we’re running multiple concurrent requests (which we are — we process sitemaps in parallel), this could take down the entire API server.

How we mitigate it

We mitigate this in two stages: an optimization check, and a hard security boundary.

1. The Courtesy Optimization (Content-Type Pre-check): Before downloading the body into memory, we inspect the Content-Type response header. If the server explicitly says it’s returning an image (image/png) or video, we abort the request immediately. This is a fast, cheap optimization to drop garbage early.

2. The Hard Security Boundary (Byte Cap): However, any server can lie about its Content-Type. A malicious server could return Content-Type: text/xml and then stream 10GB of random bytes. Therefore, the Content-Type check is not a security control.

Our actual defense is a 50MB hard cap on the streamed body. If the accumulated byteLength crosses this threshold during the download, we abort the stream and drop the connection. We never allow untrusted responses to consume unbounded RAM.
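A sketch of the byte cap, assuming a fetch()-style Response whose body is a web ReadableStream (the function name and cap constant are illustrative):

```javascript
const MAX_BYTES = 50 * 1024 * 1024; // 50MB hard cap

async function readBodyCapped(response, maxBytes = MAX_BYTES) {
  const reader = response.body.getReader();
  const chunks = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    if (total > maxBytes) {
      await reader.cancel(); // drop the connection; stop buffering immediately
      throw new Error(`response exceeded ${maxBytes} bytes`);
    }
    chunks.push(value);
  }
  return Buffer.concat(chunks);
}
```

The check runs per chunk during streaming, so a lying Content-Type header never gets the chance to fill RAM before anyone looks at the size.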


Threat #6: Algorithmic Complexity Abuse

What is it?

This is a subtle performance attack. Our sitemap crawler maintains a queue of URLs to fetch. When we discover a new child sitemap URL inside a sitemap index file, we need to check whether we’ve already queued that URL to avoid duplicate work.

The naive implementation does this:

if (!queue.find(q => q.url === childUrl)) {
  queue.push({ url: childUrl });
}

Array.find() scans the entire array from the beginning every time. If the queue has 1,000 items and we’re checking 1,000 new URLs, that’s 1,000,000 comparisons — O(n²) complexity. An attacker who crafts a sitemap index with thousands of near-duplicate entries could make our server’s CPU spin for a long time on a single request, denying service to other users.

How we mitigate it

We replaced the queue.find() call with an O(1) Set lookup. We already maintained a processedUrls Set for deduplication purposes. Adding a URL to this Set before pushing to the queue means the “have I seen this?” check is a single hash lookup regardless of how many URLs we’ve processed:

// Before: O(n) per check
if (!queue.find(q => q.url === childUrl)) { ... }

// After: O(1) per check  
if (!processedUrls.has(childUrl)) {
  processedUrls.add(childUrl);
  queue.push({ url: childUrl });
}

Rate Limiting and Daily Quotas

Beyond the specific threats above, we apply two layers of general abuse prevention:

  • Per-minute rate limiting: Each IP address can make at most 15 extraction requests per minute. Implemented via @fastify/rate-limit.
  • Daily quota: Each IP is limited to 30 extractions per day. This prevents sustained automated scraping that tries to use our infrastructure as a free proxy.

These don’t stop a determined attacker with many IPs (botnets exist), but they’re highly effective at stopping casual abuse and keep our infrastructure costs predictable.
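The per-minute layer is handled by @fastify/rate-limit, but the daily quota can be as simple as a fixed-window counter keyed by IP. A dependency-free sketch (in-memory and illustrative; a real multi-instance deployment would need a shared store like Redis):

```javascript
// Returns an `allow(ip)` function that permits `limit` hits per window.
function makeQuota(limit, windowMs) {
  const hits = new Map(); // ip -> { count, windowStart }
  return function allow(ip, now = Date.now()) {
    const entry = hits.get(ip);
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(ip, { count: 1, windowStart: now }); // new window for this IP
      return true;
    }
    entry.count += 1;
    return entry.count <= limit;
  };
}

// 30 extractions per IP per day, as described above.
const dailyQuota = makeQuota(30, 24 * 60 * 60 * 1000);
```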


The Broader Lesson

Building a URL-fetching service feels like boring infrastructure work. In practice, it means you’re operating a proxy server on the open internet, and every input is a potential vector.

The threat model is:

  1. Everything you can reach from your server, an attacker can reach by proxy
  2. Validation checks at request time can be subverted by DNS tricks
  3. Resource consumption is an attack surface, not just a performance concern
  4. Algorithm complexity is a security property, not just an optimization concern

The defenses aren’t complicated individually. The challenge is knowing which ones you need before you find out the hard way that you didn’t.


The Sitemap URL Extractor — and all the security hardening discussed here — is live and free at engtools.dev/tools/sitemap-url-extractor. All of the server utilities are open and available to review.