A Minimum-Viable RSS Aggregator in TypeScript with Hono

One JSON endpoint that merges N RSS feeds, deduplicates, sorts, and caches — ~600 lines of TypeScript, four runtime deps, a 147 MB Alpine container, and 46 tests that never touch the network.

📦 GitHub: https://github.com/sen-ltd/rss-aggregator

The problem

Every time I look at the self-hosted feed aggregator landscape I want to close the tab and go do something else. Tiny Tiny RSS is fine, FreshRSS is fine, Miniflux is actually great, but all of them solve a much bigger problem than the one I usually have.

The problem I actually have, over and over, is this:

Give me one JSON endpoint. I pass it a list of RSS feed URLs. It gives me back a deduplicated, date-sorted list of items. That's it.

Who wants that?

  • A Slack integration that posts a digest of 8 newsletters every morning.
  • A personal dashboard that shows the last 20 items across the 30 sources I care about.
  • A "what's new in my stack" summary for the team that stitches the Kubernetes release blog, the Postgres news feed, and the Rust blog into one card.
  • A cron job that feeds an LLM a rolling context window of industry news.

None of these need per-user auth. None of them need a database. None of them need OPML import or unread-item tracking or keyboard shortcuts or a three-pane reader UI. They just need one endpoint that merges feeds, and they want it to be cheap — cheap to deploy, cheap to reason about, cheap to fork when the inevitable "oh, I also want this" arrives.

So I built it. This post walks through the design: the dedup decision tree, the per-host concurrency throttle, why partial failures are a design principle and not a bug, and why I reached for fast-xml-parser instead of a "real" RSS library.

This is the TypeScript entry in the SEN portfolio. There's also a Rust equivalent — feed-parser — which solves the adjacent problem of one feed, as a CLI, with a hand-written zero-dep XML walker. Same itch, different trade-offs, different container size. Read them side-by-side if that's your kind of fun.

The stack

Four runtime dependencies. That's the whole list.

{"dependencies":{"@hono/node-server":"^1.13.0","fast-xml-parser":"^4.5.0","hono":"^4.6.0","zod":"^3.23.8"}}
  • Hono — tiny, fetch-spec compatible, and app.request() lets me integration-test every route without spinning up a server.
  • @hono/node-server — the Node adapter.
  • zod — because runtime validation of urls and limit is the kind of code I refuse to write by hand again.
  • fast-xml-parser — the one opinionated pick. More on this in a moment.

No rss-parser. No feedparser. No node-fetch (Node 20 has fetch built in). No p-queue (I need ~40 lines of custom concurrency logic and didn't want to pull in a whole dep for it).

Why fast-xml-parser and not a "real" RSS library

I genuinely considered writing the XML parser by hand. The feed-parser Rust sibling does exactly that — it's only the subset of XML you need for RSS, and the walker is a few hundred lines of regex-and-state-machine code.

In TypeScript I talked myself out of it. Two reasons:

  1. You still end up dealing with the long tail of feed weirdness: CDATA sections, entity escaping, mixed-case tags, namespaces in Atom. A hand-written walker will handle 80% of feeds and then quietly corrupt the other 20%.
  2. fast-xml-parser is about 80 KB, has zero native dependencies, is well maintained, and handles the weirdness. The maintainer is diligent. The API is small enough that I never felt "locked in" — my parser.ts is the only file that imports it, and every other module consumes a normalized ParsedFeed shape I control.

The rule I follow in this kind of project: if a dep has to exist, it should be confined to one file, and the rest of the code should consume a shape I designed, not a shape the library handed me. That makes swapping the dep a one-file operation.
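
To make that boundary concrete, here is a sketch of what the normalized shape could look like. `ParsedFeed` and `ParsedItem` are names the post uses; the exact fields and the `normalizeItem` helper below are my assumptions for illustration, not the repo's actual definitions:

```typescript
// Illustrative normalized shape; field names beyond ParsedFeed/ParsedItem
// are assumptions, not the repo's actual types.
interface ParsedItem {
  guid: string | null;
  link: string | null;
  title: string | null;
  published: string | null; // ISO 8601, or null if the feed gave no usable date
  source_url: string;       // which feed this item came from
}

interface ParsedFeed {
  url: string;
  title: string | null;
  items: ParsedItem[];
}

// The adapter boundary: only parser.ts ever sees the library's raw output.
// Everything downstream consumes ParsedItem.
function normalizeItem(raw: Record<string, unknown>, sourceUrl: string): ParsedItem {
  const text = (v: unknown): string | null =>
    typeof v === "string" && v.trim().length > 0 ? v.trim() : null;
  return {
    guid: text(raw["guid"]),
    link: text(raw["link"]),
    title: text(raw["title"]),
    published: text(raw["pubDate"]),
    source_url: sourceUrl,
  };
}
```

With this shape in place, swapping fast-xml-parser for something else means rewriting only the input side of the normalizer.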

Design piece 1: the dedup decision tree

This is the heart of the thing. Two feeds that are mirrors of each other will rarely agree on a perfect identifier. Here's the tree I ended up with:

export function fingerprint(item: ParsedItem): string {
  if (item.guid && item.guid.trim().length > 0) {
    return `g:${item.guid.trim()}`;
  }
  const canonicalLink = canonicalize(item.link);
  if (canonicalLink && item.published) {
    return `l:${canonicalLink}|${item.published}`;
  }
  if (canonicalLink) {
    return `l:${canonicalLink}`;
  }
  if (item.title && item.published) {
    return `t:${item.title.trim()}|${item.published}`;
  }
  // Nothing to go on — make every such item unique so we don't
  // accidentally collapse distinct "untitled" rows into one.
  return `u:${item.source_url}:${Math.random()}`;
}

The order matters, and each fallback has an argument behind it:

1. Prefer <guid> / <atom:id>. These are supposed to be globally unique. In practice they are... mostly. A trim takes care of leading/trailing whitespace, which is the most common source of false misses.

2. Fall back to canonical link + published date. The link alone isn't enough, because some publishers republish the same URL with a new date when they update an article. If you dedup on link-only, you silently drop the update from your reader. With link + date, the update is a distinct fingerprint — you see both, which is the behavior I want.

3. Canonical link alone. When the first two failed and the feed didn't give you a date. Better than nothing.

4. Title + published. When the feed didn't give you a link either. This happens with some badly-formed Atom feeds where every <link> is rel="self".

5. Give up. Generate a unique random key so the item still appears exactly once and doesn't fight with other "untitled" items.
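
Downstream, the merge only needs a seen-set over those fingerprints. A minimal sketch — the `Item` shape and the reduced fingerprint here are simplified assumptions, not the repo's code:

```typescript
// First occurrence wins, matching the "dedup keeps first" behavior
// described in the post's test list.
type Item = { guid: string | null; title: string | null; published: string | null };

function fp(item: Item): string {
  if (item.guid && item.guid.trim().length > 0) return `g:${item.guid.trim()}`;
  if (item.title && item.published) return `t:${item.title.trim()}|${item.published}`;
  return `u:${Math.random()}`; // nothing to go on: never collapse
}

function dedupe(items: Item[]): Item[] {
  const seen = new Set<string>();
  const out: Item[] = [];
  for (const item of items) {
    const key = fp(item);
    if (!seen.has(key)) {
      seen.add(key);
      out.push(item);
    }
  }
  return out;
}
```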

The "canonical link" logic is the other half of this. Most dupes in practice come from tracking parameters:

export function canonicalize(link: string | null | undefined): string | null {
  if (!link) return null;
  try {
    const u = new URL(link);
    u.hostname = u.hostname.toLowerCase();
    const keep = new URLSearchParams();
    for (const [k, v] of u.searchParams) {
      if (k.toLowerCase().startsWith('utm_')) continue;
      if (k.toLowerCase() === 'fbclid' || k.toLowerCase() === 'gclid') continue;
      keep.append(k, v);
    }
    u.search = keep.toString();
    let result = u.toString();
    if (result.endsWith('/') && u.pathname !== '/') {
      result = result.slice(0, -1);
    }
    return result;
  } catch {
    return link.trim();
  }
}

Lowercase the host (but not the path — some sites are case-sensitive). Strip utm_*, fbclid, gclid. Strip one trailing slash, but preserve the root slash (otherwise https://example.com/ becomes https://example.com which is technically different). This covers about 95% of "same article, different fingerprint" in my testing.

Notice what's not here: no full URL normalization library, no url-normalize, no protocol upgrading, no default-port stripping. You don't need any of that for feed dedup. The canonical form is whatever makes mirrors converge; everything else is yak-shaving.
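
To make the rules concrete, here is the same function again as a self-contained block, with a few worked inputs and the outputs the rules above produce:

```typescript
// canonicalize() reproduced from above, with worked examples.
function canonicalize(link: string | null | undefined): string | null {
  if (!link) return null;
  try {
    const u = new URL(link);
    u.hostname = u.hostname.toLowerCase();
    const keep = new URLSearchParams();
    for (const [k, v] of u.searchParams) {
      if (k.toLowerCase().startsWith("utm_")) continue;
      if (k.toLowerCase() === "fbclid" || k.toLowerCase() === "gclid") continue;
      keep.append(k, v);
    }
    u.search = keep.toString();
    let result = u.toString();
    if (result.endsWith("/") && u.pathname !== "/") {
      result = result.slice(0, -1);
    }
    return result;
  } catch {
    return link.trim();
  }
}

// Tracking params stripped, host lowercased:
// canonicalize("https://Example.com/post?utm_source=x&id=7")
//   → "https://example.com/post?id=7"
// Trailing slash stripped, but root slash preserved:
// canonicalize("https://example.com/post/") → "https://example.com/post"
// canonicalize("https://example.com/")      → "https://example.com/"
// Unparseable input falls back to the trimmed raw string:
// canonicalize("  not a url  ")             → "not a url"
```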

Design piece 2: per-host concurrency, not just a global cap

Most "fetch N URLs in parallel" solutions use one knob: max N in flight. That's the obvious way. It's also the way you get flagged by a feed host when 8 of the 30 URLs in your set happen to live on the same domain.

I wanted two knobs: a global concurrency cap (so a 50-feed request doesn't open 50 sockets) AND a per-host cap (so 8 feeds from one host still behave politely). Here's the loop:

async function mapWithLimits<T>(
  urls: string[],
  globalMax: number,
  perHostMax: number,
  fn: (url: string) => Promise<T>,
): Promise<(T | null)[]> {
  const results: (T | null)[] = new Array<T | null>(urls.length).fill(null);
  let inFlight = 0;
  const hostInFlight = new Map<string, number>();
  let next = 0;

  return new Promise((resolve) => {
    const tick = (): void => {
      if (next >= urls.length && inFlight === 0) {
        resolve(results);
        return;
      }
      while (next < urls.length && inFlight < globalMax) {
        const host = hostOf(urls[next]!);
        if ((hostInFlight.get(host) ?? 0) >= perHostMax) break;
        const i = next++;
        inFlight += 1;
        hostInFlight.set(host, (hostInFlight.get(host) ?? 0) + 1);
        void fn(urls[i]!)
          .then((v) => { results[i] = v; })
          .catch(() => { results[i] = null; })
          .finally(() => {
            inFlight -= 1;
            hostInFlight.set(host, (hostInFlight.get(host) ?? 1) - 1);
            tick();
          });
      }
    };
    tick();
  });
}

The tick() function is called whenever something finishes, and it tries to launch as many new fetches as both caps allow. If the global cap has room but the next URL's host is saturated, the loop stops for now — it'll resume as soon as any feed on that host completes.

One subtlety worth flagging: the loop breaks on a saturated host rather than skipping to the next URL. That's intentional. In-order starvation is easier to reason about (and to test) than out-of-order reordering, and you don't need much imagination to see how the fancier version deadlocks.
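
`mapWithLimits` leans on a `hostOf()` helper that isn't shown above. A plausible sketch — an assumption on my part, the repo's version may differ:

```typescript
// Assumed helper: bucket URLs by hostname, and never throw —
// a bad URL will surface as a fetch error later anyway.
function hostOf(url: string): string {
  try {
    return new URL(url).hostname.toLowerCase();
  } catch {
    return url; // unparseable URL: give it its own bucket
  }
}
```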

Testing this was the question I cared about most. I didn't want to mock setTimeout globally — that gets ugly fast. Instead I made the fetcher transport injectable, and the test passes in a fake that sleeps for 30 ms and tracks peak in-flight count:

it('respects per-host concurrency=1 so same-host feeds are sequential', async () => {
  let inFlight = 0;
  let peak = 0;
  const transport: Transport = async (_url, _init) => {
    inFlight += 1;
    peak = Math.max(peak, inFlight);
    await new Promise((res) => setTimeout(res, 30));
    inFlight -= 1;
    return new Response(fix('rss2-a.xml'), { status: 200 });
  };
  const app = createApp({
    fetcherOptions: { transport },
    perHostConcurrency: 1,
  });
  const res = await app.request(
    '/aggregate?url=https://same.test/a&url=https://same.test/b&url=https://same.test/c',
  );
  expect(res.status).toBe(200);
  expect(peak).toBe(1);
});

Three URLs, same host, perHostConcurrency=1, peak in-flight must be 1. If the loop were wrong the peak would be 3 and the test would scream.

Design piece 3: partial failures as a design principle

When you aggregate 30 feeds, on any given day 1–3 of them will be timing out, returning HTML (because the CDN routed you to a maintenance page), or 301-ing to a URL whose SSL cert just expired. The aggregator has a choice:

Option A: fail the whole request, return a 500, and let the client figure it out.

Option B: succeed with whatever feeds worked, and list the broken ones alongside the successes.

Option A is the default most libraries give you because it's what Promise.all does. Option B is the one a human wants. I went with B, and it shows up in the errors[] field of every aggregate response:

{"items":[...],"feed_count":29,"item_count":20,"errors":[{"url":"https://broken.example/feed.xml","reason":"http: http 500"}],"cache_hits":0,"cache_misses":30}

The client can choose what to do with errors[] — render them in the UI, log them, page the on-call — but the items field is still useful. A single bad feed doesn't cost the user their morning digest.

Implementation is a 5-line catch inside the per-feed resolver:

try {
  const fetched = await fetchFeed(url, fetcherOptions);
  const parsed = parseFeed(fetched.body, url);
  cache.set(url, parsed);
  return parsed;
} catch (err) {
  errors.push({ url, reason: describeError(err) });
  return null;
}

null flows through the merge as "skip", and errors[] is the side-channel. The interesting part is that this is contagious — once you commit to partial success at one layer, you find yourself wanting it at every layer. The parser's parseDate returns null on unparseable dates instead of throwing. The canonicalizer's URL parse falls back to the raw string. None of these individual choices are heroic; the discipline is being consistent about it.
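
The repo's `parseDate` isn't shown in this post; here is a sketch of that null-on-failure style, leaning on `Date.parse` — an assumption on my part, the actual implementation may be stricter:

```typescript
// Null-on-failure date parsing. Date.parse handles both RFC 822 dates
// ("Mon, 01 Jan 2024 00:00:00 GMT", the RSS 2.0 style) and ISO 8601
// (the Atom style) in practice.
function parseDate(raw: string | null | undefined): string | null {
  if (!raw) return null;
  const ms = Date.parse(raw.trim());
  if (Number.isNaN(ms)) return null; // unparseable: skip, don't throw
  return new Date(ms).toISOString();
}
```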

Design piece 4: the TTL cache as the cheap backbone

Every call to /aggregate with the same URL hits the same feed. The feed probably updates once an hour. There is no scenario in which you want to actually fetch the feed every time a user hits the endpoint.

So: in-memory TTL cache, keyed by URL, value is the parsed feed. 5-minute default TTL. The cache takes an injectable clock so tests can advance time deterministically:

export class TTLCache<T> {
  private readonly store = new Map<string, Slot<T>>();
  private readonly ttlMs: number;
  private readonly now: Clock;
  private hits = 0;
  private misses = 0;

  constructor(ttlMs: number, now: Clock = Date.now) {
    if (!Number.isFinite(ttlMs) || ttlMs < 0) {
      throw new RangeError(`ttlMs must be >= 0, got ${ttlMs}`);
    }
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): T | undefined {
    const slot = this.store.get(key);
    if (!slot) { this.misses += 1; return undefined; }
    if (this.now() - slot.fetchedAt > this.ttlMs) {
      this.store.delete(key); // lazy expiry
      this.misses += 1;
      return undefined;
    }
    this.hits += 1;
    return slot.value;
  }
  // ...
}

Two things worth flagging:

Lazy expiry. Expired slots are dropped on read, not by a sweep timer. A timer would be one more moving part; lazy expiry is two extra lines and has no ordering bugs. The downside — zombie entries that nobody reads stay in memory until the process restarts — is fine because the whole cache is bounded by the number of unique feed URLs you've ever requested. If you're aggregating 10,000 distinct feeds across a long-running process, you have bigger architectural choices to make than whether to set an interval.

Injectable clock. The alternative is vi.useFakeTimers(). That works, but it makes the test reach into vitest's internals, and it interferes with any code in the system that actually wants to sleep. A parameter is simpler and composes better:

let now = 0;
const c = new TTLCache<string>(1000, () => now);
c.set('k', 'v');
now = 2000;
expect(c.get('k')).toBeUndefined();

A few lines of test, no global state mutation.

The cache hit/miss counters show up in /health:

{"status":"ok","version":"0.1.0","cache_hits":142,"cache_misses":38,"cache_size":12}

— which is the "how is the cache doing?" question you'll ask five minutes after deploying it.

Tradeoffs (so you know what you're getting)

Let me be honest about what this thing isn't:

  • No persistence. Restart the process, the cache is empty. For a small service in front of a reader UI this is fine — the first request after restart is slow, subsequent ones are fast. If you need persistence across restarts, you want Redis or Miniflux, not this.
  • No full-text search. /aggregate doesn't let you filter on anything but URL + limit. If you want "everything mentioning Kubernetes", pipe the output to jq.
  • No push, only poll. There is no WebSub / PubSubHubbub support. The worst-case latency for a new item is ttlMs + the poll interval of your cron. For a morning digest that's fine.
  • No OPML import. You pass URLs in the query string or a JSON body. If you have an OPML file, xmllint --xpath it into a shell loop and POST the result.
  • No auth. Put it behind whatever you put everything else behind — Cloudflare Access, a reverse proxy, a VPN.
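
For the OPML case, the shell loop can also be a dozen lines of in-process code. A dependency-free sketch — it assumes feed URLs live in xmlUrl="…" attributes, which is standard OPML, and `opmlToUrls` is my name, not the repo's:

```typescript
// Pull every xmlUrl attribute out of an OPML export and dedupe.
// A regex is fine here: OPML attribute values are plain quoted strings.
function opmlToUrls(opml: string): string[] {
  const urls: string[] = [];
  for (const m of opml.matchAll(/\bxmlUrl="([^"]+)"/g)) {
    urls.push(m[1]!);
  }
  return [...new Set(urls)]; // drop duplicate subscriptions
}
```

You'd then POST the resulting list as the JSON body of an aggregate request (check the repo's schema for the exact body shape).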

If none of these are dealbreakers, this is the right size. If any of them are, you want a bigger tool.

Try it in 30 seconds

Node 20 or Docker, pick one.

git clone https://github.com/sen-ltd/rss-aggregator
cd rss-aggregator
npm install
npm run dev &

curl -sS 'http://localhost:8000/aggregate?url=https://hnrss.org/frontpage&url=https://lobste.rs/rss&limit=5' | jq .
curl -sS 'http://localhost:8000/health'

Or the container:

docker build -t rss-aggregator .
docker run --rm -p 8000:8000 rss-aggregator

Alpine image is 147 MB, non-root user, zero runtime filesystem writes.

What the test surface looks like

46 tests, no network, fixture-driven. Distribution:

  • parser.test.ts (11) — RSS 2.0 field mapping, Atom field mapping, rel="alternate" preference, <content> fallback, error cases, RFC 822 → ISO date normalization.
  • dedup.test.ts (13) — canonicalize rules (utm, trailing slash, root slash, invalid), fingerprint layer ordering, dedup keeps first, sort order (newest first, date-less to bottom).
  • cache.test.ts (7) — hit, miss, expiry, overwrite refreshes, delete, clear, invalid TTL.
  • http.test.ts (15) — the full HTTP surface via app.request(): GET/POST aggregate, merge + dedup + sort truncation, partial failure isolation, cache hit on second call, Atom + RSS in the same request, include_content drops summary, 400 on non-URL, 400 on bad JSON, 502 on upstream fail, 400 on missing param, per-host concurrency cap.

The per-host concurrency test is my favorite because it's the one that would be painful with a real HTTP mock: I get to count peak in-flight calls deterministically because the transport is just a plain async function.

What I'd change if this were bigger

  • Compression. Accept gzip / br for outbound requests. Most feed hosts serve compressed, and fetch() negotiates by default, but the size cap runs against the decoded body — a 2 MB decoded feed could be 400 KB on the wire. Fine for now.
  • Conditional requests. If-Modified-Since / ETag would let us skip downloads entirely when the feed hasn't changed. Worth it if you're running this against 100+ feeds on a 1-minute poll.
  • Disk cache. For crash resilience and cold-start speed. Probably sqlite + one table keyed on URL. But this is the exact feature that turns a "tiny service" into "another Miniflux", and that's not what I wanted.
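
To sketch what the conditional-request upgrade could look like — names here are illustrative, not from the repo — the idea is to cache validators next to the parsed feed, echo them on the next poll, and treat a 304 as a cache refresh:

```typescript
// Validators cached alongside the parsed feed entry.
type Validators = { etag?: string; lastModified?: string };

// Headers to send on the next poll of the same feed.
function conditionalHeaders(v?: Validators): Record<string, string> {
  const h: Record<string, string> = {};
  if (v?.etag) h["If-None-Match"] = v.etag;
  if (v?.lastModified) h["If-Modified-Since"] = v.lastModified;
  return h;
}

// Validators to remember from a fresh 200 response.
function validatorsFrom(headers: Headers): Validators {
  return {
    etag: headers.get("etag") ?? undefined,
    lastModified: headers.get("last-modified") ?? undefined,
  };
}
```

On a 304 the cached parsed feed stays valid and the TTL clock resets; on a 200 you re-parse and store the new validators.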

Closing

The thesis of this build was: there is a real, sharp-edged gap between "curl a feed" and "self-hosted reader", and a 600-line Hono service fits right inside it. The dedup tree is the one piece of non-obvious design; everything else is the usual discipline of making the moving parts injectable and then writing tests that verify each one in isolation.

If you've been wanting one JSON endpoint that merges your sources, you can have it today. The whole repo is small enough to read over one coffee.

📦 GitHub: https://github.com/sen-ltd/rss-aggregator

If Rust is more your speed, the companion project is feed-parser — same problem space, hand-written XML walker, no dependencies, even smaller container.

Source: dev.to
