One install, many customers: building airtight multi-tenancy into a self-hosted security platform

I build a self-hosted security platform in Go. Today I shipped one of its biggest capabilities: one install that can host many customers, each one fully sealed off from the others. Multi-tenancy, done so the walls between customers are airtight.

Here's what I built and how it works.

The goal: two deployment models from one codebase

I wanted both of these to be true at the same time:

One install per customer — the simple, self-contained mode (everything defaults to a single tenant).
One install, many customers — a central server hosting a whole book of clients, each isolated.

The second is just the first with more than one customer in it. The hard part is the word airtight: Customer A must never read or touch Customer B's data, by any path. So I designed isolation in two directions — the read side (what an operator sees) and the write side (what agents on customer machines send in).

Read side: every query knows its customer

Every table already carried a tenant_id. The work was making sure every by-id lookup is scoped to the customer you're actually viewing:

func (db DB) GetIncident(id, tenantID int64) (IncidentRow, error) {
    // ... WHERE id = ? AND tenant_id = ?
    // an id from another customer simply returns "not found"
}

The operator's current customer lives in their session; the dashboard and the API both resolve it the same way and thread it down to the store. Pick a customer in the switcher and the entire dashboard — every page, every alert, every agent — follows it. Default to a single tenant and the behavior is identical to the single-install mode, so nothing about the simple path changes.

I locked this down at the data layer rather than trusting handlers to remember, and I proved it with an adversarial test: stand up two customers, then assert one can't read or mutate the other across every domain. That test is the real deliverable — the scoping is just what makes it pass.
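
A minimal sketch of what such a two-customer check can look like, assuming a hypothetical in-memory store in place of the real database layer. The `memStore` type, `ErrNotFound`, `RenameIncident`, and the incident fields are illustrative names, not the platform's actual schema or API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned for missing rows and for rows owned by another
// tenant, so a caller cannot tell the two cases apart.
var ErrNotFound = errors.New("not found")

// Incident is an illustrative row; every row carries its tenant id.
type Incident struct {
	ID       int64
	TenantID int64
	Title    string
}

// memStore is a hypothetical in-memory stand-in for the real store.
type memStore struct {
	nextID    int64
	incidents map[int64]Incident
}

func newMemStore() *memStore {
	return &memStore{incidents: make(map[int64]Incident)}
}

func (s *memStore) CreateIncident(tenantID int64, title string) int64 {
	s.nextID++
	s.incidents[s.nextID] = Incident{ID: s.nextID, TenantID: tenantID, Title: title}
	return s.nextID
}

// GetIncident mirrors "WHERE id = ? AND tenant_id = ?".
func (s *memStore) GetIncident(id, tenantID int64) (Incident, error) {
	inc, ok := s.incidents[id]
	if !ok || inc.TenantID != tenantID {
		return Incident{}, ErrNotFound
	}
	return inc, nil
}

// RenameIncident is a write path scoped the same way as the read path.
func (s *memStore) RenameIncident(id, tenantID int64, title string) error {
	inc, err := s.GetIncident(id, tenantID)
	if err != nil {
		return err
	}
	inc.Title = title
	s.incidents[id] = inc
	return nil
}

// crossTenantLeaks plays tenant B against tenant A's data and returns a
// description of every isolation failure it finds.
func crossTenantLeaks(s *memStore) []string {
	const tenantA, tenantB = 1, 2
	idA := s.CreateIncident(tenantA, "A's incident")

	var leaks []string
	if _, err := s.GetIncident(idA, tenantB); !errors.Is(err, ErrNotFound) {
		leaks = append(leaks, "tenant B read tenant A's incident")
	}
	if err := s.RenameIncident(idA, tenantB, "tampered"); !errors.Is(err, ErrNotFound) {
		leaks = append(leaks, "tenant B mutated tenant A's incident")
	}
	if inc, err := s.GetIncident(idA, tenantA); err != nil || inc.Title != "A's incident" {
		leaks = append(leaks, "tenant A's incident changed or became unreadable")
	}
	return leaks
}

func main() {
	leaks := crossTenantLeaks(newMemStore())
	fmt.Println("isolation failures:", len(leaks)) // prints "isolation failures: 0"
}
```

The same pattern extends to one read probe and one write probe per domain, which is what "across every domain" amounts to in practice.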

Write side: bind the agent to the customer at install time

Agents run on a customer's machines (think a ZDS agent on their domain controller) and report back. Their data has to land in their tenant and nowhere else. Before building this I went and read how the people who do it at scale handle it.

Some vendors bind an agent to a customer at install time with a token: a per-organization secret baked into the deploy. The agent never gets to say which customer it belongs to; the token has already decided. Others pin the agent channel with a certificate so nothing can impersonate the agent. Both are good models, so I built the same thing.

Each customer gets a per-tenant enrollment secret. An agent can only register into the customer whose secret it presents:

// constant-time compare; empty secret = "no token required" (single-install back-compat)
if enrollSecret != "" &&
   subtle.ConstantTimeCompare([]byte(req.EnrollSecret), []byte(enrollSecret)) != 1 {
    writeError(w, http.StatusForbidden, "invalid or missing enrollment secret")
    return
}
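
One detail worth knowing about this check: `subtle.ConstantTimeCompare` returns 0 immediately when the two slices differ in length, so the comparison time can reveal the secret's length. A possible hardening, sketched here as an assumption rather than as what the platform does, is to hash both sides first so the compared values are always 32 bytes. The `enrollAllowed` helper is a hypothetical name:

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// enrollAllowed reports whether an enrollment request may proceed.
// An empty expected secret means "no token required", matching the
// single-install back-compat rule described above.
func enrollAllowed(presented, expected string) bool {
	if expected == "" {
		return true
	}
	// Hash both sides so the compared slices always have equal length;
	// ConstantTimeCompare returns early on a length mismatch.
	p := sha256.Sum256([]byte(presented))
	e := sha256.Sum256([]byte(expected))
	return subtle.ConstantTimeCompare(p[:], e[:]) == 1
}

func main() {
	fmt.Println(enrollAllowed("s3cret", "s3cret")) // true
	fmt.Println(enrollAllowed("guess", "s3cret"))  // false
	fmt.Println(enrollAllowed("", "s3cret"))       // false
	fmt.Println(enrollAllowed("anything", ""))     // true (no token required)
}
```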

Every agent also carries its own mTLS client certificate, issued at enrollment. So when an agent reports in, the body's agent_id has to match the certificate's common name, and the tenant always comes from the enrolled record — never from anything the request claims:

// the agent id is client-supplied; bind it to the verified cert so an agent
// can't report (and write) as another agent / another customer
if !s.agentCertMatches(r, req.AgentID) {
    writeError(w, http.StatusForbidden, "client certificate does not match agent id")
    return
}
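
The body of `agentCertMatches` isn't shown above. A minimal sketch of one way to write such a check with the standard library follows; `certMatchesAgent` and `requestWithCN` are hypothetical names. It reads the leaf from `VerifiedChains`, which Go only populates when the server is configured to verify client certificates, so an unverified certificate never counts as a match:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"net/http"
)

// certMatchesAgent reports whether the verified client certificate's
// common name equals the agent id claimed in the request body.
func certMatchesAgent(r *http.Request, agentID string) bool {
	if agentID == "" || r.TLS == nil {
		return false
	}
	chains := r.TLS.VerifiedChains
	if len(chains) == 0 || len(chains[0]) == 0 {
		return false // no verified client certificate on this connection
	}
	leaf := chains[0][0] // first element of a verified chain is the leaf
	return leaf.Subject.CommonName == agentID
}

// requestWithCN fabricates a request whose verified leaf certificate has
// the given common name. It exists only to exercise the check.
func requestWithCN(cn string) *http.Request {
	leaf := &x509.Certificate{Subject: pkix.Name{CommonName: cn}}
	return &http.Request{TLS: &tls.ConnectionState{
		VerifiedChains: [][]*x509.Certificate{{leaf}},
	}}
}

func main() {
	r := requestWithCN("agent-123")
	fmt.Println(certMatchesAgent(r, "agent-123"))                // true
	fmt.Println(certMatchesAgent(r, "agent-999"))                // false
	fmt.Println(certMatchesAgent(&http.Request{}, "agent-123")) // false: no TLS state
}
```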

The control plane gets the same treatment: an operator can only see and command agents that belong to the customer they're viewing.

// 404 unless the agent belongs to the caller's tenant — you can't read, control,
// or even confirm the existence of another customer's machines
func (s *Server) agentInTenant(w http.ResponseWriter, r *http.Request, agentID string) *model.Agent {
    a, _ := s.db.GetAgentByAgentID(agentID)
    if a == nil || a.TenantID != tenantID(r) {
        writeError(w, http.StatusNotFound, "agent not found")
        return nil
    }
    return a
}

Products per customer

The same platform sells differently to different customers, so add-on capabilities — EDR, identity threats, attack-surface, patch management, and the rest — are toggled per customer. The nav, the route gating, and the settings toggles all key off the current customer. A new customer starts with the core platform and nothing else; you turn on what they bought. There's a Customers screen to create a customer (which hands you its one-time enrollment token), a switcher to move between them, and a one-click way to rotate a token if it's ever exposed — old agents keep running, only new enrollments need the new one.
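
Route gating of this kind can be sketched as a small middleware. Everything here is an assumption for illustration: the `entitlements` map, the `requireProduct` wrapper, the `"edr"` product key, and the `tenantOf` resolver, which in the real platform would read the operator's session rather than a fixed value:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// entitlements maps tenant id -> product key -> enabled.
type entitlements map[int64]map[string]bool

// enabled is false for unknown tenants and unknown products, so a new
// customer starts with no add-ons.
func (e entitlements) enabled(tenantID int64, product string) bool {
	return e[tenantID][product]
}

// requireProduct wraps next so it is only reachable when the current
// tenant has the product turned on; otherwise the route looks absent.
func requireProduct(e entitlements, product string, tenantOf func(*http.Request) int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !e.enabled(tenantOf(r), product) {
			http.NotFound(w, r)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor issues one request to a gated route as the given tenant and
// returns the HTTP status code.
func statusFor(e entitlements, tenantID int64) int {
	page := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := requireProduct(e, "edr", func(*http.Request) int64 { return tenantID }, page)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/edr", nil))
	return rec.Code
}

func main() {
	e := entitlements{1: {"edr": true}, 2: {}}
	fmt.Println(statusFor(e, 1)) // 200
	fmt.Println(statusFor(e, 2)) // 404
}
```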

Proving it, not assuming it

Isolation isn't a property you can take on faith from the design; it's a claim you have to keep attacking until the attacks stop working. So there are two layers of adversarial tests: a store-level suite that proves cross-tenant reads and writes fail across every domain, and an end-to-end HTTP test that stands up two customers and an agent and confirms one can't read another's data or dispatch a command to another's machine. A couple of deliberate trade-offs are written down rather than left implicit: the operator API key is the admin credential, not a tenant boundary, and mTLS has to be on for the agent-identity binding to hold.

Tamper-evidence: the records can't be quietly rewritten

Isolation keeps customers apart. The other half of trust is making sure the record of what happened can't be silently altered — including by someone who's already inside the box. For that, the platform anchors its important events to an immutable, hash-chained, signed ledger I built called SaintChain.

Four kinds of event get anchored: audit actions (who did what), finding creation and state changes, backup events (including each scan run's Merkle root), and chain-of-custody evidence handling. Each one is a hash chained onto the last and signed, so any after-the-fact edit breaks the chain and chain verify catches it.
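
To make the "hash chained onto the last and signed" idea concrete, here is a generic sketch of a hash-chained, signed log. It is not SaintChain's actual format: the `Entry` layout, the choice of SHA-256 and Ed25519, and all function names are assumptions for illustration.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

// Entry is one link: it commits to the previous entry's hash and is signed.
type Entry struct {
	PrevHash [32]byte
	Payload  []byte
	Hash     [32]byte // SHA-256 over PrevHash || Payload
	Sig      []byte   // Ed25519 signature over Hash
}

func entryHash(prev [32]byte, payload []byte) [32]byte {
	return sha256.Sum256(append(prev[:], payload...))
}

// appendEntry links payload onto the end of the chain and signs it.
func appendEntry(chain []Entry, priv ed25519.PrivateKey, payload []byte) []Entry {
	var prev [32]byte // zero hash for the first entry
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	h := entryHash(prev, payload)
	return append(chain, Entry{PrevHash: prev, Payload: payload, Hash: h, Sig: ed25519.Sign(priv, h[:])})
}

// verify walks the chain and returns the index of the first bad entry,
// or -1 if every link and signature checks out.
func verify(chain []Entry, pub ed25519.PublicKey) int {
	var prev [32]byte
	for i, e := range chain {
		if e.PrevHash != prev ||
			entryHash(e.PrevHash, e.Payload) != e.Hash ||
			!ed25519.Verify(pub, e.Hash[:], e.Sig) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

// newDemoChain builds a three-entry chain with a fresh key pair.
func newDemoChain() ([]Entry, ed25519.PublicKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil means crypto/rand
	if err != nil {
		panic(err)
	}
	var chain []Entry
	for _, ev := range []string{"audit: login", "finding: created", "backup: completed"} {
		chain = appendEntry(chain, priv, []byte(ev))
	}
	return chain, pub
}

func main() {
	chain, pub := newDemoChain()
	fmt.Println(verify(chain, pub)) // -1: chain intact

	chain[1].Payload = []byte("finding: deleted") // after-the-fact edit
	fmt.Println(verify(chain, pub))               // 1: first broken entry
}
```

Because each hash covers the previous hash, recomputing an edited entry's hash is not enough for an attacker: every later entry would also need re-hashing and re-signing, which requires the signing key.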

It runs in two modes:

# embedded ledger — zero config, on by default
zds-core serve --chain
# tamper-PROOF — anchor to a separate SaintChain daemon (its own OS user/process)
zds-core serve --chain-endpoint http://127.0.0.1:7433

Embedded is the convenient default. Remote is the strong one: because the SaintChain daemon runs as its own process and owns the signing key, a compromised app can append to history but can't rewrite it — the thing holding the pen isn't the thing being audited. Add an off-host checkpoint of the chain head and even a fully-compromised host can't quietly edit the past without the discrepancy surfacing somewhere it doesn't control.

So multi-tenancy gives you walls between customers; SaintChain gives you a record of what happened inside those walls that you can actually trust.

Then I made it teach itself

A capability you can't explain isn't worth much, so the second half of the day was documentation — an architecture write-up and a full operator manual for the dashboard.

Then I built a pipeline to turn that manual into narrated how-to videos, entirely locally, no paid services:

Voice — Piper TTS, a neural voice rendered offline.
Screenshots — headless Chrome driving the real dashboard, seeded with demo data.
Assembly — ffmpeg: clean slides with soft fades, a mouse pointer that glides to whatever the narration is describing, and live subtitles synced sentence-by-sentence (I render each sentence's audio separately so the timing is exact).

A few hard-won ffmpeg notes, in case they save you an evening:

ffmpeg inside a while read loop eats the loop's stdin and corrupts your file list — pass -nostdin.
An unbounded -loop 1 overlay input makes ffmpeg encode frames forever — bound it with -t and overlay=...:shortest=1.
The Ken Burns zoompan jitters on slow moves because it crops on integer pixels; static slides look smoother than bad motion.
TTS that peaks at 0 dB clips and sounds harsh — loudnorm to -16 LUFS with headroom fixes it.
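
For the loudness note, here is a sketch of driving the normalization step from Go instead of a shell loop. The file names are placeholders, and the `TP=-1.5` and `LRA=11` values are assumed settings to provide the headroom; only the -16 LUFS target comes from the notes above. When `Stdin` is left nil, Go's `os/exec` connects the child to the null device, so the stdin problem is specific to shell loops, but `-nostdin` is kept so the same arguments are safe anywhere:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// normalizeArgs builds the ffmpeg arguments that loudness-normalize one
// narration clip to -16 LUFS.
func normalizeArgs(in, out string) []string {
	return []string{
		"-nostdin", // never read from the caller's stdin
		"-y",       // overwrite the output file if it exists
		"-i", in,
		"-af", "loudnorm=I=-16:TP=-1.5:LRA=11", // TP and LRA are assumed values
		out,
	}
}

func main() {
	// Build the command and print it; call cmd.Run() to actually execute it.
	cmd := exec.Command("ffmpeg", normalizeArgs("sentence-01.wav", "sentence-01.norm.wav")...)
	fmt.Println(strings.Join(cmd.Args, " "))
}
```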

Nine videos, about five and a half minutes, the whole dashboard explained in its own voice. The pipeline is four small scripts; adding a section is editing a spec file.

Where this leaves the platform

One server can now run a whole roster of customers with real walls between them, agents that can only ever belong to the customer they were deployed for, products sold per customer, and a documentation set that explains the whole thing. It's the difference between a tool I run for myself and a platform I could run for many — built the way I want to build everything: airtight, and proven.

You can find the platform at szdsecurity.com.