Building a Video Health Probe with Prometheus Exporters in Go

At ViralVidVault we surface thousands of European viral videos every hour, and last quarter we noticed an ugly pattern. About four percent of our recommended videos were returning 410 or 451 within hours of being indexed: geo-blocked, region-restricted, or pulled by uploaders. Users would land on a "Video unavailable" placeholder from a feed we had supposedly curated, and the failure did not show up in any backend log because the YouTube iframe failed client-side after the page had already been served as HTTP 200 from LiteSpeed.

I needed continuous, cheap, server-side probing of every active embed in our SQLite catalog, with Prometheus-shaped output so our existing Grafana and Alertmanager pipeline could consume it without standing up a new control plane. This article walks through the exporter we wrote in Go, the trade-offs of running it next to a PHP 8.4 plus LiteSpeed stack, and how we route the alerts back into the publishing workflow without rewriting our schema layer.

Why a separate Go exporter

Our main stack is PHP 8.4 on LiteSpeed with a SQLite WAL backing store and Cloudflare Workers in front. PHP is excellent for templated pages and cron-driven feed building, but a continuous probe loop is exactly the wrong workload for it.

  • Each probe is a network-bound HTTP request that can stall for up to thirty seconds.
  • We need bounded concurrency across sixty to one hundred thousand tracked videos per region.
  • The Prometheus /metrics endpoint must respond within a few hundred milliseconds even while the loop is hot.

PHP-FPM under LiteSpeed would either tie up workers or require a separate long-running daemon, and you end up reimplementing goroutines on top of pcntl_fork. Go gives us cheap goroutines, a battle-tested prometheus/client_golang library, and a single static binary we can drop next to the web server with a systemd unit. The exporter listens on 127.0.0.1:9114 and is scraped by Prometheus over the loopback only. Cloudflare never sees it, and there is no public attack surface to harden.

Metric design before code

This is the part most exporter tutorials skip, and it is the part that locks you into bad dashboards for years. We landed on four metrics after about an hour of whiteboard arguing.

  • viralvidvault_video_probe_total{region,result} is a counter incremented per probe, with the result label taking one of ok, unavailable, geoblocked, private, timeout, or network_error.
  • viralvidvault_video_probe_duration_seconds{region} is a histogram with buckets aligned to our latency SLO: 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s.
  • viralvidvault_video_catalog_size{region,status} is a gauge of how many videos we currently track per region per last-known state.
  • viralvidvault_video_probe_lag_seconds{region} is a gauge of seconds since the oldest due video in the queue was probed.
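
To make the bucket layout concrete, here is a minimal sketch of how a single observation maps onto the cumulative "less than or equal" buckets listed above. The bucketFor helper is hypothetical and exists only for illustration; in the exporter the client library does this bookkeeping itself.

```go
package main

import "sort"

// probeBuckets mirrors the upper bounds listed above, in seconds.
var probeBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5}

// bucketFor returns the index of the first bucket whose upper bound is
// greater than or equal to seconds. A return value of len(probeBuckets)
// means the observation only lands in the implicit +Inf bucket.
func bucketFor(seconds float64) int {
    return sort.SearchFloat64s(probeBuckets, seconds)
}
```

A 300ms probe lands in the 500ms bucket, while anything slower than five seconds is only counted in +Inf, so this histogram cannot tell a six-second probe from an eight-second one. If slow probes matter to you, that is an argument for adding a bucket near your client timeout.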

Three rules I follow whenever I add a metric.

  1. Cardinality budget per metric. Eight regions times six result values is forty-eight series for the probe counter, or fifty-six once you count the http_other fallback that the probe code emits for unexpected status codes. That is acceptable. I do not put video IDs in labels because that would explode to over one hundred thousand series and crash the local Prometheus instance inside a day.
  2. One unit per name. _seconds, _bytes, _total. No duration_ms, no latency. Pick the SI unit and stick to it across the whole exporter.
  3. Histograms for latency, counters for events, gauges for state. Mixing these is the single most common rookie mistake and it makes Grafana queries painful forever.
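
The cardinality rule is just multiplication, and it is worth writing down before a label ships. The sketch below is a hypothetical helper, not part of the exporter: it takes the number of distinct values each label can hold and returns the worst-case series count for one metric.

```go
package main

// seriesCount multiplies the number of distinct values each label can
// take. The product is the worst-case number of time series for a metric.
func seriesCount(labelValues ...int) int {
    total := 1
    for _, n := range labelValues {
        total *= n
    }
    return total
}
```

For the probe counter, seriesCount(8, 6) gives 48. Adding a video ID label with one hundred thousand values would multiply that figure by one hundred thousand, which is why it stays out.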

The collector skeleton

Here is the wiring for the metrics and the HTTP handler. This is the file you copy on every new exporter.

package main

import (
    "context"
    "database/sql"
    "log/slog"
    "net/http"
    "os"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    _ "modernc.org/sqlite"
)

var (
    probeTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: "viralvidvault",
            Subsystem: "video",
            Name:      "probe_total",
            Help:      "Total embed probes performed, labelled by region and result.",
        },
        []string{"region", "result"},
    )
    probeDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "viralvidvault",
            Subsystem: "video",
            Name:      "probe_duration_seconds",
            Help:      "Wall-clock duration of a single embed probe.",
            Buckets:   []float64{.05, .1, .25, .5, 1, 2, 5},
        },
        []string{"region"},
    )
    catalogSize = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Namespace: "viralvidvault",
            Subsystem: "video",
            Name:      "catalog_size",
            Help:      "Number of videos tracked, by region and probe status.",
        },
        []string{"region", "status"},
    )
    probeLag = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Namespace: "viralvidvault",
            Subsystem: "video",
            Name:      "probe_lag_seconds",
            Help:      "Age of the oldest unprobed video in the queue.",
        },
        []string{"region"},
    )
)

func init() {
    prometheus.MustRegister(probeTotal, probeDuration, catalogSize, probeLag)
}

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
    db, err := sql.Open("sqlite", "file:/var/lib/vvv/catalog.db?mode=ro&_pragma=busy_timeout(5000)")
    if err != nil {
        logger.Error("open db", "err", err)
        os.Exit(1)
    }
    defer db.Close()

    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    go runProbeLoop(ctx, logger, db)
    go runCatalogGauges(ctx, logger, db)

    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.Handler())
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    srv := &http.Server{
        Addr:              "127.0.0.1:9114",
        Handler:           mux,
        ReadHeaderTimeout: 3 * time.Second,
    }
    logger.Info("listening", "addr", srv.Addr)
    if err := srv.ListenAndServe(); err != nil {
        logger.Error("serve", "err", err)
    }
}

Two non-obvious choices in there.

First, modernc.org/sqlite is a pure-Go SQLite driver. No CGO required, which means we can cross-compile this on a laptop and scp the binary to the LiteSpeed box without dragging libsqlite headers or build toolchains onto the production host. That matters for hosts that are intentionally minimal.

Second, the database is opened read-only with a five-second busy timeout. The exporter must never block the PHP writer that runs the hourly fetch cron. SQLite in WAL mode handles concurrent readers and a single writer gracefully, but a careless write from the exporter side would create exactly the kind of lock contention that takes a site down at 2am.

The probe itself

This is where the YouTube quirks live. The cheapest, most reliable signal for "is this video viewable from region X" is the public oEmbed endpoint. It returns 200 for viewable videos, 401 for private, 404 or 410 for deleted, and 403 for region-restricted. It does not require an API key and it does not count against our quota.

type probeResult struct {
    label    string
    duration time.Duration
}

var httpClient = &http.Client{
    Timeout: 8 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        50,
        MaxIdleConnsPerHost: 25,
        IdleConnTimeout:     90 * time.Second,
    },
}

func probeVideo(ctx context.Context, videoID, region string) probeResult {
    start := time.Now()
    url := "https://www.youtube.com/oembed?format=json&url=https%3A//www.youtube.com/watch%3Fv%3D" + videoID
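    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        // Malformed URL (for example a bad video ID); nothing was sent.
        return probeResult{"network_error", time.Since(start)}
    }
    req.Header.Set("User-Agent", "ViralVidVault-Probe/1.0 (+https://viralvidvault.com/bots)")
    req.Header.Set("Accept-Language", regionToAcceptLanguage(region))

    resp, err := httpClient.Do(req)
    dur := time.Since(start)
    if err != nil {
        // httpClient.Timeout does not cancel ctx, so ask the error itself
        // whether it timed out; ctx.Err() covers cancellation of ctx.
        if os.IsTimeout(err) || ctx.Err() != nil {
            return probeResult{"timeout", dur}
        }
        return probeResult{"network_error", dur}
    }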
    defer resp.Body.Close()

    switch resp.StatusCode {
    case 200:
        return probeResult{"ok", dur}
    case 401:
        return probeResult{"private", dur}
    case 403:
        return probeResult{"geoblocked", dur}
    case 404, 410:
        return probeResult{"unavailable", dur}
    default:
        return probeResult{"http_other", dur}
    }
}

func runProbeLoop(ctx context.Context, logger *slog.Logger, db *sql.DB) {
    sem := make(chan struct{}, 16)
    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            rows, err := db.QueryContext(ctx, `
                SELECT video_id, region
                FROM videos
                WHERE last_probed_at IS NULL
                   OR last_probed_at < strftime('%s','now') - 3600
                ORDER BY last_probed_at ASC NULLS FIRST
                LIMIT 64
            `)
            if err != nil {
                logger.Error("queue query", "err", err)
                continue
            }
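            // Drain the cursor first so the read transaction is not held
            // open while we wait for a free semaphore slot.
            type job struct{ id, region string }
            var batch []job
            for rows.Next() {
                var j job
                if err := rows.Scan(&j.id, &j.region); err != nil {
                    logger.Error("queue scan", "err", err)
                    continue
                }
                batch = append(batch, j)
            }
            if err := rows.Err(); err != nil {
                logger.Error("queue iterate", "err", err)
            }
            rows.Close()

            for _, j := range batch {
                sem <- struct{}{}
                go func(id, region string) {
                    defer func() { <-sem }()
                    r := probeVideo(ctx, id, region)
                    probeTotal.WithLabelValues(region, r.label).Inc()
                    probeDuration.WithLabelValues(region).Observe(r.duration.Seconds())
                }(j.id, j.region)
            }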
        }
    }
}

A few things that took us pain to learn.

  • Bounded semaphore over a worker pool. Sixteen in-flight probes is enough to cover an eight-region catalog without saturating YouTube's per-IP throttling. We hit 429 responses above thirty-two concurrent connections from the same egress IP.
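  • NULLS FIRST ordering. SQLite already sorts NULLs first in an ascending ORDER BY, so the clause is redundant here, but it documents the intent and protects the query if it is ever ported to an engine such as PostgreSQL, where NULLs sort last by default. If nulls sorted at the end, brand-new videos would starve while the loop chews through old ones forever.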
  • Probe interval per video is one hour. Anything tighter is wasted work because geoblocks rarely flip on a sub-hour cadence, and we burn through goodwill with YouTube. Anything wider and our staleness SLO suffers.
  • The Accept-Language header genuinely changes the response. YouTube serves different availability for the same video ID depending on the negotiated locale, which is exactly what we want when probing as if we were a Polish or German user.
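
The probe code calls regionToAcceptLanguage without showing it. Here is a minimal sketch of what such a helper could look like. The mapping is an assumption: only DE and PL are named in this article, the header values are illustrative, and the English fallback is my choice rather than anything the original code confirms.

```go
package main

// acceptLanguageByRegion is a hypothetical mapping from region code to an
// Accept-Language header value. The entries are illustrative only.
var acceptLanguageByRegion = map[string]string{
    "DE": "de-DE,de;q=0.9,en;q=0.5",
    "PL": "pl-PL,pl;q=0.9,en;q=0.5",
}

// regionToAcceptLanguage returns the header value for a region code and
// falls back to plain English for regions missing from the map.
func regionToAcceptLanguage(region string) string {
    if v, ok := acceptLanguageByRegion[region]; ok {
        return v
    }
    return "en"
}
```

Keeping the mapping in one table makes it easy to audit which locale each region is probed with, and the fallback means an unknown region code degrades to a generic probe instead of sending an empty header.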

Writing probe results back to SQLite

The exporter only reads from SQLite. Writes go through a tiny PHP CLI script that the Go process pipes JSON lines into via stdin. Why split it this way? Because our PHP code already owns the schema migrations, the triggers, and the indexing logic. Moving writes into Go would force us to duplicate all of that, and any future schema change would mean touching two languages.

<?php
declare(strict_types=1);

// vvv-record-probe.php - consumes newline-delimited JSON from stdin
// {"video_id":"abc123","region":"DE","result":"geoblocked","ts":1716636000}

require __DIR__ . '/../app/Database.php';

$db = Database::connect();
$db->exec('PRAGMA journal_mode=WAL');
$db->exec('PRAGMA synchronous=NORMAL');

$stmt = $db->prepare(
    'UPDATE videos
        SET last_probed_at = :ts,
            last_probe_result = :result,
            strike_count = CASE WHEN :result = \'ok\' THEN 0 ELSE strike_count + 1 END
      WHERE video_id = :id AND region = :region'
);

$stdin = fopen('php://stdin', 'rb');
$batch = 0;
$db->beginTransaction();

while (($line = fgets($stdin)) !== false) {
    $row = json_decode($line, true, 8, JSON_THROW_ON_ERROR);
    $stmt->execute([
        ':ts'     => $row['ts'],
        ':result' => $row['result'],
        ':id'     => $row['video_id'],
        ':region' => $row['region'],
    ]);

    if (++$batch % 200 === 0) {
        $db->commit();
        $db->beginTransaction();
    }
}

$db->commit();
fwrite(STDERR, "wrote $batch probe rows\n");

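We batch in transactions of two hundred because every COMMIT carries a fixed cost, even though SQLite WAL handles bursty writes well. With the synchronous=NORMAL pragma set above, WAL mode defers most fsyncs to checkpoints, but each commit still writes its own frames to the WAL, and under synchronous=FULL each one would also fsync. Without batching, sixty-four probes per tick times eight regions is 512 separate commits every two-second tick, and the disk stops being responsive for anything else.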

The Go side just streams the JSON lines into the PHP process via exec.Command and a piped stdin. It is unfashionable in 2026 to use process pipes instead of a gRPC bus, but the entire write path is twelve lines of code, it survives crashes cleanly because PHP commits in batches, and the schema-of-record stays in one language.
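
The Go half of that pipe is not shown above, so here is a minimal sketch of how it could be wired, under a few assumptions. The probeRecord type, marshalLine, and pipeRecords are names I made up for illustration; the JSON field names follow the comment at the top of the PHP script, and the command to run is passed in by the caller rather than hard-coded.

```go
package main

import (
    "context"
    "encoding/json"
    "os"
    "os/exec"
)

// probeRecord matches the JSON shape documented in the PHP script.
type probeRecord struct {
    VideoID string `json:"video_id"`
    Region  string `json:"region"`
    Result  string `json:"result"`
    TS      int64  `json:"ts"`
}

// marshalLine encodes one record as a newline-terminated JSON line.
func marshalLine(rec probeRecord) ([]byte, error) {
    b, err := json.Marshal(rec)
    if err != nil {
        return nil, err
    }
    return append(b, '\n'), nil
}

// pipeRecords starts the recorder command, streams every record from
// results into its stdin, then closes the pipe and waits for it to exit.
func pipeRecords(ctx context.Context, results <-chan probeRecord, name string, args ...string) error {
    cmd := exec.CommandContext(ctx, name, args...)
    cmd.Stderr = os.Stderr // surface the script's "wrote N probe rows" line
    stdin, err := cmd.StdinPipe()
    if err != nil {
        return err
    }
    if err := cmd.Start(); err != nil {
        return err
    }
    var writeErr error
    for rec := range results {
        if writeErr != nil {
            continue // keep draining so senders never block
        }
        line, err := marshalLine(rec)
        if err == nil {
            _, err = stdin.Write(line)
        }
        writeErr = err
    }
    // Closing stdin delivers EOF, which ends the fgets loop on the PHP side.
    closeErr := stdin.Close()
    waitErr := cmd.Wait()
    if writeErr != nil {
        return writeErr
    }
    if closeErr != nil {
        return closeErr
    }
    return waitErr
}
```

A caller would feed a channel of records and invoke something like pipeRecords(ctx, results, "php", "vvv-record-probe.php"), with the script path adjusted to wherever it lives on the host. Because the PHP script only commits every two hundred rows and at end of input, closing the channel at the end of each batch is what guarantees the final partial batch gets written.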

Catalog gauges and the lag metric

The probe counter tells us what is happening right now. The catalog gauges tell us how the population is shifting over time, and the lag gauge is the early warning signal we actually page on.

Every thirty seconds, a separate goroutine runs SELECT region, COALESCE(last_probe_result, 'unprobed') AS status, COUNT(*) FROM videos GROUP BY region, status, calls catalogSize.Reset() to clear the family, and re-populates it from the result set. A second query computes lag as strftime('%s','now') - MIN(COALESCE(last_probed_at, 0)) per region.

The Reset() call before re-populating the gauge family is important. If a region disappears from the catalog because we drop it from rotation, its old series would otherwise persist in the registry forever and pollute dashboards with stale data. New operators always miss this step and then wonder why their gauges never go to zero.
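
The lag expression has one edge case worth seeing in plain code. The sketch below mirrors the SQL for a single region in Go; lagSeconds is a hypothetical helper written for illustration, with a nil entry standing in for a NULL last_probed_at.

```go
package main

// lagSeconds mirrors now - MIN(COALESCE(last_probed_at, 0)) for one region.
// A nil entry stands for a NULL last_probed_at, meaning never probed.
func lagSeconds(now int64, lastProbedAt []*int64) int64 {
    if len(lastProbedAt) == 0 {
        return 0 // assumption: a region with no rows reports no lag
    }
    var oldest int64
    for i, ts := range lastProbedAt {
        v := int64(0) // COALESCE(NULL, 0)
        if ts != nil {
            v = *ts
        }
        if i == 0 || v < oldest {
            oldest = v
        }
    }
    return now - oldest
}
```

With every row probed, the result is the age of the stalest probe. A single never-probed row, however, collapses the minimum to zero and makes the lag equal to the entire Unix timestamp. That can be a deliberate choice, since it flags any new video left unprobed, but it also means the gauge jumps by orders of magnitude whenever a fresh batch is indexed. If that is not what you want, one option is to fall back to the row's insertion time instead of zero, assuming the schema records one.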

Alert rules that actually fire

Most teams write alerts on individual error counts and get paged at 3am for nothing. We page on rates with sustained windows, and we differentiate symptoms from causes.

groups:
  - name: video-probes
    interval: 30s
    rules:
      - alert: VideoUnavailableSpike
        expr: |
          sum by (region) (rate(viralvidvault_video_probe_total{result="unavailable"}[15m]))
          /
          sum by (region) (rate(viralvidvault_video_probe_total[15m]))
          > 0.08
        for: 20m
        labels:
          severity: warning
        annotations:
          summary: "Region {{ $labels.region }} unavailable rate above 8%"
          description: "Sustained for 20m - likely an uploader purge or API shift."

      - alert: ProbeLagBudgetExhausted
        expr: viralvidvault_video_probe_lag_seconds > 7200
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Probe queue is more than 2 hours behind in {{ $labels.region }}"

      - alert: ExporterDown
        expr: up{job="vvv-probe"} == 0
        for: 5m
        labels:
          severity: critical

The eight percent threshold is empirical. In steady state we sit around one and a half percent unavailable per region, and a viral cycle on TikTok or Instagram Reels - where European uploaders cross-post and delete the YouTube version twelve hours later - can push us briefly to four or five percent. Eight percent sustained for twenty minutes is real and worth a page.

The lag alert is the one I trust most. If the probe loop falls behind by more than two hours, our entire feed quality degrades regardless of what the individual probes report, because we are making editorial decisions on stale data.

Edge-side reality check from Cloudflare Workers

Server-side probing tells us what our datacenter sees. But our European users hit Cloudflare, and the Workers fronting the watch page can fail differently because of geo-routing weirdness. We added a small Worker that pings the oEmbed endpoint from each Cloudflare colo on a one-minute cron and pushes an edge-side viability signal back via Pushgateway.

export default {
  async scheduled(event, env, ctx) {
    const sample = await env.VIDEO_KV.get("hot_sample_de", { type: "json" });
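    // Skip the run when the sample is missing or empty, so we never
    // dereference undefined ids or push a 0/0 (NaN) rate.
    if (!sample || !Array.isArray(sample.ids) || sample.ids.length === 0) return;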

    const results = await Promise.all(
      sample.ids.map(async (id) => {
        const r = await fetch(
          `https://www.youtube.com/oembed?url=https%3A//youtu.be/${id}`,
          { cf: { cacheTtl: 0 } }
        );
        return r.status === 200 ? 1 : 0;
      })
    );

    const rate = results.reduce((a, b) => a + b, 0) / results.length;
    await fetch(`${env.PUSHGATEWAY}/metrics/job/edge_probe/region/DE`, {
      method: "POST",
      body: `viralvidvault_edge_viability ${rate}\n`,
    });
  }
};

When this gauge diverges from the server-side counter for more than thirty minutes, we know the issue is geo-routing rather than the videos themselves. Usually a Cloudflare colo is serving stale DNS for youtube.com or i.ytimg.com. That signal alone has saved us two false-positive incident reviews this year, including one where we were about to blame our recommendation algorithm for a Cloudflare PoP issue in Frankfurt.

GDPR notes that nobody else writes about

Because we are a European-facing site, probes have privacy implications people forget.

  • The probe is server-to-YouTube with no user PII involved. The logged fields are video ID, region code, HTTP status, duration. None of this is personal data under GDPR Article 4.
  • The User-Agent string identifies us with a contact URL pointing to our bots policy page. We get fewer Google abuse flags when crawl ops can reach a human.
  • Probe records are kept for thirty days, then aggregated into daily counts and the raw rows are dropped. We do not need raw probe history beyond debugging windows, and storing less is always cheaper than defending why you stored more.
  • The Cloudflare Worker runs in scheduled context, not in any user request path, so it cannot accidentally read cookies or KV bound to user sessions. This separation matters when the DPO asks whether your monitoring touches user data.

Document this on your /privacy page even if your DPO does not ask first. It comes up in audits, and writing it after the fact is much harder than writing it now.

What this caught in production

Six weeks of operating this exporter caught three real issues we would have missed.

  1. A two-hour outage where a single Polish ISP was returning forged 403 responses for YouTube oEmbed. The exporter spiked the geoblock rate to twenty-two percent for PL only, the regional alert fired, and we paused PL rotation for the window until the ISP fixed their middlebox.
  2. A schema drift where about eighteen hundred videos in our DE pool had stale region codes from a previous fetch run. The catalog gauges showed an unexpected dip in ok count, which drove us to a one-line migration fix that nobody would have found from logs.
  3. A YouTube API change that started returning 429 on the oEmbed endpoint above thirty requests per second. We caught it in the duration histogram - p99 jumped from 200ms to 8s - before any user actually saw a degraded feed. We dropped our concurrency cap from twenty to sixteen and the histogram returned to baseline within one scrape interval.

Wrap-up

The whole exporter is around three hundred and fifty lines of Go, runs in eighteen megabytes of resident memory next to LiteSpeed, and produces enough signal to drive both real-time pages and weekly editorial reviews. If you operate a video site of any size, the cost of building this is one afternoon, and it pays for itself the first time YouTube changes something silently. Start with two metrics - a probe counter and a duration histogram - and resist the urge to add labels you cannot justify with a real query. Cardinality is a one-way door.

Source: dev.to
