Using Apache AGE for Graph-Based Video Relationship Queries at Scale

Last quarter I hit a wall. Our recommendation query on TrendVidStream had grown into a 140-line SQL monstrosity with seven self-joins on a video_relations table, three correlated subqueries, and a GROUP BY clause that made the SQLite query planner cry. The goal was simple: given a video the user just watched in Singapore, find videos that are (a) watched by viewers who also watched this one, (b) from channels in the same regional cluster, (c) within two hops of a topical tag the user has engaged with, and (d) currently trending in at least one of our eight regions. In relational terms, that is a multi-hop traversal with filtering at every hop. In graph terms, it is a single Cypher query.

I run TrendVidStream, a multi-region video streaming discovery site that pulls from eight YouTube regional pools (US, GB, JP, KR, TW, SG, VN, TH, HK rotating) and serves PHP 8.4 over LiteSpeed with SQLite FTS5 as the primary store. SQLite is fantastic for read-heavy workloads, but graph traversals are not its sweet spot. I spent two weekends moving the relationship layer to Apache AGE — the PostgreSQL extension that adds openCypher to a relational database — and the rewrite turned a 380ms query into a 22ms one. This post is what I learned, including the parts that bit me.

Why Graph and Why AGE Specifically

The video discovery problem is fundamentally a graph problem. Videos have channels, channels have regions, videos have tags, viewers have sessions, sessions have co-watch edges. When you ask "what should I recommend next," you are walking edges, not scanning tables. I evaluated three options:

Neo4j: Mature, but adds a second database to deploy. My FTP-based deploy pipeline pushes to four shared LiteSpeed hosts. Running a JVM on shared hosting is not happening.
Memgraph: Faster than Neo4j on paper, but same operational overhead.
Apache AGE: PostgreSQL extension. I already run Postgres for the analytics warehouse that backs the cron jobs. Adding AGE is a CREATE EXTENSION away.

AGE won on operational simplicity. It is not the fastest graph database, but it lets me write Cypher against the same Postgres instance where the rest of my analytics live, which means I can JOIN graph results against regular tables in one query. That turned out to be a bigger deal than raw traversal speed.

The other thing I underestimated: AGE returns results as agtype, which is JSONB-flavored. You can cast graph query results directly into rows that PHP's PDO driver consumes without any client-side graph library. No extra dependency in composer.json.

The Schema That Replaced Seven JOINs

The relational schema I started with had videos, channels, tags, video_tags, video_relations (with a relation_type discriminator column), view_sessions, and session_videos. The graph version collapses this to four node labels and four edge types.

-- Bootstrap AGE on a fresh Postgres database
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

SELECT create_graph('tvs_graph');

-- Node labels
SELECT create_vlabel('tvs_graph', 'Video');
SELECT create_vlabel('tvs_graph', 'Channel');
SELECT create_vlabel('tvs_graph', 'Tag');
SELECT create_vlabel('tvs_graph', 'Region');

-- Edge labels
SELECT create_elabel('tvs_graph', 'PUBLISHED_BY');
SELECT create_elabel('tvs_graph', 'TAGGED');
SELECT create_elabel('tvs_graph', 'CO_WATCHED');
SELECT create_elabel('tvs_graph', 'TRENDS_IN');

-- Indexes on the property store (this matters a lot)
CREATE INDEX video_id_idx ON tvs_graph."Video"
  USING gin ((properties));
CREATE INDEX video_external_id_idx ON tvs_graph."Video"
  ((properties->>'youtube_id'));
CREATE INDEX channel_region_idx ON tvs_graph."Channel"
  ((properties->>'region'));

The properties column on each label table holds the node's properties as JSONB. Without the properties->>'youtube_id' btree index, Cypher MATCH clauses that filter by external ID do a full scan of the label table. I missed this on my first deploy and saw 800ms latencies that I had not seen in dev because dev only had 12k nodes. Production had 2.3M.

Loading Data From SQLite Into AGE

My cron jobs already write video metadata to SQLite. I needed a sync job that propagates inserts and edge changes to the graph. I wrote it in PHP since that's what the rest of the cron stack speaks, and it runs as STEP 6.5 in fetch_videos.php after the indexing step:

<?php
// cron/sync_graph.php — incremental sync SQLite -> AGE
declare(strict_types=1);

require __DIR__ . '/../app/bootstrap.php';

final class GraphSync {
    public function __construct(
        private \PDO $sqlite,
        private \PDO $pg,
        private string $graph = 'tvs_graph',
    ) {
        $this->pg->exec("LOAD 'age'");
        $this->pg->exec("SET search_path = ag_catalog, public");
    }

    public function syncVideosSince(int $sinceTs): int {
        $stmt = $this->sqlite->prepare(
            'SELECT youtube_id, title, channel_id, region, view_count, published_at
             FROM videos WHERE updated_at >= :ts'
        );
        $stmt->execute([':ts' => $sinceTs]);

        $count = 0;
        $this->pg->beginTransaction();

        $cypher = $this->pg->prepare(<<<SQL
            SELECT * FROM cypher('tvs_graph', \$\$
                MERGE (v:Video {youtube_id: \$id})
                SET v.title = \$title,
                    v.view_count = \$views,
                    v.region = \$region,
                    v.published_at = \$published
                WITH v
                MERGE (c:Channel {channel_id: \$channel})
                MERGE (v)-[:PUBLISHED_BY]->(c)
                RETURN v.youtube_id
            \$\$, \$1) AS (id agtype)
        SQL);

        while ($row = $stmt->fetch(\PDO::FETCH_ASSOC)) {
            $params = json_encode([
                'id'        => $row['youtube_id'],
                'title'     => $row['title'],
                'views'     => (int) $row['view_count'],
                'region'    => $row['region'],
                'published' => $row['published_at'],
                'channel'   => $row['channel_id'],
            ], JSON_THROW_ON_ERROR);

            $cypher->execute([$params]);
            $count++;

            if ($count % 500 === 0) {
                $this->pg->commit();
                $this->pg->beginTransaction();
            }
        }

        $this->pg->commit();
        return $count;
    }
}

$sync = new GraphSync(db_sqlite(), db_postgres());
$since = (int) (file_get_contents(STATE_DIR . '/last_graph_sync') ?: 0);
$n = $sync->syncVideosSince($since);
file_put_contents(STATE_DIR . '/last_graph_sync', (string) time());
fprintf(STDERR, "synced %d nodes\n", $n);

Three things from this code that took me a while to figure out:

Parameters go through agtype casting. You pass them as a single JSON object in $1, and AGE deserializes them. You cannot do $1, $2, $3 like normal Postgres. This is annoying but consistent.
MERGE is your friend. It is the upsert primitive. Using CREATE would fail on the second run; using MATCH then CREATE is two round-trips.
Commit every 500 rows. AGE holds graph locks during a transaction. A 50k-node sync in one transaction will block reads. I learned this the hard way during a Friday afternoon backfill that made the homepage hang.

The Query That Started All This

Here is the recommendation query that motivated the whole migration. The user just watched video dQw4w9WgXcQ in the SG region. We want twelve recommendations ranked by a co-watch score, restricted to channels in the same regional cluster (SG, HK, TW, JP), with at least one tag overlap, and currently trending in any region.

SELECT * FROM cypher('tvs_graph', $$
    MATCH (seed:Video {youtube_id: 'dQw4w9WgXcQ'})
          -[:TAGGED]->(t:Tag)
          <-[:TAGGED]-(candidate:Video)
          -[:PUBLISHED_BY]->(c:Channel)
    WHERE c.region IN ['SG', 'HK', 'TW', 'JP']
      AND candidate.youtube_id <> 'dQw4w9WgXcQ'
    WITH seed, candidate, count(DISTINCT t) AS tag_overlap
    MATCH (candidate)-[cw:CO_WATCHED]-(seed)
    WITH candidate, tag_overlap, cw.weight AS cowatch_weight
    MATCH (candidate)-[:TRENDS_IN]->(r:Region)
    WITH candidate,
         tag_overlap,
         cowatch_weight,
         count(DISTINCT r) AS trending_regions
    RETURN candidate.youtube_id AS id,
           candidate.title AS title,
           tag_overlap,
           cowatch_weight,
           trending_regions,
           (tag_overlap * 2 + cowatch_weight + trending_regions) AS score
    ORDER BY score DESC
    LIMIT 12
$$) AS (id agtype, title agtype, tag_overlap agtype,
        cowatch_weight agtype, trending_regions agtype, score agtype);

The relational equivalent was the 140-line monster I mentioned. The Cypher version is 22 lines, reads top-to-bottom like a recipe, and the query planner actually understands what I am asking for. On a 2.3M-node graph with 18M edges, this runs in 22ms cold and 8ms warm. The old SQL ran in 380ms warm and would occasionally spike to 1.2 seconds when SQLite picked the wrong join order.

One thing about that query: notice the two WITH clauses. They are the Cypher equivalent of CTEs, and they let the planner project down the result set early. Without the first WITH seed, candidate, count(DISTINCT t), the second MATCH clause would expand the candidate set across all co-watch edges before filtering, and you would see a 10x slowdown. Cypher gives you the rope; you have to know where to tie it off.

Edge Weight Maintenance Without a Lock Storm

The CO_WATCHED edge has a weight property that gets incremented every time two videos appear in the same session. This is the kind of write that, naively implemented, will lock half your graph. My first attempt did one MERGE per pair per session and ground to a halt at around 80 sessions per second.

The fix was batching in Go, which I use for the session ingestion service that sits in front of the analytics writer:

package cowatch

import (
    "context"
    "database/sql"
    "fmt"
    "strings"
    "time"
)

type Pair struct {
    A, B   string
    Weight int
}

type Batcher struct {
    db       *sql.DB
    pending  map[[2]string]int
    lastFlush time.Time
}

func NewBatcher(db *sql.DB) *Batcher {
    return &Batcher{db: db, pending: make(map[[2]string]int)}
}

func (b *Batcher) Add(a, c string) {
    if a > c {
        a, c = c, a // canonical ordering — edge is undirected
    }
    b.pending[[2]string{a, c}]++
}

func (b *Batcher) Flush(ctx context.Context) error {
    if len(b.pending) == 0 {
        return nil
    }

    var pairs []string
    for key, weight := range b.pending {
        pairs = append(pairs,
            fmt.Sprintf(`{"a":"%s","b":"%s","w":%d}`,
                key[0], key[1], weight))
    }

    query := fmt.Sprintf(`
        SELECT * FROM cypher('tvs_graph', $$
            UNWIND $batch AS row
            MATCH (v1:Video {youtube_id: row.a})
            MATCH (v2:Video {youtube_id: row.b})
            MERGE (v1)-[e:CO_WATCHED]-(v2)
            ON CREATE SET e.weight = row.w
            ON MATCH SET e.weight = e.weight + row.w
            RETURN count(e)
        $$, $1) AS (n agtype);`)

    params := `{"batch":[` + strings.Join(pairs, ",") + `]}`
    if _, err := b.db.ExecContext(ctx, query, params); err != nil {
        return fmt.Errorf("flush co-watch batch: %w", err)
    }

    b.pending = make(map[[2]string]int)
    b.lastFlush = time.Now()
    return nil
}

The canonical ordering on edge endpoints (if a > c, swap) is the trick that prevents duplicate edges from getting created when sessions arrive in different orders. AGE does not have a MERGE variant that recognizes an undirected edge regardless of endpoint order — you have to enforce ordering in your code. I batch flushes every 30 seconds or every 1000 pairs, whichever comes first. Lock contention is now a non-issue.

What Breaks and How I Caught It

A week after rolling this to production I noticed our SG region was serving recommendations that included a few Russian-language videos. The graph had no notion of language, only region. Channels published from accounts registered in SG can publish in any language, and a few RU creators had set up SG-registered channels to game the regional algorithm.

The fix was adding a language property to the Video node during sync, and a WHERE clause in the recommendation query. But the deeper lesson was about graph schema discipline: every property you do not model becomes a property you cannot filter on later. I now have a checklist when adding a new node type:

What properties do I need for filtering today?
What properties might I need for filtering in 6 months?
What edges connect this to existing labels?
What is the cardinality of those edges? (CO_WATCHED is dense; PUBLISHED_BY is sparse.)

For dense edges, you need to think about pruning. We keep CO_WATCHED edges with a weight of 1 for only 14 days; anything still at weight 1 after 14 days gets deleted. Otherwise the graph balloons with noise.

SELECT * FROM cypher('tvs_graph', $$
    MATCH ()-[e:CO_WATCHED]-()
    WHERE e.weight = 1
      AND e.created_at < datetime() - duration({days: 14})
    DELETE e
$$) AS (e agtype);

This runs as a weekly cron job and typically deletes 200-400k stale edges. The graph size has been stable for two months now.

Backup and Disaster Recovery

One thing I want to flag because it bit a friend who deployed AGE around the same time: pg_dump does not dump AGE graphs by default. The graph data lives in tables under the ag_catalog schema, and you need to explicitly include it.

# Wrong — misses graph data
pg_dump -h prod-db -U tvs tvs_prod > backup.sql

# Right — includes ag_catalog
pg_dump -h prod-db -U tvs \
  --schema=ag_catalog \
  --schema=tvs_graph \
  --schema=public \
  tvs_prod > backup.sql

I also keep a logical replication slot to a standby Postgres that mirrors both schemas. Recovery from the standby is about 4 minutes for our 2.3M-node graph. I tested this once. Test yours too.

What I Would Do Differently

If I were starting over, I would not put the entire video metadata in the graph. I would keep video properties in regular Postgres tables and use the graph only for IDs and edges. AGE is great at traversals but mediocre at returning large property sets. The pattern that works best is: traverse in Cypher, get back a list of IDs, then JOIN those IDs against a regular relational table to hydrate the properties.

WITH ids AS (
    SELECT (id #>> '{}')::text AS video_id
    FROM cypher('tvs_graph', $$
        MATCH (v:Video)-[:TAGGED]->(:Tag {name: 'documentary'})
        RETURN v.youtube_id LIMIT 50
    $$) AS (id agtype)
)
SELECT v.* FROM video_metadata v
JOIN ids ON v.youtube_id = ids.video_id;

This pattern keeps the hot path in the graph small and pushes the wide rows to the relational side, where Postgres indexes and TOAST do their normal thing. My p99 query latency dropped another 30% after this refactor.

Conclusion

Apache AGE is not a Neo4j replacement. It is a pragmatic choice when you already have Postgres and need graph queries as a small fraction of your workload. For TrendVidStream, it replaced the worst query in the codebase with one that reads like English, runs 17x faster, and is easier to extend. The operational cost was zero new daemons, one extension, and about a weekend of schema design.

The big wins were not the speed numbers. They were the queries I now write that I would never have attempted before — three-hop tag traversals, channel cluster analysis, cross-region trending overlap — because they finally feel cheap. When a query is cheap, you ask better questions.

If you are running a smallish Postgres and have a relationship-heavy slice of your domain, AGE is worth a Saturday afternoon. Just remember the property indexes, the batch flushing, the schema-aware backups, and the canonical edge ordering. Those four things will save you my weekends.