Most discussions of TypeScript/Node vs Go concurrency stop at the surface: Node is async, Go is threaded. That framing isn't wrong — it just isn't deep enough to be useful when you're picking a runtime, debugging a tail-latency problem, or explaining to your team why one of the services keeps falling over under CPU load.
The real difference is not async vs threaded. It's a question about where, in the system, suspended work lives — and what shape it takes when it's resumed.
tl;dr — Both Node and Go refuse to let the CPU sit idle while a request waits on I/O. They disagree on the unit of scheduling. Node's unit is the continuation — the tail of an async function captured as a heap closure. Go's unit is the goroutine — a full call stack the runtime can suspend and resume in user space. That single decision cascades into every other property of each runtime.
## The Wrong Question
"Async vs threaded" is the wrong frame because it makes you think the choice is between paradigms. It isn't. Both runtimes have already made the same fundamental decision: do not block an OS thread waiting for slow external work. The interesting choice is how they implement that.
The actually useful question is:
When a request is waiting for I/O — for a database, an HTTP call, a Redis round-trip, a file read — what does the CPU do, and where does the suspended state of that request live?
Once you frame it that way, Node and Go aren't opposites. They're two answers to the same question — and each answer cascades into a different language shape, a different library style, and a different failure mode under load.
The naive blocking model answers the question with "an OS thread waits for the syscall to return." That model collapses around a few thousand concurrent connections — memory per thread, scheduler overhead, kernel context-switch cost. By 40,000 connections you're out of RAM, not CPU. Node and Go both refuse to do this. They diverge on which resource gets freed up and how the suspended work is captured for later resumption.
## Node's Answer: Turn Waiting Into an Event
Node's model can be summarized in one line: the JS main thread only executes code that's already ready to run.
Look at this:
```js
const user = await db.getUser(id);
return user;
```
It reads as if the function is paused, blocking on the database. It isn't. Here's what V8 effectively does when it compiles an async function: it rewrites the body into a state machine, with each `await` becoming a state transition.
The function above gets transformed into something equivalent to:
```js
function asyncFn() {
  const promise = new Promise((resolve) => {
    let state = 0;
    const closure = {}; // heap object holding locals
    function step(value) {
      switch (state) {
        case 0:
          state = 1;
          db.getUser(id).then(step); // await → register continuation
          return; // ← function POPS here
        case 1:
          closure.user = value; // resume: locals live in closure
          resolve(closure.user);
          return;
      }
    }
    step();
  });
  return promise;
}
```
Three things to notice:
- `await` is not a pause. It's the point at which V8 returns from the function and pops the JS stack frame. The "rest of the function" is captured as a continuation registered on the awaited Promise via `.then`.
- Local variables move to the heap. Because the stack frame is gone, locals (`user` here) live in a heap closure, accessible only when the state machine resumes.
- Each `await` slices the function into another state. A function with two `await`s runs in three event-loop turns, with three independently-pushed JS frames, with all live state stored in heap closures between them.
That third point is the most non-obvious. A single async function is not one unit of execution — it's a sequence of fresh frames separated by event-loop turns.
There is no "paused" function. There are only captured continuations and fresh frames that resume them. The event loop is the dispatcher: it watches for I/O readiness via libuv, for resolved Promises (via V8's microtask queue), for timers — and pulls the corresponding continuation onto the JS thread when it's ready to run. One thread can manage tens of thousands of concurrent connections, because at any moment only a handful of them have work that's actually ready.
This is event-driven concurrency in its precise sense — the runtime turns "waiting" into a registered event, and only resumes the captured continuation when the event fires.
## The Visible Side Effect: Function Color
Because the suspension point has to be marked at compile time, async-ness becomes part of the function's type. A function that does I/O returns Promise<T>. Its callers must await it. Once they await, they themselves return Promise<T>. The "color" propagates up the call stack until you hit an async-aware entry point — typically the top of an HTTP handler or the event loop itself.
Bob Nystrom named this the function color problem in 2015. It's not a notation choice — it's a logical consequence of the stackless coroutine model. V8 cannot save and restore arbitrary JS call stacks. The only way to express suspension is "return a Promise and be marked async," and once one function does that, every function on the way up has to do the same.
## The Hard Limit
The model fails the moment your code stops waiting. A single CPU-bound operation:
```js
while (true) { /* heavy work */ }
```
…holds the JS main thread, and every other request on this process is dead until it returns. The event loop has nowhere else to go. Worker threads, child processes, or splitting CPU work into a separate service are real fixes, but they're escape hatches — they exist because the core model has only one main thread executing JS, and there is exactly one of it.
## Go's Answer: Move Context Switching Into User Space
Go writes synchronous code:
```go
user := db.GetUser(id)
sendResponse(user)
```
There is no await. There is no callback. The function looks like it blocks on the database. And yet the program scales to hundreds of thousands of concurrent operations on modest hardware.
The trick is that the scheduling boundary has been moved. Where Node has the programmer mark the suspension point with await and the runtime captures a continuation, Go lets the programmer write straight-line code and has the runtime suspend the entire goroutine when it hits a blocking I/O call.
This is the central insight, and the cleanest one-line statement of Go's concurrency model:
Go's essence is the user-space-ification of context switching.
A goroutine isn't an OS thread. It's a small (initially 2 KB) growable stack and a register snapshot, managed by the Go runtime. The runtime maps a large number of goroutines (G) onto a small number of OS threads (M) using scheduling contexts (P). This is the GMP model:
- G — a goroutine. The unit of scheduling. Cheap to create, cheap to suspend.
- M — an OS thread. Usually only `GOMAXPROCS` of them.
- P — a scheduling context. Decides which G runs on which M.
many G → Go scheduler → few M → CPU cores
When a goroutine hits a blocking syscall or a channel wait, the Go runtime suspends the goroutine — saves its stack and registers — detaches it from the current M, and schedules another runnable goroutine onto that M. When the original goroutine's wait completes, it's marked runnable again, and some M eventually picks it up and resumes execution from the suspension point. None of this enters the kernel. No clone(2), no kernel-mediated thread switch, no kernel scheduler queue. The bookkeeping is all in user space.
That's the user-space-ification. The CPU still has to switch contexts when work shifts between goroutines, but the cost is roughly a function call plus a stack swap — not a kernel-mediated thread switch.
The key contrast with Node's model is in where the suspended state lives:
In Node, the JS call stack is shared and almost always near-empty — every async function in flight has already popped, with its state sitting in a heap closure. In Go, every goroutine owns its full call chain on its own heap-allocated stack; suspended goroutines look like frozen frames waiting for the runtime to resume them on some OS thread.
This is also why neither language can simply borrow the other's model. Node runs on V8, which was designed in 2008 for browser JS — single call stack, synchronous semantics, no concept of saving stacks across yields. Adding stackful coroutines would mean rewriting the engine, which is roughly what Java's Project Loom did to the JVM at huge cost. Go was designed from scratch with a runtime that owns stacks, can grow them, and can save them. The choice is locked in by runtime architecture, not language taste.
## What "User-Space" Actually Buys You
The slogan only matters if user-space context switching is meaningfully cheaper than the kernel-mediated kind. It is — by more than an order of magnitude.
The setup: two goroutines pinned to one OS thread (`GOMAXPROCS=1`), ping-ponging via `runtime.Gosched()` and via an unbuffered channel; versus two pthreads pinned to one core (`taskset -c 0`), ping-ponging via `pthread_mutex` + `pthread_cond`. (Reproduction code at the end of the post.)
Measured on Intel N100, Ubuntu 24.04 (kernel 6.8.0), Go 1.23.4, gcc 13.3:
| Operation | ns / switch |
|---|---|
| Goroutine yield (`runtime.Gosched`, GOMAXPROCS=1) | ~102 ns |
| Goroutine round-trip via unbuffered channel | ~436 ns (≈218 ns per G-switch + channel coordination) |
| pthread switch (mutex+cond ping-pong, single core) | ~2,900 ns (range 2,818–3,611 across 5 runs of 2M iterations) |
Ratio: roughly 28× cheaper for the bare scheduler yield, ~13× cheaper for the apples-to-apples synchronized round-trip.
Where the gap comes from:
- Mode switch. The user → kernel → user round-trip alone is ~100 ns of entry/exit and ABI-mandated register save/restore. A goroutine switch never crosses that line.
- Scheduler work in kernel space. Linux CFS maintains a red-black tree of runnable threads with locked, cross-CPU runqueues. The Go scheduler does the same job in user space with per-P local runqueues and lock-free fast paths — and skips the kernel locks entirely.
- Cache and TLB effects. A kernel scheduler may migrate a thread to a different core, costing you cold L1/L2 and an instruction-cache reload. Goroutines normally stay on the same M, so the cache stays warm.
What the model does not buy you: a goroutine that makes a real blocking syscall still pays for a real OS thread switch — the runtime detaches the G from its M and may spin up another M so the rest of the goroutines keep running. Async preemption (Go 1.14+, signal-based) is the runtime's answer to tight loops that never yield, and it has its own cost. Once you saturate GOMAXPROCS, the user-space runqueue itself starts to show up in profiles.
The "user-space-ification" buys you cheap G-to-G switching on a hot M. That's where the order-of-magnitude lives. The syscalls, the M-to-M handoffs, the actual kernel work — those are still as expensive as they always were. The model wins by making the common case — many concurrent goroutines, mostly waiting, occasionally running — almost free.
(N100 is a low-power Alder Lake-N E-core; absolute numbers will be smaller on a server-class Xeon or EPYC, but the ratio is expected to hold.)
## The Unit of Scheduling
The cleanest comparison is to ask what each runtime actually schedules:
| | Node / TypeScript | Go |
|---|---|---|
| Unit of scheduling | callback / Promise continuation | goroutine |
| What's captured at suspension | tail of an async function as a heap closure | full call stack + registers |
| How code looks | explicit `async`/`await` | straight-line synchronous |
| Suspension marked by | the programmer (`await`) | the runtime (any blocking op) |
| Suspended state lives in | V8 microtask queue + heap closure | goroutine stack on the user-space heap |
| Kernel involvement | epoll/kqueue/IOCP via libuv | epoll/kqueue/IOCP via netpoller |
| CPU parallelism | one main JS thread; needs workers/cluster for cores | M:N scheduler runs goroutines across cores natively |
| Function color | yes (Promise infects up the call stack) | no (any function may block) |
| What breaks under CPU load | the entire event loop | nothing — scheduler runs another G on another M |
The two columns describe deeply different mental models, but they belong to the same family. They are both user-space concurrency runtimes that avoid kernel thread-per-request. They differ in where the suspension is captured (the language vs. the call stack) and how broad the scheduler's mandate is.
## Where the Boundaries Diverge: CPU-Bound Work
Node and Go look interchangeable on I/O-bound workloads. They diverge sharply the moment CPU work enters the picture.
Node's event loop has one job: dispatch ready callbacks onto a single JS thread. If a callback runs for 200 ms doing JSON parsing or hashing, the loop is frozen for those 200 ms. Every other suspended continuation has to wait. Throughput collapses.
Go's runtime has a different mandate. It doesn't only manage waiting — it also manages execution. If you spawn:
```go
go task1()
go task2()
go task3()
```
…the scheduler is happy to put each goroutine on a different M, run them on different cores in true parallel, and preempt long-running goroutines so they don't starve the rest of the runtime. CPU-bound goroutines aren't a special case to work around. They're just goroutines.
That's why Go's concurrency model covers more ground:
Node's model mainly solves non-CPU-bound concurrency — network I/O, database waits, downstream API calls. Go's model solves I/O waiting and CPU parallelism with the same primitive.
This isn't a knock on Node. The event loop is brilliant at what it's designed for: lots of slow waits, light per-request CPU. It's the natural shape of API gateways, BFFs, websocket hubs, real-time aggregation, and most of the JSON-shuffling that makes up modern web backends. But sustained CPU work, mixed CPU + I/O pipelines, long-lived infrastructure services — those are workloads where Go's scheduler-driven model has more headroom built in.
## Two Answers to the Same Question
Strip away the implementation details and the two runtimes are answering the same question with different abstractions:
Concurrency at scale is the problem of what to do with the CPU while a request waits on I/O.
Node's answer: turn the wait into an event, capture the rest of the function as a continuation, resume the continuation when the event fires. One thread cycling through ready continuations.
Go's answer: run the request on a goroutine, suspend the goroutine in user space when it blocks, schedule another runnable goroutine onto the OS thread, resume the original when its wait completes.
Two ways of eliminating the same waste: Node state-machines it into events; Go lowers the cost of context switching far enough that you can afford to keep one execution flow per request.
But there's a deeper layer worth surfacing. The two answers also disagree about whether suspension should be visible in the type system. Node says yes — Promise<T> is part of the signature, async is part of the contract, function color propagates. Go says no — any function may block, and the type doesn't carry that information.
This visibility-vs-uniformity trade-off shows up far beyond Node and Go. It's the same shape as monadic IO vs implicit IO in Haskell, checked vs unchecked exceptions in Java, capability-based security vs ambient authority. Each pair makes the same trade: composable static reasoning vs ergonomic uniform code. Node and Go are picking sides of a much bigger question.
You see the consequence in the libraries. Node libraries publish fs.readFile and fs.readFileSync, two retry helpers (one for sync ops, one for async), p-limit-style bounded-concurrency wrappers around Promise.all. Go libraries publish os.ReadFile (one function), one Retry(op func() error, n int) error, twenty lines of chan + WaitGroup for bounded concurrency. The Go versions aren't simpler because Go developers are smarter — they're simpler because the runtime hides the same complexity that Node's type system insists on exposing.
## The Closing Line
If you remember one thing from this:
Node turns waiting into events. Go turns execution flows into schedulable units. Both refuse to let the CPU sit idle while I/O blocks — they just disagree on what the unit of scheduling should be.
Or, if you want the deeper layer:
Node makes "this function might suspend" visible at the type level. Go makes it invisible.
That's the whole story. Everything else — await vs go, libuv vs the netpoller, V8's microtask queue vs GMP, single-thread bottleneck vs CPU-bound resilience, libraries that look complicated vs libraries that look simple — falls out of that one disagreement.
## Appendix: Reproduce the Benchmark
`goroutine_switch_test.go` — run with `GOMAXPROCS=1 go test -bench=. -benchtime=5s -count=5`:

```go
package bench

import (
	"runtime"
	"sync"
	"testing"
)

// Channel ping-pong: each iter is a full round-trip = 2 G-switches.
func BenchmarkGoroutineSwitchChannel(b *testing.B) {
	ch := make(chan struct{})
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-done:
				return
			case <-ch:
				ch <- struct{}{}
			}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- struct{}{}
		<-ch
	}
	b.StopTimer()
	close(done)
}

// Bare scheduler yield. Each iter ≈ 1 G-switch.
func BenchmarkGoroutineSwitchGosched(b *testing.B) {
	var wg sync.WaitGroup
	wg.Add(1)
	half := b.N / 2
	go func() {
		for i := 0; i < half; i++ {
			runtime.Gosched()
		}
		wg.Done()
	}()
	b.ResetTimer()
	for i := 0; i < half; i++ {
		runtime.Gosched()
	}
	wg.Wait()
}
```
`pthread_switch.c` — build and run with `gcc -O2 -o pthread_switch pthread_switch.c -lpthread && taskset -c 0 ./pthread_switch 2000000`:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static volatile int turn = 0;
static long iters;

static void *worker(void *arg) {
    int my_turn = (int)(intptr_t)arg;
    pthread_mutex_lock(&mu);
    for (long i = 0; i < iters; i++) {
        while (turn != my_turn) pthread_cond_wait(&cv, &mu);
        turn = 1 - my_turn;
        pthread_cond_broadcast(&cv);
    }
    pthread_mutex_unlock(&mu);
    return NULL;
}

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

int main(int argc, char **argv) {
    iters = (argc > 1) ? atol(argv[1]) : 1000000L;
    pthread_t t0, t1;
    double start = now_ns();
    pthread_create(&t0, NULL, worker, (void *)(intptr_t)0);
    pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    double end = now_ns();
    printf("ns / switch: %.1f\n", (end - start) / (2.0 * iters));
    return 0;
}
```
GOMAXPROCS=1 forces both goroutines onto the same M so we measure pure G-to-G switching, not cross-core migration. taskset -c 0 pins both pthreads to one CPU so they actually have to context-switch (otherwise they run in parallel on two cores and there is nothing to measure). Both benches do the simplest possible synchronized hand-off — no I/O, no real work — so what is left is the cost of the switch itself.