SIMD in Go


SIMD in Go — Cryptographic XOR Benchmark

Demonstrates Go 1.26's simd/archsimd package applied to the core operation behind stream ciphers (AES-CTR, ChaCha20, OTP): XOR of a plaintext buffer with a keystream.

What's benchmarked

| Implementation | Strategy | Instructions |
|---|---|---|
| XORScalar | Byte-by-byte loop | XOR r8, r8 |
| XORSimd256 | 32 bytes/iteration via AVX2 | VPXOR ymm, ymm, ymm |
| XORSimd256Unrolled | 128 bytes/iteration (4× unrolled) | VPXOR ymm per iter |

Run (macOS with Docker)

./run.sh

Expected speedup

On x86_64 emulated via Docker on Apple Silicon (native x86_64 numbers will be higher):

  • Small payloads (256B): ~5-10× faster with SIMD
  • Large payloads (1MB+): ~15-25× faster with SIMD (memory-bandwidth limited)

On native x86 hardware (e.g. Emerald Rapids Xeon), expect even better numbers, since VPXOR has 1-cycle latency and can sustain a throughput of three per cycle across ports p0/p1/p5.

uops.info reference

For the presentation, show the Emerald Rapids measurements for:

  • VPXOR (YMM, YMM, YMM) — the instruction behind Uint8x32.Xor() - link
  • VAESENC (XMM, XMM,

Source: dev.to
