SIMD in Go — Cryptographic XOR Benchmark

Demonstrates Go 1.26's simd/archsimd package applied to the core operation behind stream ciphers (AES-CTR, ChaCha20, OTP): XOR of a plaintext buffer with a keystream.

What's benchmarked

| Implementation | Strategy | Instructions |
| --- | --- | --- |
| `XORScalar` | Byte-by-byte loop | `XOR r8, r8` |
| `XORSimd256` | 32 bytes/iteration via AVX2 | `VPXOR ymm, ymm, ymm` |
| `XORSimd256Unrolled` | 128 bytes/iteration (4× unrolled) | 4× `VPXOR ymm` per iter |

Run (macOS with Docker)

./run.sh

Expected speedup

On x86_64 emulated via Docker on Apple Silicon (numbers will be higher on native hardware):

Small payloads (256B): ~5-10× faster with SIMD
Large payloads (1MB+): ~15-25× faster with SIMD (memory-bandwidth limited)

On native x86 hardware (e.g. an Emerald Rapids Xeon), expect larger speedups: VPXOR has 1-cycle latency and up to three can execute per cycle, on ports p0, p1 and p5.
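To put the throughput figure in perspective, the ALU-side ceiling can be worked out directly: three 32-byte XORs per cycle is 96 bytes per cycle. The sketch below does that arithmetic; the 3.0 GHz clock is an assumed round number for illustration, not a measured Emerald Rapids frequency, and `peakXORBytesPerCycle` is a hypothetical helper.

```go
package main

import "fmt"

// peakXORBytesPerCycle returns the upper bound on bytes XORed per cycle
// if only the vector ALU ports were the limit (loads and stores ignored).
func peakXORBytesPerCycle(xorsPerCycle, vectorBytes int) int {
	return xorsPerCycle * vectorBytes
}

func main() {
	perCycle := peakXORBytesPerCycle(3, 32) // 3 VPXOR/cycle, 32-byte YMM registers

	// Assumed clock frequency, chosen only to make the arithmetic concrete.
	const assumedHz = 3.0e9

	fmt.Printf("%d bytes/cycle, %.0f GB/s at %.1f GHz\n",
		perCycle, float64(perCycle)*assumedHz/1e9, assumedHz/1e9)
}
```

Under these assumptions the XOR units alone could consume 288 GB/s. Each XOR also needs two loads and one store, so in practice the load/store path and memory bandwidth are likely to bind first, which is consistent with the large-payload case being described as memory-bandwidth limited.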

uops.info reference

For the presentation, show the Emerald Rapids measurements for:

VPXOR (YMM, YMM, YMM): the instruction behind Uint8x32.Xor()
VAESENC (XMM, XMM,…

Source: dev.to