Of parsing decimal integer representations, instruction-level parallelism, short-circuiting overflow detection and AVX2. (#300)
Data is an inherently metaphysical description of state whose many forms are unified by contextually appropriate understanding thereof. Whereas the most blatant display of this contextual dependence is observed when representations are not shielded from ignorant physical tampering (one need only think of someone cutting a punched card in two arbitrarily and gluing both parts next to each other in reverse order), these forms can also exist embedded in arrangements that are themselves metaphysical.
Such is the case in the stringly-typed Web, where the lines between ordinals in all their Platonicity, registers etched in silicon and the every-day written word blur, a uint64 indistinguishable from the 8-bit-a-byte ASCII encoding of a minimal-length decimal. No wonder that in this environment the minutiae of what it means to be a string, tightly coupled with this permeability of form, become fertile ground to eke out unrestingly drying-up drops of performance.
Touching string implementations is intricate, with expected pay-offs somewhere around −1 % (Ormrod 2016, i) and close-knit communication with the entire system’s allocator being essential. (Alexandrescu 2012)
GCC 5.1 chose in 2015 (GNU 2024) to employ small buffer optimization, here called short-string optimization and henceforth abbreviated to SSO, for libstdc++ (GNU 2025). SSO optimizes for the case of many string objects being live at once, where cache page misses start to outweigh the bulkier string handles’ cache line occupation. (Ormrod 2016, iii) Despite these tangled considerations at play when opting for SSO, many implementations’ convergence on it has prompted some authors to portray it as decisively superior. (Stroustrup 2018, i)
Being too provocative with one’s design here also risks heavy instability burdens (Ormrod 2016, ii). All of the above, paired with the open question of how Go’s independently evolving memory management story (Knyszek and Clements 2025) would interplay with SSO, are likely factors in The Go Project’s flagship compiler’s resolute avoidance of this implementation strategy. (Cox “comment on Go issue 18894” 2017)
Less daunting than replacing the string backing, the translation step between textual decimals and register-sized uint64s offers a more promising opportunity for “the silicon [not to] stay dark” (Alexandrescu 2012, i): implementations based on a left-to-right reading order, like Go 1.26.1’s strconv.ParseUint, thread every digit’s contribution through one serially dependent accumulator.
Attempting to write an instruction-level-parallelism-friendly, henceforth abbreviated to ILP-friendly, problem presentation for strconv.ParseUint led me to a divergence between its documented and its actual behaviour.
“If s is empty or contains invalid digits, err.Err = ErrSyntax (...)”, the documentation promises. Yet:
// Cf. https://jfrech.de/blog/300 (accessed 2026-04-06)
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	x, err := strconv.ParseUint(strings.Repeat("9", 20)+"nine", 10, 64)
	fmt.Println(x)
	fmt.Println(err)
}
// go version go1.26.1 linux/amd64
// Output:
// 18446744073709551615
// strconv.ParseUint: parsing "99999999999999999999nine": value out of range
Overflow reporting takes priority over reading the full string (Go “internal/strconv” 2026), myopically declaring a whole slew of syntactical malformations as out of range. Introduced in late 2008 and thus over seventeen years old, this behaviour predates Go 1 by over three years. One commit prior, invalid digits were always rejected, but overflow detection was still stubbed out. (Cox “lib/strconv” 2008)
There isn’t much use of str
Json/
It is also highly data-dependent, as is strconv.ParseUint.
From a correctness point of view, I find it astonishing how an often-touched (Griesemer “comment on Go issue 31197” 2019) (Neil “Go issue 46641” 2021) (Ulen “Go issue 21275” 2017) (Ulen “Go issue 21278” 2017), presumably deliberately straightforwardly written routine’s undescribed behaviour escaped notice for over seventeen years, with a witness test case quietly sitting but a few packages down the street.
I’m hesitant about wanting this behaviour changed. On the one hand, it looks to me like an intricately confusing bug on which I see no plausibly useful dependence. On the other hand, I’ve gotten too jaded about backwards incompatibility to exclaim “let’s fix the bug” myself. (Cox “comment on Go issue 21278” 2017) See Go issue 78546, which tracks that question. (F. “Go issue 78546” 2026)
Fuzzing differentially against the standard library was the technique that enriched a mere exercise in applying loop unrolling and ILP-friendly operation order (Alexandrescu 2012) into a semantic verification of a part of the Go standard library.
Apart from the introductorily mentioned application of AVX2 intrinsics, the two classical performance techniques I used are representing the computation in a way favourable to ILP, and loop unrolling; both together are named Pdui64_Unroll_Appr02.
benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|b:Unroll_Appr02-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ b:Unroll_Appr02 │
│ sec/op │ sec/op vs base │
Pdui64-8 37.06m ± 0% 19.99m ± 0% -46.06% (p=0.000 n=30)
GOEXPERIMENT=simd go doc -src Pdui64_Unroll_Appr02
x := 0 +
uint64(digits[0]-'0')*10000000000000000000 +
uint64(digits[1]-'0')*1000000000000000000 +
uint64(digits[2]-'0')*100000000000000000 +
uint64(digits[3]-'0')*10000000000000000 +
uint64(digits[4]-'0')*1000000000000000 +
uint64(digits[5]-'0')*100000000000000 +
uint64(digits[6]-'0')*10000000000000 +
uint64(digits[7]-'0')*1000000000000 +
uint64(digits[8]-'0')*100000000000 +
uint64(digits[9]-'0')*10000000000 +
uint64(digits[10]-'0')*1000000000 +
uint64(digits[11]-'0')*100000000 +
uint64(digits[12]-'0')*10000000 +
uint64(digits[13]-'0')*1000000 +
uint64(digits[14]-'0')*100000 +
uint64(digits[15]-'0')*10000 +
uint64(digits[16]-'0')*1000 +
uint64(digits[17]-'0')*100 +
uint64(digits[18]-'0')*10 +
uint64(digits[19]-'0')*1 +
0
Whereas I at first blush assumed the two to be compounding (Alexandrescu 2012), separating Pdui64_Unroll_Appr02 into its constituent techniques shows they are not.
benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|d:Ilponly_Appr02/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ d:Ilponly_Appr02 │
│ sec/op │ sec/op vs base │
Pdui64-8 37.06m ± 0% 26.00m ± 0% -29.84% (p=0.000 n=30)
GOEXPERIMENT=simd go doc -src Pdui64_Ilponly_Appr02
for j := range len(s) {
c := s[len(s)-1-j]
if !('0' <= c && c <= '9') {
return 0, false
}
x += uint64(c-'0') * pow10
pow10 *= 10
}
Pre-computing powers of ten on the stack, as Pdui64_Ilponly_Appr00 does, performs worse than computing them alongside parsing.
benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|e:Unrollonly_Appr00/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ e:Unrollonly_Appr00 │
│ sec/op │ sec/op vs base │
Pdui64-8 37.06m ± 0% 19.35m ± 1% -47.78% (p=0.000 n=30)
GOEXPERIMENT=simd go doc -src Pdui64_Unrollonly_Appr00
x = x*10 + uint64(digits[0]-'0')
x = x*10 + uint64(digits[1]-'0')
x = x*10 + uint64(digits[2]-'0')
x = x*10 + uint64(digits[3]-'0')
x = x*10 + uint64(digits[4]-'0')
x = x*10 + uint64(digits[5]-'0')
x = x*10 + uint64(digits[6]-'0')
x = x*10 + uint64(digits[7]-'0')
x = x*10 + uint64(digits[8]-'0')
x = x*10 + uint64(digits[9]-'0')
x = x*10 + uint64(digits[10]-'0')
x = x*10 + uint64(digits[11]-'0')
x = x*10 + uint64(digits[12]-'0')
x = x*10 + uint64(digits[13]-'0')
x = x*10 + uint64(digits[14]-'0')
x = x*10 + uint64(digits[15]-'0')
x = x*10 + uint64(digits[16]-'0')
x = x*10 + uint64(digits[17]-'0')
x = x*10 + uint64(digits[18]-'0')
x = x*10 + uint64(digits[19]-'0')
Curiously, a branch-free but highly data-dependent implementation turns out to be the most performant.
One property of Go’s standard compiler that shines in this analysis is its reserved stance on optimization. Whereas with a modern C or C++ compiler one couldn’t as directly relate source-level optimization techniques to the CPU’s view of the program, global rewrites here don’t happen by themselves. One can see Pdui64_Unroll_Appr02’s source structure directly in its disassembly:
GOEXPERIMENT=simd go test -o /tmp/bin -c && go tool objdump -S -s Pdui64_Unroll_Appr02 /tmp/bin
uint64(digits[0]-'0')*10000000000000000000 +
0x5a9c3b 0fb6c9 MOVZX CL, CX
0x5a9c3e 49bf0000e8890423c78a MOVQ $0x8ac7230489e80000, R15
0x5a9c48 490fafcf IMULQ R15, CX
uint64(digits[1]-'0')*1000000000000000000 +
0x5a9c4c 0fb6d2 MOVZX DL, DX
0x5a9c4f 49bf000064a7b3b6e00d MOVQ $0xde0b6b3a7640000, R15
0x5a9c59 4c0faffa IMULQ DX, R15
uint64(digits[0]-'0')*10000000000000000000 +
0x5a9c5d 4c01f9 ADDQ R15, CX
uint64(digits[2]-'0')*100000000000000000 +
0x5a9c60 400fb6d6 MOVZX SI, DX
0x5a9c64 48be00008a5d78456301 MOVQ $0x16345785d8a0000, SI
0x5a9c6e 480faff2 IMULQ DX, SI
uint64(digits[1]-'0')*1000000000000000000 +
0x5a9c72 4801f1 ADDQ SI, CX
GOEXPERIMENT=simd go test -o /tmp/bin -c && go tool objdump -S -s Pdui64_Unrollonly_Appr00 /tmp/bin
x = x*10 + uint64(digits[17]-'0')
0x5ab3ab 4801c9 ADDQ CX, CX
0x5ab3ae 488d0c89 LEAQ 0(CX)(CX*4), CX
0x5ab3b2 0fb6542432 MOVZX 0x32(SP), DX
0x5ab3b7 0fb6d2 MOVZX DL, DX
0x5ab3ba 4801d1 ADDQ DX, CX
x = x*10 + uint64(digits[18]-'0')
0x5ab3bd 4801c9 ADDQ CX, CX
0x5ab3c0 488d0c89 LEAQ 0(CX)(CX*4), CX
0x5ab3c4 0fb6542431 MOVZX 0x31(SP), DX
0x5ab3c9 0fb6d2 MOVZX DL, DX
0x5ab3cc 4801d1 ADDQ DX, CX
x = x*10 + uint64(digits[19]-'0')
0x5ab3cf 4801c9 ADDQ CX, CX
0x5ab3d2 488d0c89 LEAQ 0(CX)(CX*4), CX
0x5ab3d6 0fb6d3 MOVZX BL, DX
0x5ab3d9 488d040a LEAQ 0(DX)(CX*1), AX
if len(s) == len(maxUint64) && (s[0] != '1' || x < 1e19) {
0x5ab3dd 488b4c2450 MOVQ 0x50(SP), CX
0x5ab3e2 4883f914 CMPQ CX, $0x14
0x5ab3e6 7527 JNE 0x5ab40f
0x5ab3e8 488b4c2448 MOVQ 0x48(SP), CX
0x5ab3ed 803931 CMPB 0(CX), $0x31
0x5ab3f0 7513 JNE 0x5ab405
0x5ab3f2 48b90000e8890423c78a MOVQ $0x8ac7230489e80000, CX
0x5ab3fc 0f1f4000 NOPL 0(AX)
0x5ab400 4839c1 CMPQ CX, AX
0x5ab403 760a JBE 0x5ab40f
Loop unrolling is an ancient technique, anecdotally the bedrock of PostScript’s market dominance from 1984 onwards (Brailsford 2016), used with great effect still nearly thirty years later (Alexandrescu 2012) and whose performance advantages I was able to show when fiddling with strconv.ParseUint.
Nevertheless, varying execution characteristics for programs that are, from within the language’s semantics, η-equivalent constitute a drippingly leaky abstraction, firmly nestled at the equivocal interplay betwixt reality’s physicality and comprehensibility. (van Hardenberg 2022, i) Loop unrolling’s reign as a go-to optimization technique is moreover confined to the higher ends of computing, with sufficient level-one instruction caches cushioning a bloated program text. When stepping outside these parameters, the heuristics fall apart and, without vigilant system-wide integration testing, may culminate in weighty performance penalizations, as was likely the case thirty years ago with the Nintendo ® 64 ™’s launch title. (Emanuar 2024)
With SIMD resounding through the lands these past couple of months (Boreham 2025) (Knyszek and Clements 2025), I had teetered on the edge of hand-writing some .s for quite a while when reading about GOEXPERIMENT=simd.
GOEXPERIMENT=simd go doc -src Pdui64_Avx2_Appr05
package pdui // import "."
func Pdui64_Avx2_Appr05(s string) (x uint64, ok bool) {
switch {
case !archsimd.X86.AVX2():
panic("simd/archsimd.X86Features.AVX2 isn't supported")
case len(s) == 0 || len(s) > 20:
return 0, false
case s[0] == '0':
return 0, s == "0"
case len(s) == 20:
if s > "18446744073709551615" {
return 0, false
}
fallthrough
default:
switch {
case len(s) <= 16:
return pdui64_Avx2_Appr05_16(s)
default:
lo, loOk := pdui64_Avx2_Appr05_16(s[len(s)-16:])
hi, hiOk := pdui64_Avx2_Appr05_4(s[:len(s)-16])
if !loOk || !hiOk {
return 0, false
}
return uint64(hi)*1_0000_0000_0000_0000 + lo, true
}
}
}
GOEXPERIMENT=simd go doc -src -u pdui64_Avx2_Appr05_16
package pdui // import "jfrech.de/blog/300/pdui"
func pdui64_Avx2_Appr05_16(s string) (x uint64, ok bool) {
if len(s) > 16 {
panic("impossible")
}
var raw = [16]byte{
'0', '0', '0', '0', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0',
}
copy(raw[len(raw)-len(s):],
unsafe.Slice(unsafe.StringData(s), len(s)))
ascii := archsimd.LoadUint8x16(&raw)
a := ascii.Sub(archsimd.BroadcastUint8x16('0'))
if a.GreaterEqual(archsimd.BroadcastUint8x16(10)).
ToBits() != 0 {
return 0, false
}
var b archsimd.Uint32x8 = a.AsInt8x16().
ExtendToInt16().DotProductPairs(
int16x16_thousandHundredTenOne).AsUint32x8()
var c archsimd.Uint32x8 = b.Mul(uint32x8_tenthousandOne)
lo := c.GetLo()
hi := c.GetHi()
var d [8]uint32
lo.Store((*[4]uint32)(d[0:4]))
hi.Store((*[4]uint32)(d[4:8]))
return uint64(d[0]+d[1]+d[2]+d[3])*1_0000_0000 +
uint64(d[4]+d[5]+d[6]+d[7]), true
}
GOEXPERIMENT=simd go doc -src -u int16x16_thousandHundredTenOne | grep -i int16x16_thousandHundredTenOne -A 5
int16x16_thousandHundredTenOne = archsimd.LoadInt16x16(new([16]int16{
1000, 100, 10, 1,
1000, 100, 10, 1,
1000, 100, 10, 1,
1000, 100, 10, 1,
}))
GOEXPERIMENT=simd go doc -src -u uint32x8_tenthousandOne | grep -i uint32x8_tenthousandOne -A 3
uint32x8_tenthousandOne = archsimd.LoadUint32x8(new([8]uint32{
1_0000, 1_0000, 1, 1,
1_0000, 1_0000, 1, 1,
}))
GOEXPERIMENT=simd go doc -src -u pdui64_Avx2_Appr05_4
package pdui // import "jfrech.de/blog/300/pdui"
func pdui64_Avx2_Appr05_4(s string) (x uint16, ok bool) {
if len(s) > 4 {
panic("impossible")
}
if len(s) < 1 || len(s) > 16 {
panic("impossible")
}
var raw = [4]byte{
'0', '0', '0', '0',
}
copy(raw[len(raw)-len(s):],
unsafe.Slice(unsafe.StringData(s), len(s)))
if !('0' <= raw[0] && raw[0] <= '9' &&
'0' <= raw[1] && raw[1] <= '9' &&
'0' <= raw[2] && raw[2] <= '9' &&
'0' <= raw[3] && raw[3] <= '9') {
return 0, false
}
return 1000*uint16(raw[0]-'0') +
100*uint16(raw[1]-'0') +
10*uint16(raw[2]-'0') +
1*uint16(raw[3]-'0'), true
}
Yet after days of coming up with various AVX2-based Pdui64 implementations, I couldn’t overtake the standard library in performance, with my best-performing Pdui64_Avx2_Appr05 still trailing it by over 11 %:
benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|c:Avx2_Appr05-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ c:Avx2_Appr05 │
│ sec/op │ sec/op vs base │
Pdui64-8 37.06m ± 0% 41.37m ± 0% +11.66% (p=0.000 n=30)
One could ask if Pdui64_Avx2_Appr05 is merely an unfortunate formulation; none of my other AVX2-based approaches fared better, though.
After having first switched gears from Pdui64 to Pdui32, as AVX2’s register size felt more equipped to handle parsing of only ten digits, fitting into a uint8x16 register, I was befuddled that all my implementation attempts performed within ±0.01 % of each other. Confused by so precise a match, I cooked up wild speculations about the microcode optimizer possibly fully comprehending the problem and thus behaving equally across all representations I had provided. Alas, I had tested Pdui32 against the string representations of uniformly chosen 64-bit integers, effectively only hitting the common code path of rejecting long strings.
Benchmarking over string representations of uniformly chosen 32-bit integers shows AVX2 intrinsics to be beneficial here.
benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|b:Unroll_Appr00-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ b:Unroll_Appr00 │
│ sec/op │ sec/op vs base │
Pdui32-8 22.25m ± 0% 16.15m ± 0% -27.38% (p=0.000 n=30)
benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|c:Avx2_Appr11-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ c:Avx2_Appr11 │
│ sec/op │ sec/op vs base │
Pdui32-8 22.25m ± 0% 13.13m ± 0% -40.97% (p=0.000 n=30)
Since a data-dependent right shift of an entire SIMD register doesn’t seem feasible, cf. Pdui64_Avx2_Appr05’s right-aligned zero-padded copy, and since neither 1<<64-1 nor 1<<32-1 have as their most-significant decimal digit the digit nine, overflows loom when not calculating in a bit width one above the parsée’s. As such, implementing Pdui64 entirely in wide registers is off the table in both AVX2 and AVX-512; full 64-bit-wide multiplication alone already only entered x86’s SIMD repertoire with AVX-512.
Anyhow, Pdui32_Avx2_Appr05, named after its Pdui64 counterpart, fares far worse:
benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|c:Avx2_Appr05-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ c:Avx2_Appr05 │
│ sec/op │ sec/op vs base │
Pdui32-8 22.25m ± 0% 51.31m ± 0% +130.65% (p=0.000 n=30)
GOEXPERIMENT=simd go doc -src Pdui32_Avx2_Appr05
package pdui // import "."
func Pdui32_Avx2_Appr05(s string) (uint32, bool) {
if s == "" ||
s[0] == '0' && len(s) != 1 ||
len(s) > len(maxUint32) ||
len(s) == len(maxUint32) && s > maxUint32 {
return 0, false
}
n := len(s)
a := archsimd.LoadUint8x16SlicePart(
unsafe.Slice(unsafe.StringData(s), n))
b := a.PermuteOrZero(int8x16_reverse)
if b.Sub(archsimd.BroadcastUint8x16('0')).
Greater(archsimd.BroadcastUint8x16(9)).
ToBits()>>(16-n) != 0 {
return 0, false
}
c := b.SubSaturated(archsimd.BroadcastUint8x16('0'))
d := c.ExtendToUint16()
e := d.Mul(
uint16x16_1_10_1_10_1_10_1_10_1_10_100_1000_1_10_100_1000)
f := e.AddPairsGrouped(e)
g := f.AddPairsGrouped(f)
h0, h1 := g.GetLo(), g.GetHi()
i := h0.InterleaveLo(h1)
j := i.ExtendLo4ToUint32()
var z [4]uint32
j.Store(&z)
x := uint64(z[3])*1_00_0000 + uint64(z[2]+z[1]*1_00)
for range 10 - len(s) {
x /= 10
}
return uint32(x), true
}
In light of these results, my Pdui64 implementations across multiple approaches not benefitting from AVX2 as opposed to Pdui32, folly::to&lt;unsigned long long&gt; (Folly 2012, i) not being founded on SIMD acceleration in 2012 (Alexandrescu 2012, ii) might be due to AVX-512 not being proposed until over a year (Reinders 2013) after Folly’s public release (Alexandrescu 2012, iii) and only hitting the market in Q4 of 2016 (Intel ® n. d.). Then again, Go and C++ might be incomparable enough not to warrant translating the possibility of AVX-512 aiding Pdui64, extrapolated from AVX2 having aided Pdui32. After all, present-day folly::to&lt;unsigned long long&gt; (Folly 2026, i) doesn’t appear to make use of SIMD, either.
Final benchmarks were compiled with Go 1.26.1 and run on Debian ® 13.3 with Linux ® 6.12.74 from a cold boot without a graphical environment, on an Intel ® Core ™ i7-4790K, which was released in Q2 of 2014 and whose instruction set architecture, ISA for short, includes 256-bit-wide AVX2 but excludes 512-bit-wide AVX-512.
However, benchmark runs of varying parameters during development, taken on a system in graphical use and unclear state, influenced implementations and analysis, constituting a bias of unknown extent.
Benchmarks’ input distribution was a million uniformly picked bitness-appropriate integers’ decimal string forms. Benchmarks were shuffled, with cross-run contamination unlikely to be present:
benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|a:Std_shuffledIn-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ a:Std_shuffledIn │
│ sec/op │ sec/op vs base │
Pdui64-8 37.06m ± 0% 36.97m ± 1% ~ (p=0.504 n=30)
benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|a:Std_shuffledIn-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
│ a:Std │ a:Std_shuffledIn │
│ sec/op │ sec/op vs base │
Pdui32-8 22.25m ± 0% 22.21m ± 0% ~ (p=0.063 n=30)
Pdui implementations are BSD-licenced, available either through https://
Out of this analysis fell the discovery of a semantically relevant documentation-implementation divergence for strconv.ParseUint.
I wasn’t able to get a performance edge over classical methods through AVX2 in Pdui64, which is what I had initially set out to find, leaving me with the feeling that totalling up a wide register isn’t a task that gains from AVX2; pairwise add operations have to be repeated, calculating sums multiple times in parallel, of which all but one get immediately discarded. Even Pdui32 only benefitted from the syntactical check being written as a wide-register compare, the calculation’s core remaining classical.
What surprised me is the delicate interplay between conceptual-lexical ILP-friendliness and unrolling; both have clearly observable performance-favourable implications in isolation, yet they do not in general compound. That is to say, thinking in those macro descriptions of CPU behaviour can lead one to untapped performance, though the total lack of a unifying theory inescapably shows through. All benchmarks this text is based on were run on a used decade-old processor; I have little doubt they are not representative of wide classes of hardware, with a likely fruitful next step being to look at how AVX-512 instructions or different non-x86 ISAs altogether fare.