Of parsing decimal integer representations, instruction-level parallelism, short-circuiting overflow detection and AVX2. (#300)

Jonathan Frech, 6 April 2026

Data is an inherently metaphysical description of state whose many forms are unified by contextually appropriate understanding thereof. Whereas the most blatant display of this contextual dependence is observed when not shielding representations from ignorant physical tampering (one need only think of someone cutting a punched card in two arbitrarily and gluing both parts next to each other in reverse order), these forms can also exist embedded in themselves metaphysical arrangements.
Such is the case in the stringly-typed Web, where lines between ordinals in all their Platonicity, registers etched in silicon and everyday written word blur, a uint64 indistinguishable from the 8-bit-a-byte ASCII encoding of a minimal-length decimal. No wonder that in this environment the minutiae of what it means to be a string, tightly coupled with permeability of form, become a fertile ground to eke out unrestingly drying up drops of performance.

Touching string implementations is intricate with expected pay-offs somewhere around −1⁠ ⁠% (Ormrod 2016, i), where close-knit communication with the entire system’s allocator is essential. (Alexandrescu 2012)
GCC 5.1 chose in 2015 (GNU 2024) to employ small buffer optimization, here called short-string optimization and henceforth abbreviated to SSO, for libstdc⁠+⁠+ (GNU 2025), which optimizes for the case of many string objects being live at once, where cache page misses start to outweigh the bulkier string handles’ cache line occupations. (Ormrod 2016, iii) Despite these tangled considerations at play when opting for SSO, many implementations’ convergence on it has prompted some authors to portray it as decisively superior. (Stroustrup 2018, i)
Being too provocative with one’s design here also risks heavy instability burdens (Ormrod 2016, ii). All of the above paired with the open question of how the itself independently evolving Go memory management story (Knyszek and Clements 2025) would interplay with SSO are likely factors in The Go Project’s flagship compiler’s resolute avoidance of this implementation strategy. (Cox “comment on Go issue 18894” 2017)

Less daunting than replacing the string backing, the translation step between textual decimals and register-sized uint64s offers more promising opportunity for “the silicon [not to] stay dark” (Alexandrescu 2012, i): implementations based on a left-to-right reading order, like Go 1.26.1’s strconv.ParseUint, released 2026-03-05 (Go “1.26.1 release” 2026), present the computation of the dot product between a number’s digits and successive powers of ten in a highly data-dependent way, meshing poorly with current-day CPUs’ heuristics.
Attempting to write an instruction-level-parallelism-friendly, henceforth abbreviated to ILP-friendly, problem presentation for strconv.ParseUint, through both spelling out the up-to-twenty-dimensional dot product statically and loop unrolling, led me both to discovering a subtle mismatch between strconv.ParseUint’s documentation and its implementation, present since Go’s inception, and to an over −40⁠ ⁠% performance gain in a micro-benchmark over uniformly distributed uint64s’ decimal representations. Restricting the problem and benchmark to 32 bits, AVX2 intrinsics allowed for an again over −40⁠ ⁠% improvement over strconv.ParseUint(x, 10, 32), albeit with only the syntactical check benefitting from AVX instruction VPCMPGTW (Go “Uint16x8.Greater” 2026), used indirectly through unsigned comparison emulation.

“If s is empty or contains invalid digits, err.Err = ErrSyntax (...)” strconv.ParseUint claims of its behaviour (Go “strconv” 2026), though in the history of Go 1 this has never been the case:

// Cf. https://jfrech.de/blog/300 (accessed 2026-04-06)
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	x, err := strconv.ParseUint(strings.Repeat("9", 20)+"nine", 10, 64)
	fmt.Println(x)
	fmt.Println(err)
}

// go version go1.26.1 linux/amd64
// Output:
// 18446744073709551615
// strconv.ParseUint: parsing "99999999999999999999nine": value out of range

Overflow reporting takes priority over reading the full string (Go “internal/strconv” 2026), myopically declaring a whole slew of syntactical malformations as out of range. Introduced in late 2008, with over seventeen years of age, this behaviour is over three years older than Go 1. One commit prior, invalid digits were always rejected but overflow detection was still stubbed. (Cox “lib/strconv” 2008)

There isn’t much use of strconv.ParseUint or its signed sibling inside of the standard library, which was unexpected. Most packages just roll their own, not bothering with bit sizes, bases or underscores. One such locally-written implementation is encoding/json/v2’s encoding/json/internal/jsonwire.ParseUint, which even has a unit test case for the behaviour of an overflowing decimal integer representation followed by syntactically illegal text (Go “jsonwire test” 2025); had the two equally-named implementations shared their test cases, strconv.ParseUint’s documentation-implementation discrepancy might have been unearthed then and there.
Encoding/json/v2’s encoding/json/internal/jsonwire.ParseUint’s overflow check comes close to classifying as golfed: the condition “b[0] !⁠= '1' |⁠| v < 1⁠e⁠19” (Go “jsonwire” 2025) combines an integral-valued untyped floating-point constant’s ability to adopt the guise of a uint64 with certainty about overflow behaviour.
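Why that condition suffices can be spelled out on its own. The sketch below is an illustration of the reasoning, not the jsonwire source: overflows20 is a hypothetical helper and assumes its argument consists of exactly twenty ASCII digits without a leading zero.

```go
package main

import "fmt"

// overflows20 reports whether a string of exactly twenty decimal digits
// (no leading zero; both are assumptions of this sketch) exceeds the
// uint64 range. The accumulation deliberately wraps modulo 1<<64.
func overflows20(b string) bool {
	var v uint64
	for i := 0; i < len(b); i++ {
		// After nineteen digits v is below 1e19 and has not wrapped;
		// only the final step can wrap.
		v = 10*v + uint64(b[i]-'0')
	}
	// A leading digit of two or more means at least 2e19, which is
	// above 1<<64. With a leading one the true value lies in
	// [1e19, 2e19): if it fits, v equals it and is at least 1e19; if it
	// wrapped, v is below 2e19 - 1<<64, which is far below 1e19.
	return b[0] != '1' || v < 1e19
}

func main() {
	for _, s := range []string{
		"10000000000000000000",
		"18446744073709551615",
		"18446744073709551616",
		"19999999999999999999",
		"20000000000000000000",
	} {
		fmt.Println(s, overflows20(s))
	}
}
```

The untyped constant 1e19 converts to uint64 because it is integral and representable, which is the guise-adopting the text alludes to.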
It is also highly data-dependent, as is strconv.ParseUint. My over −40⁠ ⁠% faster, as reported by a micro-benchmark, implementation in hand, I looked at its performance integrated in a larger system, namely the encoding/json/v2 benchmarking test suite. Yet the results are inconclusive: jsonv2.bench.txtar

From a correctness point of view, I find it astonishing how an often-touched (Griesemer “comment on Go issue 31197” 2019) (Neil “Go issue 46641” 2021) (Ulen “Go issue 21275” 2017) (Ulen “Go issue 21278” 2017), assumedly deliberately straight-forwardly-written routine’s undescribed behaviour escaped notice for over seventeen years, with a witness test case quietly sitting but a few packages down the street.

I’m hesitant about wanting this behaviour changed. On the one hand, it to me looks like an intricately confusing bug on which I see no plausibly useful dependence. On the other hand, I’ve gotten too jaded with backwards incompatibility to myself exclaim “let’s fix the bug.” (Cox “comment on Go issue 21278” 2017) See Go issue 78546, which tracks that question. (F. “Go issue 78546” 2026)

Fuzzing differentially against the standard library was the technique that enriched a mere exercise in applying loop unrolling and ILP-friendly operation order (Alexandrescu 2012) into a semantical verification of a part of the Go standard library.

Apart from the introductorily mentioned application of AVX2 intrinsics, the two classical performance techniques I used are representing the computation in a way favourable for ILP and loop unrolling, both together named Pdui64_Unroll_*, which achieves speeds at over −40⁠ ⁠%:

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|b:Unroll_Appr02-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │           b:Unroll_Appr02           │
         │   sec/op    │   sec/op     vs base                │
Pdui64-8   37.06m ± 0%   19.99m ± 0%  -46.06% (p=0.000 n=30)

GOEXPERIMENT=simd go doc -src Pdui64_Unroll_Appr02
        x := 0 +
                uint64(digits[0]-'0')*10000000000000000000 +
                uint64(digits[1]-'0')*1000000000000000000 +
                uint64(digits[2]-'0')*100000000000000000 +
                uint64(digits[3]-'0')*10000000000000000 +
                uint64(digits[4]-'0')*1000000000000000 +
                uint64(digits[5]-'0')*100000000000000 +
                uint64(digits[6]-'0')*10000000000000 +
                uint64(digits[7]-'0')*1000000000000 +
                uint64(digits[8]-'0')*100000000000 +
                uint64(digits[9]-'0')*10000000000 +
                uint64(digits[10]-'0')*1000000000 +
                uint64(digits[11]-'0')*100000000 +
                uint64(digits[12]-'0')*10000000 +
                uint64(digits[13]-'0')*1000000 +
                uint64(digits[14]-'0')*100000 +
                uint64(digits[15]-'0')*10000 +
                uint64(digits[16]-'0')*1000 +
                uint64(digits[17]-'0')*100 +
                uint64(digits[18]-'0')*10 +
                uint64(digits[19]-'0')*1 +
                0

Whereas I at first blush assumed the two to be compounding (Alexandrescu 2012), separating Pdui64_Unroll, which includes ILP, out into Pdui64_Ilponly and Pdui64_Unrollonly fails to show their performance gains adding up, with ILP alone at under −30⁠ ⁠% and unrolling alone out-competing unrolled ILP-friendly:

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|d:Ilponly_Appr02/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │          d:Ilponly_Appr02           │
         │   sec/op    │   sec/op     vs base                │
Pdui64-8   37.06m ± 0%   26.00m ± 0%  -29.84% (p=0.000 n=30)

GOEXPERIMENT=simd go doc -src Pdui64_Ilponly_Appr02
        for j := range len(s) {
                c := s[len(s)-1-j]
                if !('0' <= c && c <= '9') {
                        return 0, false
                }
                x += uint64(c-'0') * pow10
                pow10 *= 10
        }

Pre-computing powers of ten on the stack, as Pdui64_Ilponly_Appr00 does, performs worse than computing them alongside parsing.

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|e:Unrollonly_Appr00/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │         e:Unrollonly_Appr00         │
         │   sec/op    │   sec/op     vs base                │
Pdui64-8   37.06m ± 0%   19.35m ± 1%  -47.78% (p=0.000 n=30)

GOEXPERIMENT=simd go doc -src Pdui64_Unrollonly_Appr00
        x = x*10 + uint64(digits[0]-'0')
        x = x*10 + uint64(digits[1]-'0')
        x = x*10 + uint64(digits[2]-'0')
        x = x*10 + uint64(digits[3]-'0')
        x = x*10 + uint64(digits[4]-'0')
        x = x*10 + uint64(digits[5]-'0')
        x = x*10 + uint64(digits[6]-'0')
        x = x*10 + uint64(digits[7]-'0')
        x = x*10 + uint64(digits[8]-'0')
        x = x*10 + uint64(digits[9]-'0')
        x = x*10 + uint64(digits[10]-'0')
        x = x*10 + uint64(digits[11]-'0')
        x = x*10 + uint64(digits[12]-'0')
        x = x*10 + uint64(digits[13]-'0')
        x = x*10 + uint64(digits[14]-'0')
        x = x*10 + uint64(digits[15]-'0')
        x = x*10 + uint64(digits[16]-'0')
        x = x*10 + uint64(digits[17]-'0')
        x = x*10 + uint64(digits[18]-'0')
        x = x*10 + uint64(digits[19]-'0')

Curiously, a very branch-free but highly data-dependent implementation turns out to be most performant.

One property of Go’s standard compiler that shines in this analysis is its reserved stance on optimization. Whereas with a modern C or C++ compiler, one couldn’t as directly relate source-level optimization techniques to the CPU’s view of the program, global rewrites here don’t happen by themselves. One can see Pdui64_Unroll_Appr02’s constants present in the object code and Pdui64_Unrollonly_Appr00 only knowing of 1e19, in hexadecimal $0x8ac7230489e80000:

GOEXPERIMENT=simd go test -o /tmp/bin -c && go tool objdump -S -s Pdui64_Unroll_Appr02 /tmp/bin

                uint64(digits[0]-'0')*10000000000000000000 +
  0x5a9c3b              0fb6c9                  MOVZX CL, CX
  0x5a9c3e              49bf0000e8890423c78a    MOVQ $0x8ac7230489e80000, R15
  0x5a9c48              490fafcf                IMULQ R15, CX
                uint64(digits[1]-'0')*1000000000000000000 +
  0x5a9c4c              0fb6d2                  MOVZX DL, DX
  0x5a9c4f              49bf000064a7b3b6e00d    MOVQ $0xde0b6b3a7640000, R15
  0x5a9c59              4c0faffa                IMULQ DX, R15
                uint64(digits[0]-'0')*10000000000000000000 +
  0x5a9c5d              4c01f9                  ADDQ R15, CX
                uint64(digits[2]-'0')*100000000000000000 +
  0x5a9c60              400fb6d6                MOVZX SI, DX
  0x5a9c64              48be00008a5d78456301    MOVQ $0x16345785d8a0000, SI
  0x5a9c6e              480faff2                IMULQ DX, SI
                uint64(digits[1]-'0')*1000000000000000000 +
  0x5a9c72              4801f1                  ADDQ SI, CX

GOEXPERIMENT=simd go test -o /tmp/bin -c && go tool objdump -S -s Pdui64_Unrollonly_Appr00 /tmp/bin

        x = x*10 + uint64(digits[17]-'0')
  0x5ab3ab              4801c9                  ADDQ CX, CX
  0x5ab3ae              488d0c89                LEAQ 0(CX)(CX*4), CX
  0x5ab3b2              0fb6542432              MOVZX 0x32(SP), DX
  0x5ab3b7              0fb6d2                  MOVZX DL, DX
  0x5ab3ba              4801d1                  ADDQ DX, CX
        x = x*10 + uint64(digits[18]-'0')
  0x5ab3bd              4801c9                  ADDQ CX, CX
  0x5ab3c0              488d0c89                LEAQ 0(CX)(CX*4), CX
  0x5ab3c4              0fb6542431              MOVZX 0x31(SP), DX
  0x5ab3c9              0fb6d2                  MOVZX DL, DX
  0x5ab3cc              4801d1                  ADDQ DX, CX
        x = x*10 + uint64(digits[19]-'0')
  0x5ab3cf              4801c9                  ADDQ CX, CX
  0x5ab3d2              488d0c89                LEAQ 0(CX)(CX*4), CX
  0x5ab3d6              0fb6d3                  MOVZX BL, DX
  0x5ab3d9              488d040a                LEAQ 0(DX)(CX*1), AX
        if len(s) == len(maxUint64) && (s[0] != '1' || x < 1e19) {
  0x5ab3dd              488b4c2450              MOVQ 0x50(SP), CX
  0x5ab3e2              4883f914                CMPQ CX, $0x14
  0x5ab3e6              7527                    JNE 0x5ab40f
  0x5ab3e8              488b4c2448              MOVQ 0x48(SP), CX
  0x5ab3ed              803931                  CMPB 0(CX), $0x31
  0x5ab3f0              7513                    JNE 0x5ab405
  0x5ab3f2              48b90000e8890423c78a    MOVQ $0x8ac7230489e80000, CX
  0x5ab3fc              0f1f4000                NOPL 0(AX)
  0x5ab400              4839c1                  CMPQ CX, AX
  0x5ab403              760a                    JBE 0x5ab40f

Loop unrolling is an ancient technique, anecdotally the bedrock of PostScript’s market dominance from 1984 onwards (Brailsford 2016), used to great effect still nearly thirty years later (Alexandrescu 2012) and whose performance advantages I was able to show when fiddling with strconv.ParseUint’s implementation.
Nevertheless, varying execution characteristics for, from within the language’s semantics, η-equivalent programs constitute a drippingly leaky abstraction, firmly nestled at the equivocal interplay betwixt reality’s physicality and comprehensibility. (van Hardenberg 2022, i) Its reign as a go-to optimization technique is moreover confined to the higher ends of computing, with sufficient level-one instruction caches cushioning a bloated program text. When stepping outside these parameters, the heuristics fall apart, and without vigilant system-wide integration testing may culminate in weighty performance penalizations, such as was likely the case thirty years ago with the Nintendo⁠ ⁠® 64⁠ ⁠™’s launch title. (Emanuar 2024)

With SIMD resounding through the lands the past couple of months (Boreham 2025) (Knyszek and Clements 2025), I had teetered on the edge of hand-writing some .s for quite some while when reading about GOEXPERIMENT=simd (Go “simd/archsimd” 2026), which got me excited to less circuitously realize a long-cherished dream of mine: to write some SIMD.

GOEXPERIMENT=simd go doc -src Pdui64_Avx2_Appr05
package pdui // import "."

func Pdui64_Avx2_Appr05(s string) (x uint64, ok bool) {
        switch {
        case !archsimd.X86.AVX2():
                panic("simd/archsimd.X86Features.AVX2 isn't supported")

        case len(s) == 0 || len(s) > 20:
                return 0, false

        case s[0] == '0':
                return 0, s == "0"

        case len(s) == 20:
                if s > "18446744073709551615" {
                        return 0, false
                }
                fallthrough

        default:
                switch {
                case len(s) <= 16:
                        return pdui64_Avx2_Appr05_16(s)

                default:
                        lo, loOk := pdui64_Avx2_Appr05_16(s[len(s)-16:])
                        hi, hiOk := pdui64_Avx2_Appr05_4(s[:len(s)-16])
                        if !loOk || !hiOk {
                                return 0, false
                        }
                        return uint64(hi)*1_0000_0000_0000_0000 + lo, true
                }
        }
}

GOEXPERIMENT=simd go doc -src -u pdui64_Avx2_Appr05_16
package pdui // import "jfrech.de/blog/300/pdui"

func pdui64_Avx2_Appr05_16(s string) (x uint64, ok bool) {
        if len(s) > 16 {
                panic("impossible")
        }

        var raw = [16]byte{
                '0', '0', '0', '0', '0', '0', '0', '0',
                '0', '0', '0', '0', '0', '0', '0', '0',
        }
        copy(raw[len(raw)-len(s):],
                unsafe.Slice(unsafe.StringData(s), len(s)))
        ascii := archsimd.LoadUint8x16(&raw)

        a := ascii.Sub(archsimd.BroadcastUint8x16('0'))
        if a.GreaterEqual(archsimd.BroadcastUint8x16(10)).
                ToBits() != 0 {
                return 0, false
        }

        var b archsimd.Uint32x8 = a.AsInt8x16().
                ExtendToInt16().DotProductPairs(
                int16x16_thousandHundredTenOne).AsUint32x8()
        var c archsimd.Uint32x8 = b.Mul(uint32x8_tenthousandOne)

        lo := c.GetLo()
        hi := c.GetHi()

        var d [8]uint32
        lo.Store((*[4]uint32)(d[0:4]))
        hi.Store((*[4]uint32)(d[4:8]))

        return uint64(d[0]+d[1]+d[2]+d[3])*1_0000_0000 +
                uint64(d[4]+d[5]+d[6]+d[7]), true
}

GOEXPERIMENT=simd go doc -src -u int16x16_thousandHundredTenOne | grep -i int16x16_thousandHundredTenOne -A 5
        int16x16_thousandHundredTenOne = archsimd.LoadInt16x16(new([16]int16{
                1000, 100, 10, 1,
                1000, 100, 10, 1,
                1000, 100, 10, 1,
                1000, 100, 10, 1,
        }))

GOEXPERIMENT=simd go doc -src -u uint32x8_tenthousandOne | grep -i uint32x8_tenthousandOne -A 3
        uint32x8_tenthousandOne = archsimd.LoadUint32x8(new([8]uint32{
                1_0000, 1_0000, 1, 1,
                1_0000, 1_0000, 1, 1,
        }))
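The reduction inside pdui64_Avx2_Appr05_16 can be followed lane by lane in scalar Go. The sketch below emulates the three steps on plain arrays and uses no simd/archsimd call; parse16Scalar and padDigits are hypothetical names, and the emulation assumes that DotProductPairs multiplies adjacent 16-bit lanes by the given weights and sums each pair into one 32-bit lane.

```go
package main

import "fmt"

// padDigits left-pads up to sixteen ASCII digits with zeros and
// converts them to digit values. Validation is omitted in this sketch.
func padDigits(s string) [16]uint8 {
	var d [16]uint8
	for i := 0; i < len(s); i++ {
		d[16-len(s)+i] = s[i] - '0'
	}
	return d
}

// parse16Scalar emulates the SIMD reduction on sixteen digit values.
func parse16Scalar(digits [16]uint8) uint64 {
	weights := [4]int32{1000, 100, 10, 1}
	scales := [4]uint32{1_0000, 1_0000, 1, 1}

	// Step 1: pairwise dot product, sixteen lanes become eight.
	// Lane 0 holds d0*1000 + d1*100, lane 1 holds d2*10 + d3, and so on.
	var b [8]uint32
	for i := range b {
		even := int32(digits[2*i]) * weights[(2*i)%4]
		odd := int32(digits[2*i+1]) * weights[(2*i+1)%4]
		b[i] = uint32(even + odd)
	}

	// Step 2: lane-wise multiply, lifting the upper two digits of each
	// four-lane group by another factor of ten thousand.
	var c [8]uint32
	for i := range c {
		c[i] = b[i] * scales[i%4]
	}

	// Step 3: each half sums to an eight-digit number below 1<<32.
	hi := c[0] + c[1] + c[2] + c[3]
	lo := c[4] + c[5] + c[6] + c[7]
	return uint64(hi)*1_0000_0000 + uint64(lo)
}

func main() {
	for _, s := range []string{"0", "42", "1234567890123456"} {
		fmt.Println(s, parse16Scalar(padDigits(s)))
	}
}
```

Seen this way, only the first two steps are lane-parallel work; the final sums are ordinary scalar additions.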

GOEXPERIMENT=simd go doc -src -u pdui64_Avx2_Appr05_4
package pdui // import "jfrech.de/blog/300/pdui"

func pdui64_Avx2_Appr05_4(s string) (x uint16, ok bool) {
        if len(s) > 4 {
                panic("impossible")
        }

        if len(s) < 1 || len(s) > 16 {
                panic("impossible")
        }

        var raw = [4]byte{
                '0', '0', '0', '0',
        }
        copy(raw[len(raw)-len(s):],
                unsafe.Slice(unsafe.StringData(s), len(s)))

        if !('0' <= raw[0] && raw[0] <= '9' &&
                '0' <= raw[1] && raw[1] <= '9' &&
                '0' <= raw[2] && raw[2] <= '9' &&
                '0' <= raw[3] && raw[3] <= '9') {
                return 0, false
        }

        return 1000*uint16(raw[0]-'0') +
                100*uint16(raw[1]-'0') +
                10*uint16(raw[2]-'0') +
                1*uint16(raw[3]-'0'), true
}

Yet after days of coming up with various AVX2-based Pdui64 implementations, I couldn’t overtake the standard library in performance, with my best-performing Pdui64_Avx2_Appr05 being +⁠11.66⁠ ⁠% behind.

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|c:Avx2_Appr05-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │            c:Avx2_Appr05            │
         │   sec/op    │   sec/op     vs base                │
Pdui64-8   37.06m ± 0%   41.37m ± 0%  +11.66% (p=0.000 n=30)

One could ask if Pdui64_Avx2_Appr05 may benefit from function inlining, which is firmly answered negatively by both Pdui64_Avx2_Appr16 and Pdui64_Avx2_Appr17.

After having first switched gears from Pdui64 to Pdui32, as AVX2’s register size felt more equipped to handle parsing of only ten digits, fitting into a uint8x16 register, I was befuddled that all my implementation attempts performed within ±0.01⁠ ⁠% of each other. Confused by so precise a match, I cooked up wild speculations about the microcode optimizer possibly fully comprehending the problem and thus behaving equally across all representations I had provided. Alas, I had tested Pdui32 against the string representations of uniformly chosen 64-bit integers, effectively only hitting the common code path of rejecting long strings.
Benchmarking over string representations of uniformly chosen 32-bit integers shows AVX2 intrinsics to here be beneficial.
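How lopsided the mistaken input set was can be worked out directly: a uint64 has at most n decimal digits exactly when it is below 10^n, so the share of such values is 10^n divided by 2^64. The helper name below is hypothetical and the sketch is only this arithmetic.

```go
package main

import (
	"fmt"
	"math"
)

// shareAtMostDigits returns the fraction of all uint64 values whose
// minimal decimal representation has at most n digits, for 1 <= n <= 19.
func shareAtMostDigits(n int) float64 {
	return math.Pow(10, float64(n)) / math.Pow(2, 64)
}

func main() {
	// Strings a 32-bit parser can accept at all have at most ten digits.
	fmt.Printf("at most 10 digits: %.3g\n", shareAtMostDigits(10))
	fmt.Printf("exactly 19 digits: %.3f\n",
		shareAtMostDigits(19)-shareAtMostDigits(18))
	fmt.Printf("exactly 20 digits: %.3f\n", 1-shareAtMostDigits(19))
}
```

With a share of roughly 5.4e-10, a million uniformly drawn 64-bit integers would be expected to contain no string of ten or fewer digits at all, so every candidate spent its time on the length check.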

benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|b:Unroll_Appr00-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │           b:Unroll_Appr00           │
         │   sec/op    │   sec/op     vs base                │
Pdui32-8   22.25m ± 0%   16.15m ± 0%  -27.38% (p=0.000 n=30)

benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|c:Avx2_Appr11-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │            c:Avx2_Appr11            │
         │   sec/op    │   sec/op     vs base                │
Pdui32-8   22.25m ± 0%   13.13m ± 0%  -40.97% (p=0.000 n=30)

Since a data-dependent right shift of an entire SIMD register doesn’t seem feasible, cf. Pdui64_Avx2_Appr00, one may be tempted to statically permute all digits to lie in reverse order, assume the most-significant digit is at its highest place and divide by the appropriate power of ten at the very end. Unfortunately, neither 1<<64-1 nor 1<<32-1 has as its most-significant decimal digit the digit nine, causing overflows when not calculating in a bit width one above the parsée. As such, implementing Pdui64 is off the table in both AVX2 and AVX-512. 64-bit wide multiplication simd/archsimd.Uint64x2.Mul being AVX-512-only has halted my pursuing of this approach for Pdui32, as I don’t currently have access to an AVX-512-capable processor.
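The overflow is easy to reproduce in scalar code. The following sketch models the scale-then-divide idea for up to ten digits, once in 32 and once in 64 bits; both function names are hypothetical, and inputs are assumed to be one to ten valid ASCII digits.

```go
package main

import "fmt"

// scaleThenDivide32 places the first digit at the 1e9 position
// regardless of length and divides the surplus powers of ten away at
// the end, computing in uint32 throughout.
func scaleThenDivide32(s string) uint32 {
	var x uint32
	pow10 := uint32(1_000_000_000)
	for i := 0; i < len(s); i++ {
		x += uint32(s[i]-'0') * pow10 // wraps once the scaled value exceeds 1<<32-1
		pow10 /= 10
	}
	for i := len(s); i < 10; i++ {
		x /= 10
	}
	return x
}

// scaleThenDivide64 is the same computation one bit width up.
func scaleThenDivide64(s string) uint32 {
	var x uint64
	pow10 := uint64(1_000_000_000)
	for i := 0; i < len(s); i++ {
		x += uint64(s[i]-'0') * pow10
		pow10 /= 10
	}
	for i := len(s); i < 10; i++ {
		x /= 10
	}
	return uint32(x)
}

func main() {
	for _, s := range []string{"42", "5", "4294967295"} {
		fmt.Println(s, scaleThenDivide32(s), scaleThenDivide64(s))
	}
}
```

The input "42" survives in 32 bits because 4200000000 is still below 1<<32, whereas "5" scales to 5000000000, wraps and is lost.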
Anyhow, Pdui32_Avx2_Appr05 woefully underperforms:

benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|c:Avx2_Appr05-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │            c:Avx2_Appr05             │
         │   sec/op    │   sec/op     vs base                 │
Pdui32-8   22.25m ± 0%   51.31m ± 0%  +130.65% (p=0.000 n=30)

GOEXPERIMENT=simd go doc -src Pdui32_Avx2_Appr05
package pdui // import "."

func Pdui32_Avx2_Appr05(s string) (uint32, bool) {
        if s == "" ||
                s[0] == '0' && len(s) != 1 ||
                len(s) > len(maxUint32) ||
                len(s) == len(maxUint32) && s > maxUint32 {
                return 0, false
        }

        n := len(s)
        a := archsimd.LoadUint8x16SlicePart(
                unsafe.Slice(unsafe.StringData(s), n))
        b := a.PermuteOrZero(int8x16_reverse)
        if b.Sub(archsimd.BroadcastUint8x16('0')).
                Greater(archsimd.BroadcastUint8x16(9)).
                ToBits()>>(16-n) != 0 {
                return 0, false
        }
        c := b.SubSaturated(archsimd.BroadcastUint8x16('0'))
        d := c.ExtendToUint16()
        e := d.Mul(
                uint16x16_1_10_1_10_1_10_1_10_1_10_100_1000_1_10_100_1000)

        f := e.AddPairsGrouped(e)
        g := f.AddPairsGrouped(f)
        h0, h1 := g.GetLo(), g.GetHi()
        i := h0.InterleaveLo(h1)
        j := i.ExtendLo4ToUint32()
        var z [4]uint32
        j.Store(&z)

        x := uint64(z[3])*1_00_0000 + uint64(z[2]+z[1]*1_00)
        for range 10 - len(s) {
                x /= 10
        }
        return uint32(x), true
}

In light of these results, my Pdui64 implementations not benefitting from AVX2 over multiple approaches as opposed to Pdui32, folly::to<unsigned long long> (Folly 2012, i) in 2012 not being founded on SIMD acceleration (Alexandrescu 2012, ii) might be due to AVX-512 not being proposed for over a year (Reinders 2013) after Folly’s public release (Alexandrescu 2012, iii) and only hitting the market in Q⁠ ⁠4 of 2016 (Intel⁠ ⁠® n. d.). Then again, Go and C++ might be incomparable enough to not warrant translating the possibility of AVX-512 aiding Pdui64, extrapolated from AVX2 having aided Pdui32. After all, present-day folly::to<unsigned long long> (Folly 2026, i) doesn’t appear to make use of SIMD, either.

Final benchmarks were compiled with Go 1.26.1 and run on Debian⁠ ⁠® 13.3 with Linux⁠ ⁠® 6.12.74 from a cold boot without a graphical environment running on an Intel⁠ ⁠® Core⁠ ⁠™ i7⁠-⁠4790K, which was released in Q⁠ ⁠2 of 2014 and whose instruction set architecture, short ISA, includes 256-bit-wide AVX2 but excludes 512-bit-wide AVX-512.
However, benchmark runs of varying parameters running on a system in graphical use and unclear state influenced implementations and analysis, constituting a bias of unknown extent.
Benchmarks’ input distribution was a million uniformly-picked bitness-appropriate integers’ decimal string forms. Benchmarks were shuffled, with cross-run-contamination unlikely to be present:

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|a:Std_shuffledIn-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │       a:Std_shuffledIn        │
         │   sec/op    │   sec/op     vs base          │
Pdui64-8   37.06m ± 0%   36.97m ± 1%  ~ (p=0.504 n=30)

benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|a:Std_shuffledIn-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │       a:Std_shuffledIn        │
         │   sec/op    │   sec/op     vs base          │
Pdui32-8   22.25m ± 0%   22.21m ± 0%  ~ (p=0.063 n=30)

Pdui implementations are BSD-licenced, available either through https://pkg.go.dev/jfrech.de/blog/300/pdui or git clone https://jfrech.de/blog/300/pdui.git, wherefrom withal raw benchmark results can be perused.

Out of this analysis fell noticing a semantically relevant documentation-implementation divergence for strconv.ParseUint and in micro-benchmarks a −40⁠ ⁠% faster Pdui64 implementation through classical methods together with a beneficially AVX2-intrinsics-aided Pdui32 implementation.
I wasn’t able to get a performance edge over classical methods through AVX2 in Pdui64, which is what I had initially set out to find, leaving me with the feeling that totalling up a wide register isn’t a task that gains from AVX2; pairwise add operations have to be repeated to calculate sums multiple times in parallel, of which all but one get immediately discarded. Even Pdui32 only benefitted from the syntactical check being written as a wide register compare, the calculation’s core remaining classical.
What surprised me is the delicate interplay between conceptual-lexical ILP-friendliness and unrolling; both have clearly observable performance-favourable implications in isolation, yet they do not in general compound. That is to say, thinking in those macro descriptions of CPU behaviour can lead one to untapped performance, though the total lack of a unifying theory inescapably shows through. All benchmarks this text is based on were run on a used decade-old processor; I have little doubt they are not representative of wide classes of hardware, with a likely fruitful next step being to look at how AVX-512 instructions or different non-x86 ISAs altogether fare.
