Jonathan. Frech’s WebBlog

Of parsing dec­i­mal in­te­ger rep­re­sent­ations, in­struc­tion-level par­al­lel­ism, short-cir­cuit­ing over­flow de­tec­tion and AVX2. (#300)

Jonathan Frech,

Data is an inherently metaphysical description of state whose many forms are unified by contextually appropriate understanding thereof. Whereas the most blatant display of this contextual dependence is observed when not shielding representations from ignorant physical tampering (one need only think of someone cutting a punched card in two arbitrarily and gluing both parts next to each other in reverse order), these forms can also exist embedded in themselves metaphysical arrangements.
Such is the case in the stringly-typed Web, where lines be­tween ordinals in all their Platonicity, registers etched in sil­i­con and every-day writ­ten word blur, a uint64 in­dis­tin­guish­able from the 8-bit-a-byte ASCII en­cod­ing of a min­i­mal-length dec­i­mal. No wonder that in this en­vi­ron­ment both the mi­nu­tia of what it means to be a string tight­ly coupled with per­me­abil­i­ty of form be­come a fer­tile ground to eke out unrestingly drying up drops of per­for­mance.

Touching string implementations is intricate, with expected pay-offs somewhere around −1 % (Ormrod 2016, i), and close-knit communication with the entire system’s allocator is essential. (Alexandrescu 2012)
GCC 5.1 chose in 2015 (GNU 2024) to em­ploy small buf­fer op­ti­mi­za­tion, here called short-string op­ti­mi­za­tion and hence­forth ab­bre­vi­at­ed to SSO, for libstdc⁠+⁠+ (GNU 2025), which optimizes for the case of many string ob­jects being live at once, where cache page misses start to out­weigh the bulkier string han­dles’ cache line occupations. (Ormrod 2016, iii) De­spite these tan­gled con­sid­er­a­tions at play when opting for SSO, many im­ple­men­ta­tions’ con­ver­gence on it has prompted some authors to por­tray it as de­ci­sive­ly su­pe­ri­or. (Stroustrup 2018, i)
Being too provocative with one’s design here also risks heavy instability burdens (Ormrod 2016, ii). All of the above, paired with the open question of how the itself independently evolving Go memory management story (Knyszek and Clements 2025) would interplay with SSO, are likely factors in The Go Project’s flagship compiler’s resolute avoidance of this implementation strategy. (Cox “comment on Go issue 18894” 2017)

Less daunting than replacing the string backing, the translation step between textual decimals and register-sized uint64s offers more promising opportunity for “the silicon [not to] stay dark” (Alexandrescu 2012, i): implementations based on a left-to-right reading order, like Go 1.26.1’s strconv.ParseUint, released 2026-03-05 (Go “1.26.1 release” 2026), present the computation of the dot product between a number’s digits and successive powers of ten in a highly data-dependent way, meshing poorly with current-day CPUs’ heuristics.
Attempting to write an instruction-level-parallelism-friendly, henceforth abbreviated to ILP-friendly, problem presentation for strconv.ParseUint, through both spelling out the up-to-twenty-dimensional dot product statically and loop unrolling, led me both to discover a subtle mismatch between strconv.ParseUint’s documentation and its behaviour, present since Go’s inception, and to an over −40 % performance gain in a micro-benchmark over uniformly distributed uint64s’ decimal representations. Restricting the problem and benchmark to 32 bits, AVX2 intrinsics allowed for an again over −40 % improvement over strconv.ParseUint(x, 10, 32), albeit with only the syntactical check benefitting from the AVX instruction VPCMPGTW (Go “Uint16x8.Greater” 2026), used indirectly through unsigned comparison emulation.

“If s is empty or contains in­va­lid digits, err.Err = ErrSyntax (...)” strconv.ParseUint claims of its be­hav­iour (Go “strconv” 2026), though in the his­to­ry of Go 1 this has never been the case:

// Cf. https://jfrech.de/blog/300 (accessed 2026-04-06)
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	x, err := strconv.ParseUint(strings.Repeat("9", 20)+"nine", 10, 64)
	fmt.Println(x)
	fmt.Println(err)
}

// go version go1.26.1 linux/amd64
// Output:
// 18446744073709551615
// strconv.ParseUint: parsing "99999999999999999999nine": value out of range

Over­flow reporting takes pri­or­i­ty over read­ing the full string (Go “in­ter­nal/strconv” 2026), myopically declaring a whole slew of syntactical mal­for­ma­tions as out of range. Introduced in late 2008, with over sev­en­teen years of age, this be­hav­iour is over three years older than Go 1. One com­mit prior, in­va­lid digits were al­ways re­ject­ed but over­flow de­tec­tion was still stubbed. (Cox “lib/strconv” 2008)
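The effect can be reproduced with a deliberately simplified left-to-right parser. What follows is a sketch and not the standard library’s source: parseShortCircuit and parseStrict are hypothetical names, and the only difference between them is whether the syntactical check of the whole string happens before the overflow check gets a chance to return.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var (
	errSyntax = errors.New("invalid syntax")
	errRange  = errors.New("value out of range")
)

// parseShortCircuit reads digits left to right and returns as soon as the
// accumulator would overflow, without inspecting the remaining bytes.
func parseShortCircuit(s string) (uint64, error) {
	if s == "" {
		return 0, errSyntax
	}
	var x uint64
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c < '0' || c > '9' {
			return 0, errSyntax
		}
		d := uint64(c - '0')
		// x*10+d would exceed the maximum exactly when x > (max-d)/10.
		if x > (math.MaxUint64-d)/10 {
			return math.MaxUint64, errRange // the rest of s is never read
		}
		x = x*10 + d
	}
	return x, nil
}

// parseStrict validates every byte first and only then parses.
func parseStrict(s string) (uint64, error) {
	for i := 0; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return 0, errSyntax
		}
	}
	return parseShortCircuit(s)
}

func main() {
	const in = "99999999999999999999nine"
	_, err := parseShortCircuit(in)
	fmt.Println(err) // value out of range
	_, err = parseStrict(in)
	fmt.Println(err) // invalid syntax
}
```

The twentieth nine already overflows 64 bits, so the short-circuiting variant never reaches the letters.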

There isn’t much use of strconv.ParseUint or its signed sib­ling inside of the stan­dard li­brary, which was un­ex­pect­ed. Most packages just roll their own, not bothering with bit sizes, bases or underscores. One such lo­cal­ly-writ­ten im­ple­men­ta­tion is en­cod­ing/json/v2’s encoding/json/internal/jsonwire.ParseUint, which even has a unit test case for the be­hav­iour of an over­flow­ing dec­i­mal in­te­ger rep­re­sent­ation fol­low­ed by syn­tac­tic­ally il­le­gal text (Go “jsonwire test” 2025); had two equal­ly-named im­ple­men­ta­tions shared their test cases, strconv.ParseUint’s doc­u­men­ta­tion-im­ple­men­ta­tion dis­crep­an­cy might have been un­earth­ed then and there.
The overflow check of encoding/json/v2’s encoding/json/internal/jsonwire.ParseUint comes close to classifying as golfed: the condition “b[0] != '1' || v < 1e19” (Go “jsonwire” 2025) combines an integral-valued untyped floating-point constant’s ability to adopt the guise of a uint64 with certainty about overflow behaviour.
It is also highly data-dependent, as is strconv.ParseUint. With my implementation in hand, over −40 % faster as reported by a micro-benchmark, I looked at its performance integrated into a larger system, namely the encoding/json/v2 benchmarking test suite. Yet the results are inconclusive: jsonv2.bench.txtar
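Returning to the quoted overflow condition, a minimal sketch may show why it suffices for twenty-digit inputs without a leading zero. The names wrappingParse20 and overflows20 are hypothetical and not taken from jsonwire. A leading digit of two or more puts the value at 2e19 or above, which exceeds 1<<64-1. With a leading one, the true value lies below 2e19, so it wraps around at most once, and a wrapped result is at most 2e19 - 1 - 1<<64, which is smaller than 1e19.

```go
package main

import "fmt"

// wrappingParse20 assumes s consists of exactly twenty ASCII digits and
// accumulates them modulo 1<<64; Go's unsigned arithmetic wraps silently.
func wrappingParse20(s string) uint64 {
	var v uint64
	for i := 0; i < len(s); i++ {
		v = v*10 + uint64(s[i]-'0')
	}
	return v
}

// overflows20 reports whether the twenty-digit decimal s, which must not
// start with '0', exceeds 1<<64-1.
func overflows20(s string) bool {
	v := wrappingParse20(s)
	// 1e19 is an untyped floating-point constant with an integral value,
	// so it converts to uint64 in this comparison.
	return s[0] != '1' || v < 1e19
}

func main() {
	for _, s := range []string{
		"10000000000000000000",
		"18446744073709551615",
		"18446744073709551616",
		"19999999999999999999",
		"20000000000000000000",
	} {
		fmt.Println(s, overflows20(s))
	}
}
```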

From a correctness point of view, I find it astonishing how an often-touched (Griesemer “comment on Go issue 31197” 2019) (Neil “Go issue 46641” 2021) (Ulen “Go issue 21275” 2017) (Ulen “Go issue 21278” 2017), assumedly deliberately straight-forwardly-written routine’s undescribed behaviour escaped notice for over seventeen years, with a witness test case quietly sitting but a few packages down the street.

I’m hesitant about wanting this behaviour changed. On the one hand, it looks to me like an intricately confusing bug on which I see no plausibly useful dependence. On the other hand, I’ve gotten too jaded with backwards incompatibility to myself exclaim “let’s fix the bug.” (Cox “comment on Go issue 21278” 2017) See Go issue 78546, which tracks that question. (F. “Go issue 78546” 2026)

Fuzzing dif­fer­en­tial­ly against the stan­dard li­brary was the tech­nique that enriched a mere ex­er­cise in applying loop un­roll­ing and ILP-friend­ly op­er­a­tion order (Alex­an­dres­cu 2012) into a semantical ver­i­fi­ca­tion of a part of the Go stan­dard li­brary.

Apart from the introductorily mentioned application of AVX2 intrinsics, the two classical performance techniques I used are representing the computation in a way favourable for ILP and loop unrolling, both together named Pdui64_Unroll_*, which comes in at over −40 %:

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|b:Unroll_Appr02-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │           b:Unroll_Appr02           │
         │   sec/op    │   sec/op     vs base                │
Pdui64-8   37.06m ± 0%   19.99m ± 0%  -46.06% (p=0.000 n=30)
GOEXPERIMENT=simd go doc -src Pdui64_Unroll_Appr02
        x := 0 +
                uint64(digits[0]-'0')*10000000000000000000 +
                uint64(digits[1]-'0')*1000000000000000000 +
                uint64(digits[2]-'0')*100000000000000000 +
                uint64(digits[3]-'0')*10000000000000000 +
                uint64(digits[4]-'0')*1000000000000000 +
                uint64(digits[5]-'0')*100000000000000 +
                uint64(digits[6]-'0')*10000000000000 +
                uint64(digits[7]-'0')*1000000000000 +
                uint64(digits[8]-'0')*100000000000 +
                uint64(digits[9]-'0')*10000000000 +
                uint64(digits[10]-'0')*1000000000 +
                uint64(digits[11]-'0')*100000000 +
                uint64(digits[12]-'0')*10000000 +
                uint64(digits[13]-'0')*1000000 +
                uint64(digits[14]-'0')*100000 +
                uint64(digits[15]-'0')*10000 +
                uint64(digits[16]-'0')*1000 +
                uint64(digits[17]-'0')*100 +
                uint64(digits[18]-'0')*10 +
                uint64(digits[19]-'0')*1 +
                0

Where­as I at first blush as­sumed the two to be compounding (Alex­an­dres­cu 2012), separating Pdui64_Un­roll, which in­cludes ILP, out into Pdui64_Ilpon­ly and Pdui64_Un­rollon­ly fails to show their per­for­mance gains adding up, with ILP alone at under −30⁠ ⁠% and un­roll­ing alone out-com­pet­ing unrolled ILP-friend­ly:

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|d:Ilponly_Appr02/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │          d:Ilponly_Appr02           │
         │   sec/op    │   sec/op     vs base                │
Pdui64-8   37.06m ± 0%   26.00m ± 0%  -29.84% (p=0.000 n=30)
GOEXPERIMENT=simd go doc -src Pdui64_Ilponly_Appr02
        for j := range len(s) {
                c := s[len(s)-1-j]
                if !('0' <= c && c <= '9') {
                        return 0, false
                }
                x += uint64(c-'0') * pow10
                pow10 *= 10
        }

Pre-com­put­ing powers of ten on the stack, as Pdui64_Ilponly_Appr00 does, performs worse than com­put­ing them along­side parsing.

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|e:Unrollonly_Appr00/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │         e:Unrollonly_Appr00         │
         │   sec/op    │   sec/op     vs base                │
Pdui64-8   37.06m ± 0%   19.35m ± 1%  -47.78% (p=0.000 n=30)
GOEXPERIMENT=simd go doc -src Pdui64_Unrollonly_Appr00
        x = x*10 + uint64(digits[0]-'0')
        x = x*10 + uint64(digits[1]-'0')
        x = x*10 + uint64(digits[2]-'0')
        x = x*10 + uint64(digits[3]-'0')
        x = x*10 + uint64(digits[4]-'0')
        x = x*10 + uint64(digits[5]-'0')
        x = x*10 + uint64(digits[6]-'0')
        x = x*10 + uint64(digits[7]-'0')
        x = x*10 + uint64(digits[8]-'0')
        x = x*10 + uint64(digits[9]-'0')
        x = x*10 + uint64(digits[10]-'0')
        x = x*10 + uint64(digits[11]-'0')
        x = x*10 + uint64(digits[12]-'0')
        x = x*10 + uint64(digits[13]-'0')
        x = x*10 + uint64(digits[14]-'0')
        x = x*10 + uint64(digits[15]-'0')
        x = x*10 + uint64(digits[16]-'0')
        x = x*10 + uint64(digits[17]-'0')
        x = x*10 + uint64(digits[18]-'0')
        x = x*10 + uint64(digits[19]-'0')

Cu­ri­ous­ly, a very branch-free but highly data-de­pen­dent im­ple­men­ta­tion turns out to be most per­for­mant.
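To make the two shapes of the computation tangible on a small scale, here is a four-digit sketch with the hypothetical names horner4 and dot4. It assumes its input to be four ASCII digits and performs no validation.

```go
package main

import "fmt"

// horner4 forms one serial chain: every multiplication has to wait for
// the addition before it.
func horner4(d [4]byte) uint64 {
	var x uint64
	x = x*10 + uint64(d[0]-'0')
	x = x*10 + uint64(d[1]-'0')
	x = x*10 + uint64(d[2]-'0')
	x = x*10 + uint64(d[3]-'0')
	return x
}

// dot4 spells out the dot product with powers of ten; the four products
// are mutually independent and only the additions combine them.
func dot4(d [4]byte) uint64 {
	return uint64(d[0]-'0')*1000 +
		uint64(d[1]-'0')*100 +
		uint64(d[2]-'0')*10 +
		uint64(d[3]-'0')*1
}

func main() {
	d := [4]byte{'2', '0', '2', '6'}
	fmt.Println(horner4(d), dot4(d)) // 2026 2026
}
```

Both functions compute the same value; which of the two a given CPU executes faster is an empirical matter, as the measurements in this text show.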

One prop­er­ty of Go’s stan­dard com­pil­er that shines in this anal­y­sis is its re­served stance on op­ti­mi­za­tion. Where­as with a modern C or C++ com­pil­er, one couldn’t as di­rect­ly relate source-level op­ti­mi­za­tion techniques to the CPU’s view of the pro­gram, global rewrites here don’t happen by them­selves. One can see Pdui64_Un­roll_Appr02’s constants pre­sent in the ob­ject code and Pdui64_Unrollonly_Appr00 on­ly know­ing of 1e19, in hexa­dec­i­mal $0x8ac7230489e80000:

GOEXPERIMENT=simd go test -o /tmp/bin -c && go tool objdump -S -s Pdui64_Unroll_Appr02 /tmp/bin

                uint64(digits[0]-'0')*10000000000000000000 +
  0x5a9c3b              0fb6c9                  MOVZX CL, CX
  0x5a9c3e              49bf0000e8890423c78a    MOVQ $0x8ac7230489e80000, R15
  0x5a9c48              490fafcf                IMULQ R15, CX
                uint64(digits[1]-'0')*1000000000000000000 +
  0x5a9c4c              0fb6d2                  MOVZX DL, DX
  0x5a9c4f              49bf000064a7b3b6e00d    MOVQ $0xde0b6b3a7640000, R15
  0x5a9c59              4c0faffa                IMULQ DX, R15
                uint64(digits[0]-'0')*10000000000000000000 +
  0x5a9c5d              4c01f9                  ADDQ R15, CX
                uint64(digits[2]-'0')*100000000000000000 +
  0x5a9c60              400fb6d6                MOVZX SI, DX
  0x5a9c64              48be00008a5d78456301    MOVQ $0x16345785d8a0000, SI
  0x5a9c6e              480faff2                IMULQ DX, SI
                uint64(digits[1]-'0')*1000000000000000000 +
  0x5a9c72              4801f1                  ADDQ SI, CX

GOEXPERIMENT=simd go test -o /tmp/bin -c && go tool objdump -S -s Pdui64_Unrollonly_Appr00 /tmp/bin

        x = x*10 + uint64(digits[17]-'0')
  0x5ab3ab              4801c9                  ADDQ CX, CX
  0x5ab3ae              488d0c89                LEAQ 0(CX)(CX*4), CX
  0x5ab3b2              0fb6542432              MOVZX 0x32(SP), DX
  0x5ab3b7              0fb6d2                  MOVZX DL, DX
  0x5ab3ba              4801d1                  ADDQ DX, CX
        x = x*10 + uint64(digits[18]-'0')
  0x5ab3bd              4801c9                  ADDQ CX, CX
  0x5ab3c0              488d0c89                LEAQ 0(CX)(CX*4), CX
  0x5ab3c4              0fb6542431              MOVZX 0x31(SP), DX
  0x5ab3c9              0fb6d2                  MOVZX DL, DX
  0x5ab3cc              4801d1                  ADDQ DX, CX
        x = x*10 + uint64(digits[19]-'0')
  0x5ab3cf              4801c9                  ADDQ CX, CX
  0x5ab3d2              488d0c89                LEAQ 0(CX)(CX*4), CX
  0x5ab3d6              0fb6d3                  MOVZX BL, DX
  0x5ab3d9              488d040a                LEAQ 0(DX)(CX*1), AX
        if len(s) == len(maxUint64) && (s[0] != '1' || x < 1e19) {
  0x5ab3dd              488b4c2450              MOVQ 0x50(SP), CX
  0x5ab3e2              4883f914                CMPQ CX, $0x14
  0x5ab3e6              7527                    JNE 0x5ab40f
  0x5ab3e8              488b4c2448              MOVQ 0x48(SP), CX
  0x5ab3ed              803931                  CMPB 0(CX), $0x31
  0x5ab3f0              7513                    JNE 0x5ab405
  0x5ab3f2              48b90000e8890423c78a    MOVQ $0x8ac7230489e80000, CX
  0x5ab3fc              0f1f4000                NOPL 0(AX)
  0x5ab400              4839c1                  CMPQ CX, AX
  0x5ab403              760a                    JBE 0x5ab40f
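Reading the second listing, each multiplication by ten appears as an ADDQ followed by a LEAQ instead of an IMULQ. A small sketch, with the hypothetical name mul10, mirrors those two steps in Go and also confirms that the immediate $0x8ac7230489e80000 is 1e19.

```go
package main

import "fmt"

// mul10 mirrors the instruction pair from the listing: ADDQ CX, CX doubles
// the value and LEAQ 0(CX)(CX*4), CX then computes CX + CX*4.
func mul10(x uint64) uint64 {
	x = x + x   // ADDQ CX, CX:            2x
	x = x + x*4 // LEAQ 0(CX)(CX*4), CX:   2x + 8x = 10x
	return x
}

func main() {
	fmt.Println(mul10(1844))             // 18440
	fmt.Printf("%#x\n", uint64(1e19))    // 0x8ac7230489e80000
}
```

Since both steps wrap modulo 1<<64 just as the multiplication does, the rewrite holds for every input, overflowing ones included.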

Loop unrolling is an ancient technique, anecdotally the bedrock of PostScript’s market dominance from 1984 onwards (Brailsford 2016), still used to great effect nearly thirty years later (Alexandrescu 2012), and one whose performance advantages I was able to show when fiddling with strconv.ParseUint’s implementation.
Nev­er­the­less, varying ex­e­cu­tion char­ac­ter­is­tics for, from within the lan­guage’s se­man­tics, η-eq­uiv­a­lent programs constitutes a drippingly leaky ab­strac­tion, firmly nestled at the equiv­o­cal in­ter­play be­twixt re­al­i­ty’s phys­i­cal­i­ty and com­pre­hen­si­bil­i­ty. (van Har­den­berg 2022, i) Its reign as a go-to op­ti­mi­za­tion tech­nique is more­over con­fined to the higher ends of com­put­ing, with suf­fi­cient level-one in­struc­tion caches cush­ion­ing a bloat­ed pro­gram text. When stepping out­side these pa­ram­e­ters, the heuristics fall apart, and with­out vig­i­lant system-wide in­te­gra­tion test­ing may cul­mi­nate in weighty per­for­mance pe­nal­iza­tions, such as was likely the case thirty years ago with the Nintendo⁠ ⁠® 64⁠ ⁠™’s launch title. (Emanuar 2024)

With SIMD re­sound­ing through the lands the past cou­ple of months (Boreham 2025) (Knyszek and Clements 2025), I had teetered on the edge of hand-writ­ing some .s for quite some while when read­ing about GOEXPERIMENT=simd (Go “simd/archsimd” 2026), which got me ex­cit­ed to less cir­cu­i­tously re­al­ize a long-cher­ish­ed dream of mine: to write some SIMD.

GOEXPERIMENT=simd go doc -src Pdui64_Avx2_Appr05
package pdui // import "."

func Pdui64_Avx2_Appr05(s string) (x uint64, ok bool) {
        switch {
        case !archsimd.X86.AVX2():
                panic("simd/archsimd.X86Features.AVX2 isn't supported")

        case len(s) == 0 || len(s) > 20:
                return 0, false

        case s[0] == '0':
                return 0, s == "0"

        case len(s) == 20:
                if s > "18446744073709551615" {
                        return 0, false
                }
                fallthrough

        default:
                switch {
                case len(s) <= 16:
                        return pdui64_Avx2_Appr05_16(s)

                default:
                        lo, loOk := pdui64_Avx2_Appr05_16(s[len(s)-16:])
                        hi, hiOk := pdui64_Avx2_Appr05_4(s[:len(s)-16])
                        if !loOk || !hiOk {
                                return 0, false
                        }
                        return uint64(hi)*1_0000_0000_0000_0000 + lo, true
                }
        }
}

GOEXPERIMENT=simd go doc -src -u pdui64_Avx2_Appr05_16
package pdui // import "jfrech.de/blog/300/pdui"

func pdui64_Avx2_Appr05_16(s string) (x uint64, ok bool) {
        if len(s) > 16 {
                panic("impossible")
        }

        var raw = [16]byte{
                '0', '0', '0', '0', '0', '0', '0', '0',
                '0', '0', '0', '0', '0', '0', '0', '0',
        }
        copy(raw[len(raw)-len(s):],
                unsafe.Slice(unsafe.StringData(s), len(s)))
        ascii := archsimd.LoadUint8x16(&raw)

        a := ascii.Sub(archsimd.BroadcastUint8x16('0'))
        if a.GreaterEqual(archsimd.BroadcastUint8x16(10)).
                ToBits() != 0 {
                return 0, false
        }

        var b archsimd.Uint32x8 = a.AsInt8x16().
                ExtendToInt16().DotProductPairs(
                int16x16_thousandHundredTenOne).AsUint32x8()
        var c archsimd.Uint32x8 = b.Mul(uint32x8_tenthousandOne)

        lo := c.GetLo()
        hi := c.GetHi()

        var d [8]uint32
        lo.Store((*[4]uint32)(d[0:4]))
        hi.Store((*[4]uint32)(d[4:8]))

        return uint64(d[0]+d[1]+d[2]+d[3])*1_0000_0000 +
                uint64(d[4]+d[5]+d[6]+d[7]), true
}

GOEXPERIMENT=simd go doc -src -u int16x16_thousandHundredTenOne | grep -i int16x16_thousandHundredTenOne -A 5
        int16x16_thousandHundredTenOne = archsimd.LoadInt16x16(new([16]int16{
                1000, 100, 10, 1,
                1000, 100, 10, 1,
                1000, 100, 10, 1,
                1000, 100, 10, 1,
        }))

GOEXPERIMENT=simd go doc -src -u uint32x8_tenthousandOne | grep -i uint32x8_tenthousandOne -A 3
        uint32x8_tenthousandOne = archsimd.LoadUint32x8(new([8]uint32{
                1_0000, 1_0000, 1, 1,
                1_0000, 1_0000, 1, 1,
        }))

GOEXPERIMENT=simd go doc -src -u pdui64_Avx2_Appr05_4
package pdui // import "jfrech.de/blog/300/pdui"

func pdui64_Avx2_Appr05_4(s string) (x uint16, ok bool) {
        if len(s) > 4 {
                panic("impossible")
        }

        if len(s) < 1 || len(s) > 16 {
                panic("impossible")
        }

        var raw = [4]byte{
                '0', '0', '0', '0',
        }
        copy(raw[len(raw)-len(s):],
                unsafe.Slice(unsafe.StringData(s), len(s)))

        if !('0' <= raw[0] && raw[0] <= '9' &&
                '0' <= raw[1] && raw[1] <= '9' &&
                '0' <= raw[2] && raw[2] <= '9' &&
                '0' <= raw[3] && raw[3] <= '9') {
                return 0, false
        }

        return 1000*uint16(raw[0]-'0') +
                100*uint16(raw[1]-'0') +
                10*uint16(raw[2]-'0') +
                1*uint16(raw[3]-'0'), true
}
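To follow the lane arithmetic of pdui64_Avx2_Appr05_16 without SIMD at hand, the steps can be modelled in scalar Go. This is an emulation under assumptions and uses none of the simd/archsimd calls: parse16Scalar is a hypothetical name, the pairing of adjacent lanes is my reading of DotProductPairs, and, unlike the listing, the sketch rejects empty and overlong input instead of panicking.

```go
package main

import "fmt"

// parse16Scalar emulates, lane by lane, the arithmetic of the vectorized
// routine for a string of one to sixteen digits.
func parse16Scalar(s string) (uint64, bool) {
	if len(s) == 0 || len(s) > 16 {
		return 0, false
	}
	// Right-align the input in a field of ASCII zeros.
	raw := [16]byte{
		'0', '0', '0', '0', '0', '0', '0', '0',
		'0', '0', '0', '0', '0', '0', '0', '0',
	}
	copy(raw[len(raw)-len(s):], s)

	// Subtract '0' in every lane; a wrapped result of ten or more marks
	// a byte that is not a digit.
	var a [16]int32
	for i, ch := range raw {
		d := ch - '0'
		if d >= 10 {
			return 0, false
		}
		a[i] = int32(d)
	}

	// Multiply by 1000, 100, 10, 1 and add adjacent pairs: sixteen lanes
	// become eight.
	weights := [4]int32{1000, 100, 10, 1}
	var b [8]uint32
	for i := range b {
		b[i] = uint32(a[2*i]*weights[(2*i)%4] + a[2*i+1]*weights[(2*i+1)%4])
	}

	// Scale the first two lanes of each group of four by ten thousand.
	scale := [8]uint32{1_0000, 1_0000, 1, 1, 1_0000, 1_0000, 1, 1}
	var c [8]uint32
	for i := range c {
		c[i] = b[i] * scale[i]
	}

	// Each half now sums to an eight-digit number.
	upper := c[0] + c[1] + c[2] + c[3]
	lower := c[4] + c[5] + c[6] + c[7]
	return uint64(upper)*1_0000_0000 + uint64(lower), true
}

func main() {
	fmt.Println(parse16Scalar("1234567890123456")) // 1234567890123456 true
	fmt.Println(parse16Scalar("12a4"))             // 0 false
}
```

No lane exceeds 99000000 before the final sums, so 32-bit lanes suffice throughout.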

Yet after days of com­ing up with var­i­ous AVX2-based Pdui64 im­ple­men­ta­tions, I couldn’t over­take the stan­dard li­brary in per­for­mance, with my best-per­form­ing Pdui64_Avx2_Appr05 being +⁠11.66⁠ ⁠% be­hind.

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|c:Avx2_Appr05-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │            c:Avx2_Appr05            │
         │   sec/op    │   sec/op     vs base                │
Pdui64-8   37.06m ± 0%   41.37m ± 0%  +11.66% (p=0.000 n=30)

One could ask if Pdui64_Avx2_Appr05 may ben­e­fit from func­tion inlining, which is firmly answered neg­a­tive­ly by both Pdui64_Avx2_Appr16 and Pdui64_Avx2_Appr17.

After having first switched gears from Pdui64 to Pdui32, as AVX2’s register size felt more equipped to handle parsing of only ten digits, fitting into a uint8x16 register, I was befuddled that all my implementation attempts performed within ±0.01 % of each other. Confused by this precise a match, I cooked up wild speculations about the microcode optimizer possibly fully comprehending the problem and thus behaving equally across all representations I had provided. Alas, I had tested Pdui32 against the string representations of uniformly chosen 64-bit integers, effectively only hitting the common code path of rejecting long strings.
Bench­mark­ing over string rep­re­sent­ations of uni­form­ly cho­sen 32-bit integers shows AVX2 intrinsics to here be ben­e­fi­cial.

benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|b:Unroll_Appr00-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │           b:Unroll_Appr00           │
         │   sec/op    │   sec/op     vs base                │
Pdui32-8   22.25m ± 0%   16.15m ± 0%  -27.38% (p=0.000 n=30)
benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|c:Avx2_Appr11-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │            c:Avx2_Appr11            │
         │   sec/op    │   sec/op     vs base                │
Pdui32-8   22.25m ± 0%   13.13m ± 0%  -40.97% (p=0.000 n=30)

Since a data-dependent right shift of an entire SIMD register doesn’t seem feasible, cf. Pdui64_Avx2_Appr00, one may be tempted to statically permute all digits to lie in reverse order, assume the most-significant digit is at its highest place and divide by the appropriate power of ten at the very end. Unfortunately, neither 1<<64-1 nor 1<<32-1 has as its most-significant decimal digit the digit nine, causing overflows when not calculating in a bit width one above the parsée. As such, implementing Pdui64 is off the table in both AVX2 and AVX-512. 64-bit wide multiplication simd/archsimd.Uint64x2.Mul being AVX-512-only has halted my pursuit of this approach for Pdui32, as I don’t currently have access to an AVX-512-capable processor.
Any­how, Pdui32_Avx2_Appr05 woe­ful­ly un­der­per­forms:

benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|c:Avx2_Appr05-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │            c:Avx2_Appr05             │
         │   sec/op    │   sec/op     vs base                 │
Pdui32-8   22.25m ± 0%   51.31m ± 0%  +130.65% (p=0.000 n=30)
GOEXPERIMENT=simd go doc -src Pdui32_Avx2_Appr05
package pdui // import "."

func Pdui32_Avx2_Appr05(s string) (uint32, bool) {
        if s == "" ||
                s[0] == '0' && len(s) != 1 ||
                len(s) > len(maxUint32) ||
                len(s) == len(maxUint32) && s > maxUint32 {
                return 0, false
        }

        n := len(s)
        a := archsimd.LoadUint8x16SlicePart(
                unsafe.Slice(unsafe.StringData(s), n))
        b := a.PermuteOrZero(int8x16_reverse)
        if b.Sub(archsimd.BroadcastUint8x16('0')).
                Greater(archsimd.BroadcastUint8x16(9)).
                ToBits()>>(16-n) != 0 {
                return 0, false
        }
        c := b.SubSaturated(archsimd.BroadcastUint8x16('0'))
        d := c.ExtendToUint16()
        e := d.Mul(
                uint16x16_1_10_1_10_1_10_1_10_1_10_100_1000_1_10_100_1000)

        f := e.AddPairsGrouped(e)
        g := f.AddPairsGrouped(f)
        h0, h1 := g.GetLo(), g.GetHi()
        i := h0.InterleaveLo(h1)
        j := i.ExtendLo4ToUint32()
        var z [4]uint32
        j.Store(&z)

        x := uint64(z[3])*1_00_0000 + uint64(z[2]+z[1]*1_00)
        for range 10 - len(s) {
                x /= 10
        }
        return uint32(x), true
}

In light of these results, with my Pdui64 implementations not benefitting from AVX2 across multiple approaches, as opposed to Pdui32, the fact that folly::to<unsigned long long> (Folly 2012, i) was in 2012 not founded on SIMD acceleration (Alexandrescu 2012, ii) might be due to AVX-512 not being proposed for over a year (Reinders 2013) after Folly’s public release (Alexandrescu 2012, iii) and only hitting the market in Q 4 of 2016 (Intel ® n. d.). Then again, Go and C++ might be incomparable enough to not warrant translating the possibility of AVX-512 aiding Pdui64, extrapolated from AVX2 having aided Pdui32. After all, present-day folly::to<unsigned long long> (Folly 2026, i) doesn’t appear to make use of SIMD, either.

Final bench­marks were com­piled with Go 1.26.1 and run on Debian⁠ ⁠® 13.3 with Linux⁠ ⁠® 6.12.74 from a cold boot with­out a graphical en­vi­ron­ment run­ning on an Intel⁠ ⁠® Core⁠ ⁠™ i7⁠-⁠4790K, which was released in Q⁠ ⁠2 of 2014 and whose in­struc­tion set ar­chi­tec­ture, short ISA, in­cludes 256-bit-wide AVX2 but ex­cludes 512-bit-wide AVX-512.
How­ev­er, bench­mark runs of varying pa­ram­e­ters run­ning on a system in graphical use and un­clear state influenced im­ple­men­ta­tions and anal­y­sis, con­sti­tuting a bias of un­known ex­tent.
Benchmarks’ input distributions were a million uniformly-picked bitness-appropriate integers’ decimal string forms. Benchmarks were shuffled, with cross-run-contamination unlikely to be present:

benchstat -col /implementation@alpha -filter '.name:"Pdui64" AND .fullname:/Std-|a:Std_shuffledIn-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │       a:Std_shuffledIn        │
         │   sec/op    │   sec/op     vs base          │
Pdui64-8   37.06m ± 0%   36.97m ± 1%  ~ (p=0.504 n=30)
benchstat -col /implementation@alpha -filter '.name:"Pdui32" AND .fullname:/Std-|a:Std_shuffledIn-/' testdata/2026-03-28.bench
goos: linux
goarch: amd64
pkg: jfrech.de/blog/300/pdui
cpu: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
         │    a:Std    │       a:Std_shuffledIn        │
         │   sec/op    │   sec/op     vs base          │
Pdui32-8   22.25m ± 0%   22.21m ± 0%  ~ (p=0.063 n=30)
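For orientation, inputs of the kind described above could be produced along the following lines. This is a sketch and not the package’s actual generator: inputs64, inputs32 and maxLen are hypothetical names, and the fixed seed is an assumption made for reproducibility.

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"strconv"
)

// inputs64 returns n decimal strings of uniformly drawn uint64 values.
func inputs64(n int) []string {
	r := rand.New(rand.NewPCG(300, 300)) // arbitrary fixed seed
	in := make([]string, n)
	for i := range in {
		in[i] = strconv.FormatUint(r.Uint64(), 10)
	}
	return in
}

// inputs32 returns n decimal strings of uniformly drawn uint32 values.
func inputs32(n int) []string {
	r := rand.New(rand.NewPCG(300, 300)) // arbitrary fixed seed
	in := make([]string, n)
	for i := range in {
		in[i] = strconv.FormatUint(uint64(r.Uint32()), 10)
	}
	return in
}

// maxLen returns the length of the longest string in in.
func maxLen(in []string) int {
	m := 0
	for _, s := range in {
		m = max(m, len(s))
	}
	return m
}

func main() {
	fmt.Println(len(inputs64(1_000_000)), maxLen(inputs64(1000)), maxLen(inputs32(1000)))
}
```

Drawing the 32-bit inputs from 32-bit values is what keeps a Pdui32 benchmark from degenerating into a test of its length check.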

Pdui im­ple­men­ta­tions are BSD-licenced, avail­able ei­ther through https://pkg.go.dev/jfrech.de/blog/300/pdui or git clone https://jfrech.de/blog/300/pdui.git, where­from withal raw bench­mark results can be perused.

Out of this anal­y­sis fell noticing a semantically rel­e­vant doc­u­men­ta­tion-im­ple­men­ta­tion di­ver­gence for strconv.ParseUint and in micro-bench­marks a −40⁠ ⁠% faster Pdui64 im­ple­men­ta­tion through clas­si­cal methods to­geth­er with a ben­e­fi­cial­ly AVX2-intrinsics-aided Pdui32 im­ple­men­ta­tion.
I wasn’t able to get a performance edge over classical methods through AVX2 in Pdui64, which is what I had initially set out to find, leaving me with the feeling that totalling up a wide register isn’t a task that gains from AVX2; pairwise add operations have to be repeated to calculate sums multiple times in parallel, of which all but one get immediately discarded. Even Pdui32 only benefitted from the syntactical check being written as a wide register compare, the calculation’s core remaining classical.
What surprised me is the delicate interplay between conceptual-lexical ILP-friendliness and unrolling; both have clearly observable performance-favourable implications in isolation, yet they do not in general compound. That is to say, thinking in those macro descriptions of CPU behaviour can lead one to untapped performance, though the total lack of a unifying theory inescapably shows through. All benchmarks this text is based on were run on a used decade-old processor; I have little doubt they are not representative of wide classes of hardware, with a likely fruitful next step being to look at how AVX-512 instructions or different non-x86 ISAs altogether fare.