Giddily Gitting it wrong (#278)

Jonathan Frech, 30 September 2023

With GitHub soon closing its door for me with their ongoing 2FA fad⁠¹ and the winter semester starting tomorrow, it’s about time to yank the Joy Assembler repo out of US-controlled grasp and refactor the project into a 2023-presentable state. Fortunately, I have not made use of GitHub’s non-Git offerings (wiki, issue tracker, CI/CD, …), making the hostage rescue a simple git clone --mirror.

Skimming the Git log, however, I noticed to my dismay inconsistencies in my author and committer names and e-mail addresses: git log --pretty=format:'%an <%ae> | %cn <%ce>' | sort | uniq even reports the committer “GitHub <noreply@github.com>” for the initial commit. I find that a tad odd since it implies I only authored my project’s first commit while GitHub committed it. GitHub has thereby immortalised its brand as the committer: meddling with the very first commit, after all, invalidates all object IDs.
Following a bettered SemVer⁠² understanding, I also want to re-tag v1.0 to v0.1.0. So fully obliterating compatibility poses no hindrance for me.
Thus, seeing October as a fresh start, I want to both move repo hosting and sanitise my repo history’s metadata. All without losing project history, which I find important to keep: bug introductions, design considerations and personal development in how to code make a repo’s history worthwhile. Defining the HEAD tree as the new initial commit, consequently, is not an option. But neither is mucking around with Git’s default CLI, as losing a commit or accidentally updating the committer date to today is too likely an outcome.
What to do?⸺Write a full Git database implementation from scratch (at least enough to read in my repo of interest) and use said API to write a one-off tool to perform the above described facelift.
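
Why rewriting the initial commit’s committer invalidates all object IDs can be shown in a few lines. The following is a minimal sketch, assuming a SHA-1 repository: an object’s ID is the SHA-1 of the header “<type> <size>\0” followed by the raw content. The function name objectID and the commit contents (names, addresses, timestamps) are made up for illustration and are not taken from my library.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// objectID returns the hex object ID a SHA-1 Git repository assigns to an
// object of the given type ("blob", "tree", "commit", "tag") and raw content.
func objectID(typ string, content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", typ, len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	// A made-up root commit pointing at the empty tree; all metadata
	// below is placeholder data.
	const tmpl = "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n" +
		"author A U Thor <author@example.com> 1600000000 +0000\n" +
		"committer %s 1600000000 +0000\n" +
		"\n" +
		"Initial commit\n"

	before := objectID("commit", []byte(fmt.Sprintf(tmpl, "GitHub <noreply@github.com>")))
	after := objectID("commit", []byte(fmt.Sprintf(tmpl, "A U Thor <author@example.com>")))

	fmt.Println(before)
	fmt.Println(after)
	fmt.Println("IDs differ:", before != after) // IDs differ: true
}
```

Since every child commit embeds its parent’s ID in a “parent” line, the changed root ID changes the content of the second commit, hence its ID, and so on up to HEAD.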

-=-

Noteworthily, I have wished for a pure Go implementation to manipulate Git repositories for a while now; on each HTTP request, my current webgit.jfrech.com backend spawns the Git command to clone the entire repo into a temporary directory, reads the one file requested and destroys everything again. Being able to attach an "io/fs".FS directly to a repo’s commit’s tree would be a far more elegant solution.
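
To illustrate what consuming a tree through "io/fs".FS could look like, here is a rough sketch. It is not my library’s API: commitTree and readFromCommit are hypothetical names, and the in-memory "testing/fstest".MapFS with placeholder files merely stands in for a tree decoded from a real commit.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// commitTree is a stand-in for the tree of a decoded Git commit. A real
// implementation would resolve paths lazily against the object database;
// here the files are placeholders held in memory.
func commitTree() fs.FS {
	return fstest.MapFS{
		"README":      &fstest.MapFile{Data: []byte("placeholder readme\n")},
		"src/main.go": &fstest.MapFile{Data: []byte("package main\n")},
	}
}

// readFromCommit reads one file by path, the way an HTTP handler could,
// without cloning anything into a temporary directory.
func readFromCommit(name string) (string, error) {
	data, err := fs.ReadFile(commitTree(), name)
	return string(data), err
}

func main() {
	s, err := readFromCommit("src/main.go")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(s)
}
```

Anything which accepts an fs.FS (fs.ReadFile, fs.WalkDir, "net/http".FS) would then work on repository contents unchanged.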
Another motivation is over 1TiB of TAR’d bare and non-bare Git repositories which accrued over the past years for fear of someone up to mischief stealing my credentials, my VPS provider’s data centres burning down or a careless git push --force (of course followed by an aggressive garbage collection to truly lose data). Verifying every single repo against an up-to-date state and reviewing all dangling objects is not something I look forward to doing manually.
Benefiting from the oh so often touted wonders of open-source (and sometimes even half-free to free) third-party libraries is not an option for me since i) what I truly relish is Go’s ethos bringing the UNIX way of computing into the networked world of computing. Alas, this design philosophy seems to me to be respected by few, followed by fewer. And if you have a clumsy Java-like Oracle-approved library written in Go, you don’t have a Go library. Relatedly, ii) I evermore have trouble trusting feature-bloated software, an unfortunate fate of many popular (correlating with functional, usable and well-tested⸺all desired properties) solutions (I find Git itself to be too fat to truly trust). Speaking of trust, iii) on top of everything noted, it is increasingly hard in this world to trust software. And if I have to read the entire library to verify it, then pin one particular version, I can just as well cut out the middleman and write it myself.

So I started out last Tuesday parsing my bare Git repository. Five days of staring at garbled binary representations later, I managed to fully parse an objects/pack/pack-*.pack file, my implementation achieving bit-for-bit equality with Git v2.41.0’s git verify-pack -v. A bumpy ride it was:

Most disheartening was Git’s awful documentation regarding the packfile binary format (overly superficial and in my opinion in various key points plain wrong). With git-scm.com⁠³, in parts a mirror of the official Git documentation, dominating search results, Git’s source convoluted and packfile format intrinsics coverage sporadic, the packfile revealed itself to me only bit by bit.

Most eye-opening for me was the following.

“size” can refer to i) a constructed object’s (OBJ_COMMIT, OBJ_TREE, OBJ_BLOB (and probably OBJ_TAG, though I have not yet encountered it)) size, which coincides with git cat-file -s’ output; ii) its compressed size, which is the size of the packfile section where the object resides; or iii) a constructing object’s (OBJ_OFS_DELTA (and probably OBJ_REF_DELTA, though I have not yet encountered it)) uncompressed and compressed sizes, which differ from both.
The packfile object header only records uncompressed object size (the size of the constructed object or the size of the delta instructions) but the packfile stores compressed data. A zlib reader should know when to stop consuming input.
These sizes are encoded using at least three different methods: 32-bit unsigned big-endian; variable-length unsigned little-endian (the common varint), with the pack object header additionally cramming in three bits of object type information (“The per-object header is a pretty dense thing”⁠⁴); and variable-length unsigned big-endian (!), where redundancy in elongated representations is used to represent more integers in fewer bytes (a discrepancy of 1<<7 or (1<<7)+(1<<14) to a known, correctly parsed value is indicative).
Arbitrary data often parses to something. Just because the bytes could represent an offset does not mean that following it leads to anything.
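
The three integer encodings listed above can be sketched as follows. This is a minimal illustration following my reading of gitformat-pack, not an excerpt from my library; the function names packHeader, objectHeader and ofsDeltaOffset are made up, and overflow of overlong encodings is not checked.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errTruncated = errors.New("truncated input")

// packHeader decodes the 12-byte packfile header: the signature "PACK",
// then version and object count as 32-bit unsigned big-endian integers.
func packHeader(b []byte) (version, count uint32, err error) {
	if len(b) < 12 || string(b[:4]) != "PACK" {
		return 0, 0, errors.New("not a packfile header")
	}
	return binary.BigEndian.Uint32(b[4:8]), binary.BigEndian.Uint32(b[8:12]), nil
}

// objectHeader decodes a per-object header. The first byte holds a
// continuation bit, three type bits and the four lowest size bits; every
// further byte contributes seven more size bits, least significant first.
// n is the number of header bytes consumed.
func objectHeader(b []byte) (typ uint8, size uint64, n int, err error) {
	if len(b) == 0 {
		return 0, 0, 0, errTruncated
	}
	c := b[0]
	typ = (c >> 4) & 0x07
	size = uint64(c & 0x0f)
	n = 1
	for shift := uint(4); c&0x80 != 0; shift += 7 {
		if n == len(b) {
			return 0, 0, 0, errTruncated
		}
		c = b[n]
		n++
		size |= uint64(c&0x7f) << shift
	}
	return typ, size, n, nil
}

// ofsDeltaOffset decodes the backwards offset of an OBJ_OFS_DELTA: seven
// bits per byte, most significant first, with one added before every
// shift so that no value has more than one representation.
func ofsDeltaOffset(b []byte) (offset uint64, n int, err error) {
	if len(b) == 0 {
		return 0, 0, errTruncated
	}
	c := b[0]
	offset = uint64(c & 0x7f)
	n = 1
	for c&0x80 != 0 {
		if n == len(b) {
			return 0, 0, errTruncated
		}
		c = b[n]
		n++
		offset = (offset+1)<<7 | uint64(c&0x7f)
	}
	return offset, n, nil
}

func main() {
	version, count, _ := packHeader([]byte("PACK\x00\x00\x00\x02\x00\x00\x00\x03"))
	fmt.Println(version, count) // 2 3

	typ, size, n, _ := objectHeader([]byte{0x95, 0x0a})
	fmt.Println(typ, size, n) // 1 165 2 (type 1 is OBJ_COMMIT)

	off, _, _ := ofsDeltaOffset([]byte{0x80, 0x00})
	fmt.Println(off) // 128, where a plain big-endian varint would yield 0
}
```

The last call shows the tell-tale discrepancy: decoding the two bytes without the added one gives 0 instead of 128, that is, a value off by 1<<7.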

The following is a list of references I got use out of. Invaluable was furthermore a sufficiently large packfile to test hypotheses against.

[T23a] Linus Torvalds et al.: “git/Documentation/gitformat-pack.txt”. 2023-09-22. Online: https://git.kernel.org/pub/scm/git/git.git/tree/Documentation/gitformat-pack.txt?id=493f4622739e9b64f24b465b21aa85870dd9dc09 [accessed 2023-09-30]
[T23b] Linus Torvalds et al.: “git/pack-write.c”. 2023-09-21. Online: https://git.kernel.org/pub/scm/git/git.git/tree/pack-write.c?id=bcb6cae2966cc407ca1afc77413b3ef11103c175 [accessed 2023-09-27]
[C11] Scott Chacon et al.: “Git Community Book” (7. Internals and Plumbing ; The Packfile). 2008—2011. Online: https://shafiul.github.io/gitbook/7_the_packfile.html [accessed 2023-09-28]
Git v2.41.0’s git verify-pack -v and man git-verify-pack
[M] Aditya Mukerjee: “Unpacking Git packfiles”. Undated. Online: https://codewords.recurse.com/issues/three/unpacking-git-packfiles [accessed 2023-09-29]
[S21] Caleb Sander: “Git Internals part 2: packfiles”. 2021-11-27. Online: https://dev.to/calebsander/git-internals-part-2-packfiles-1jg8 [accessed 2023-09-29]

I am also convinced that Git v2.41.0’s git verify-pack -s -v is bugged as its output is identical when removing the -v flag despite the manpage saying the flag had an effect.
I find it furthermore unfortunate how Go’s "compress/zlib".NewReader type-system-circumventingly imposes the io.ByteReader interface when one does not want input bytes after the compressed data to mysteriously vanish. Due to a very subtle interface contract postulation that a non-nil error implies no bytes were read, an io.ByteReader cannot be constructed from an io.Reader without this construction itself retaining possible overread. Since I do not want to force in my API’s contract the use of *bufio.Reader, I feel forced to use interface{io.Reader; io.ByteReader}, which meshes poorly with the rest of io (e. g. there is no “io.TeeByteReader”).
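
The vanishing bytes can be reproduced with the standard library alone. The following sketch (the names inflate, readerOnly and demo as well as the payload strings are made up) compresses a short payload, appends bytes standing in for the next pack entry and decompresses twice: once from a *bytes.Reader, which implements io.ByteReader, and once through a wrapper exposing only Read.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// inflate decompresses exactly one zlib stream read from r.
func inflate(r io.Reader) ([]byte, error) {
	zr, err := zlib.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// readerOnly hides every method of the wrapped reader except Read, so
// zlib.NewReader no longer sees an io.ByteReader and buffers on its own.
type readerOnly struct{ r io.Reader }

func (o readerOnly) Read(p []byte) (int, error) { return o.r.Read(p) }

// demo returns the decompressed payload and whatever is left in the
// source after decompression.
func demo(hideByteReader bool) (data, rest string, err error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write([]byte("tree data"))
	zw.Close()
	buf.WriteString("NEXT OBJECT") // bytes following the compressed stream

	src := bytes.NewReader(buf.Bytes())
	var in io.Reader = src
	if hideByteReader {
		in = readerOnly{src}
	}

	d, err := inflate(in)
	if err != nil {
		return "", "", err
	}
	r, err := io.ReadAll(src)
	return string(d), string(r), err
}

func main() {
	data, rest, _ := demo(false)
	fmt.Printf("%q %q\n", data, rest) // "tree data" "NEXT OBJECT"

	data, rest, _ = demo(true)
	fmt.Printf("%q %q\n", data, rest) // "tree data" ""
}
```

In the second run the internal buffering has swallowed the trailing bytes, which is fatal when the next pack entry starts right there.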

With all this complexity I sometimes wonder if all of this is worth it. Surely, with network bandwidth and disk space increasing as well as transparent compression (e. g. in SSH or ZFS) and transparent deduplication (e. g. in ZFS) becoming more prevalent, the advantages of implementing a delta DSL on top of zlib must warrant their complexity cost less and less.

My library is available as pkg.jfrech.com/go/git@v0.1.0†, though I make no compatibility promises regarding v0.2.0, whose release is planned. For many repos, git.ErrIncompleteImplementation will probably pop up in abundance.

[1]	[G23] GitHub: “Raising the bar for software security: GitHub 2FA begins March 13”. 2023-03-09. Online: https://github.blog/2023-03-09-raising-the-bar-for-software-security-github-2fa-begins-march-13/ [accessed 2023-09-30]
[2]	[P23] Tom Preston-Werner et al.: “Semantic Versioning 2.0.0”. 2023-01-16. Online: https://semver.org/ [accessed 2023-09-30], https://raw.githubusercontent.com/semver/semver/38a25311c9933d0fd8b9b017866b8be9a405f4ec/semver.md [accessed 2023-09-30]
[3]	[C23] Scott Chacon et al.: “git-scm.com”. 2012—2023. Online: https://git-scm.com/ [accessed 2023-09-30]
[4]	[T23b] Linus Torvalds et al.: “git/pack-write.c”. 2023-09-21. Online: https://git.kernel.org/pub/scm/git/git.git/tree/pack-write.c?id=bcb6cae2966cc407ca1afc77413b3ef11103c175 [accessed 2023-09-27]