Jonathan. Frech’s WebBlog

Giddily Gitting it wrong (#278)

Jonathan Frech

With GitHub soon closing its door for me with their ongoing 2FA fad⁠¹ and the winter semester starting tomorrow, it’s about time to yank the Joy Assembler repo out of US-controlled grasp and refactor the project into a 2023-presentable state. Fortunately, I have not made use of GitHub’s non-Git offerings (wiki, issue tracker, CI/CD, …), making the hostage rescue a simple git clone --mirror.

Skimming the Git log, however, I noticed to my dismay inconsistencies in my author and committer names and e-mail addresses: git log --pretty=format:'%an <%ae> | %cn <%ce>' | sort | uniq even reports the committer “GitHub <noreply@github.com>” for the initial commit. I find this a tad odd since it implies I only authored my project’s first commit, with GitHub immortalising its brand as the committer: meddling with the very first commit, after all, invalidates all object IDs.
Following a bettered SemVer⁠² understanding, I also want to re-tag v1.0 to v0.1.0. So fully obliterating compatibility poses no hindrance for me.
Thus, seeing October as a fresh start, I want to both move repo hosting and sanitise my repo history’s metadata, all without losing project history, which I find important to keep: bug introductions, design considerations and personal development in how to code make a repo’s history worthwhile. Defining the HEAD tree as the new initial commit, consequently, is not an option. But neither is mucking around with Git’s default CLI, as losing a commit or accidentally updating the committer date to today is too likely an outcome.
What to do?⸺Write a full Git database im­ple­men­ta­tion from scratch (at least enough to read in my repo of interest) and use said API to write a one-off tool to perform the above described facelift.
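
Why touching the very first commit invalidates every object ID can be sketched in a few lines of Go. The sketch assumes a SHA-1 repository, where an object’s ID is the SHA-1 of “<type> <length>\0” followed by its content; since each commit embeds its parent’s ID, a changed root committer ripples through all descendants. The names, dates and helper functions (objectID, commitBody, historyIDs) are hypothetical and not taken from any library.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// objectID returns the hex SHA-1 of "<type> <decimal length>\x00" + content,
// which is how Git names objects in a SHA-1 repository.
func objectID(typ string, content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", typ, len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

// commitBody assembles a simplified commit object body (no signatures,
// at most one parent).
func commitBody(tree, parent, author, committer, msg string) []byte {
	s := "tree " + tree + "\n"
	if parent != "" {
		s += "parent " + parent + "\n"
	}
	s += "author " + author + "\n"
	s += "committer " + committer + "\n\n" + msg + "\n"
	return []byte(s)
}

// historyIDs builds a two-commit history on the empty tree and returns both
// IDs. Only the root commit's committer line varies.
func historyIDs(rootCommitter string) (root, child string) {
	const author = "Jane Doe <jane@example.org> 1500000000 +0000" // hypothetical identity
	emptyTree := objectID("tree", nil)
	root = objectID("commit", commitBody(emptyTree, "", author, rootCommitter, "Initial commit"))
	child = objectID("commit", commitBody(emptyTree, root, author, author, "Second commit"))
	return root, child
}

func main() {
	for _, committer := range []string{
		"GitHub <noreply@github.com> 1500000000 +0000",
		"Jane Doe <jane@example.org> 1500000000 +0000",
	} {
		root, child := historyIDs(committer)
		fmt.Println(root, child)
	}
}
```

Both lines printed differ in both columns, although the second commit’s own metadata is identical in the two histories.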

-=-

Noteworthily, I have wished for a pure Go implementation to manipulate Git repositories for a while now; on each HTTP request, my current webgit.jfrech.com backend invokes the Git command to clone the entire repo into a temporary directory, reads the one file requested and destroys everything again. Being able to attach an "io/fs".FS directly to a repo’s commit’s tree would be a far more elegant solution.
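
To illustrate what consuming such an "io/fs".FS could look like, here is a minimal sketch. It does not touch a real repository: "testing/fstest".MapFS serves as a stand-in for a commit’s tree, and the names demoTree and readFromTree as well as the file contents are made up for the example.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// demoTree returns an in-memory file system standing in for a commit's tree.
// A real implementation would resolve paths against Git tree objects instead.
func demoTree() fs.FS {
	return fstest.MapFS{
		"README":      {Data: []byte("Joy Assembler\n")},
		"src/main.go": {Data: []byte("package main\n")},
	}
}

// readFromTree reads exactly one file from any fs.FS, without cloning or
// temporary directories.
func readFromTree(tree fs.FS, name string) (string, error) {
	b, err := fs.ReadFile(tree, name)
	return string(b), err
}

func main() {
	s, err := readFromTree(demoTree(), "src/main.go")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(s)
}
```

Anything accepting an fs.FS, for instance "net/http".FS, could then serve files straight from the tree.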
Another motivation is over 1 TiB of TAR’d bare and non-bare Git repositories which have accrued over the past years for fear of someone up to mischief stealing my credentials, my VPS provider’s data centres burning down or a careless git push --force (of course followed by an aggressive garbage collection to truly lose data). Verifying every single repo against an up-to-date state and reviewing all dangling objects is not something I look forward to doing manually.
Benefiting from the oh so often touted wonders of open-source (and sometimes even half-free to free) third-party libraries is not an option for me since i) what I truly relish is Go’s ethos bringing the UNIX way of computing into the networked world of computing. Alas, this design philosophy seems to me to be respected by few, followed by fewer. And if you have a clumsy Java-like Oracle-approved library written in Go, you don’t have a Go library. Relatedly, ii) I have ever more trouble trusting feature-bloated software, an unfortunate fate of many popular solutions, popularity correlating with being functional, usable and well-tested, all desired properties (I find Git itself to be too fat to truly trust). Speaking of which, iii) on top of everything noted, it is increasingly hard in this world to trust software at all. And if I have to read the entire library to verify it and then pin one particular version, I can just as well cut out the middleman and write it myself.

So I started out last Tuesday parsing my bare Git repository. Five days of staring at garbled binary rep­re­sent­ations later, I managed to fully parse an objects/pack/pack-*.pack file, my im­ple­men­ta­tion achieving bit-for-bit equality with Git v2.41.0’s git verify-pack -v. A bumpy ride it was:

Most disheartening was Git’s awful documentation re­gard­ing the pack­file binary format (overly superficial and in my opinion in various key points plain wrong). With git-scm.org⁠³, in parts a mirror of the official Git documentation, dominating search results, Git’s source convoluted and pack­file format intrinsics coverage sporadic, the pack­file revealed itself to me on­ly bit by bit.
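
For orientation, the least contentious part of the format can be sketched as follows: a packfile opens with a 12-byte header (the signature “PACK”, a big-endian version and a big-endian object count), and each entry begins with a variable-length header carrying a 3-bit type and the inflated size. This reflects the layout as commonly described for version 2 packs and is not code from pkg.jfrech.com/go/git; all identifiers and the hand-built demo bytes are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// packHeader mirrors the fixed 12-byte header at the start of a packfile.
type packHeader struct {
	Version uint32
	Objects uint32
}

func readPackHeader(r io.Reader) (packHeader, error) {
	var raw [12]byte
	if _, err := io.ReadFull(r, raw[:]); err != nil {
		return packHeader{}, err
	}
	if string(raw[:4]) != "PACK" {
		return packHeader{}, errors.New("bad packfile signature")
	}
	return packHeader{
		Version: binary.BigEndian.Uint32(raw[4:8]),
		Objects: binary.BigEndian.Uint32(raw[8:12]),
	}, nil
}

// readEntryHeader decodes an entry's type (1 commit, 2 tree, 3 blob, 4 tag,
// 6 offset delta, 7 reference delta) and its inflated size. The first byte
// holds the type in bits 6..4 and the low four size bits; while the top bit
// is set, each further byte contributes seven more size bits.
func readEntryHeader(r io.ByteReader) (typ byte, size uint64, err error) {
	b, err := r.ReadByte()
	if err != nil {
		return 0, 0, err
	}
	typ = (b >> 4) & 0x7
	size = uint64(b & 0x0f)
	for shift := uint(4); b&0x80 != 0; shift += 7 {
		if shift > 63 {
			return 0, 0, errors.New("entry size overflows 64 bits")
		}
		if b, err = r.ReadByte(); err != nil {
			return 0, 0, err
		}
		size |= uint64(b&0x7f) << shift
	}
	return typ, size, nil
}

// demoPack is a hand-built prefix of a pack: header (version 2, one object)
// followed by the entry header of a 300-byte blob. No compressed data follows.
func demoPack() []byte {
	return []byte{'P', 'A', 'C', 'K', 0, 0, 0, 2, 0, 0, 0, 1, 0xbc, 0x12}
}

// parseDemo reads the pack header and the first entry header from data.
func parseDemo(data []byte) (packHeader, byte, uint64, error) {
	r := bytes.NewReader(data) // implements io.Reader and io.ByteReader
	h, err := readPackHeader(r)
	if err != nil {
		return h, 0, 0, err
	}
	typ, size, err := readEntryHeader(r)
	return h, typ, size, err
}

func main() {
	h, typ, size, err := parseDemo(demoPack())
	fmt.Println(h.Version, h.Objects, typ, size, err)
}
```

The zlib-compressed object data, the delta entries and the trailing checksum are where the actual difficulty lies; none of that is covered here.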

Most eye-opening for me was the following.

The following is a list of references I got use out of. Invaluable was furthermore a sufficiently large pack­file to test hypotheses against.

I am also convinced that Git v2.41.0’s git verify-pack -s -v is bugged as its output is identical when removing the -v flag despite the manpage saying the flag had an effect.
I find it furthermore unfortunate how Go’s "compress/zlib".NewReader type-system-cir­cum­vent­ing­ly imposes the io.ByteReader interface when one does not want input bytes after the com­pressed data to mysteriously vanish. Due to a very subtle interface contract postulation that a non-nil error implies no bytes were read, an io.ByteReader cannot be constructed from an io.Reader without this construction itself retaining possible overread. Since I do not want to force in my API’s contract the use of *bufio.Reader, I feel forced to use interface‍{​io.Reader‍; io‍.​Byte​Reader‍}, which meshes poorly with the rest of io (e. g. there is no “io.TeeByteReader”).
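
The overread can be observed with a small experiment, sketched below under the assumption that a zlib stream is directly followed by unrelated bytes (as is the case inside a packfile). A *bytes.Reader implements io.ByteReader, so the decompressor reads byte by byte and the trailer survives; hiding ReadByte behind struct{ io.Reader } makes the decompressor buffer internally and swallow what follows. The helper names are made up for the example.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// deflated returns the zlib stream of payload, directly followed by trailer.
func deflated(payload, trailer string) []byte {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write([]byte(payload))
	zw.Close()
	buf.WriteString(trailer)
	return buf.Bytes()
}

// inflate decompresses exactly one zlib stream from r.
func inflate(r io.Reader) (string, error) {
	zr, err := zlib.NewReader(r)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	b, err := io.ReadAll(zr)
	return string(b), err
}

// withByteReader hands the decompressor a reader that offers ReadByte and
// returns the payload plus whatever is left in the source afterwards.
func withByteReader(payload, trailer string) (out, rest string, err error) {
	src := bytes.NewReader(deflated(payload, trailer))
	out, err = inflate(src)
	left, _ := io.ReadAll(src)
	return out, string(left), err
}

// withPlainReader hides ReadByte, so the decompressor wraps the source in its
// own buffer and may consume bytes past the end of the compressed data.
func withPlainReader(payload, trailer string) (out, rest string, err error) {
	src := bytes.NewReader(deflated(payload, trailer))
	out, err = inflate(struct{ io.Reader }{src})
	left, _ := io.ReadAll(src)
	return out, string(left), err
}

func main() {
	out, rest, err := withByteReader("hello", "TRAILER")
	fmt.Printf("ByteReader:   %q, left over %q, err %v\n", out, rest, err)
	out, rest, err = withPlainReader("hello", "TRAILER")
	fmt.Printf("plain Reader: %q, left over %q, err %v\n", out, rest, err)
}
```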

With all this complexity I sometimes wonder whether all of this is worth it. Surely, with network bandwidth and disk space increasing as well as transparent compression (e. g. in SSH or ZFS) and transparent deduplication (e. g. in ZFS) becoming more prevalent, the advantages of implementing a delta DSL on top of zlib must warrant their cost in complexity less and less.
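
For readers who have not met said delta DSL: after inflation, a delta consists of two base-128 sizes (source and target) followed by instructions which either copy a range from the base object or insert literal bytes. The following is a minimal sketch of applying one, written from the format as generally described and not taken from my library’s API; applyDelta, readSize and the hand-assembled demoDelta are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

var errBadDelta = errors.New("malformed delta")

// readSize decodes a little-endian base-128 number (seven bits per byte, top
// bit set on all but the last byte) and returns it with the remaining input.
func readSize(d []byte) (uint64, []byte, error) {
	var n uint64
	for i, shift := 0, uint(0); i < len(d) && shift < 64; i, shift = i+1, shift+7 {
		n |= uint64(d[i]&0x7f) << shift
		if d[i]&0x80 == 0 {
			return n, d[i+1:], nil
		}
	}
	return 0, nil, errBadDelta
}

// applyDelta reconstructs the target object from base and an inflated delta.
func applyDelta(base, delta []byte) ([]byte, error) {
	srcSize, delta, err := readSize(delta)
	if err != nil || srcSize != uint64(len(base)) {
		return nil, errBadDelta
	}
	dstSize, delta, err := readSize(delta)
	if err != nil {
		return nil, errBadDelta
	}
	var out []byte
	for len(delta) > 0 {
		op := delta[0]
		delta = delta[1:]
		switch {
		case op&0x80 != 0: // copy: bits 0..3 select offset bytes, bits 4..6 size bytes
			var off, size uint64
			for i := uint(0); i < 7; i++ {
				if op&(1<<i) == 0 {
					continue
				}
				if len(delta) == 0 {
					return nil, errBadDelta
				}
				if i < 4 {
					off |= uint64(delta[0]) << (8 * i)
				} else {
					size |= uint64(delta[0]) << (8 * (i - 4))
				}
				delta = delta[1:]
			}
			if size == 0 {
				size = 0x10000 // a size of zero encodes 65536
			}
			if off+size > uint64(len(base)) {
				return nil, errBadDelta
			}
			out = append(out, base[off:off+size]...)
		case op != 0: // insert: the next op bytes are literal data
			if len(delta) < int(op) {
				return nil, errBadDelta
			}
			out = append(out, delta[:op]...)
			delta = delta[op:]
		default: // instruction byte 0 is reserved
			return nil, errBadDelta
		}
	}
	if uint64(len(out)) != dstSize {
		return nil, errBadDelta
	}
	return out, nil
}

// demoDelta turns "hello world" (11 bytes) into "hello Git world" (15 bytes):
// copy 6 bytes from offset 0, insert "Git ", copy 5 bytes from offset 6.
func demoDelta() []byte {
	return []byte{0x0b, 0x0f, 0x90, 0x06, 0x04, 'G', 'i', 't', ' ', 0x91, 0x06, 0x05}
}

func main() {
	out, err := applyDelta([]byte("hello world"), demoDelta())
	fmt.Printf("%q %v\n", out, err)
}
```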

My library is avail­able as pkg.jfrech.com/go/git@v0.1.0, though I make no compatibility promises re­gard­ing v0.2.0 whose release is planned. For probably many repos, git‍.​Err​In​com​plete​Im​ple​men​ta​tion will pop up in abundance.


[1][G23] GitHub: “Raising the bar for software security: GitHub 2FA begins March 13”. 2023-03-09. Online: https://github.blog/2023-03-09-raising-the-bar-for-software-security-github-2fa-begins-march-13/ [accessed 2023-09-30]
[2][P23] Tom Preston-Werner et al.: “Semantic Versioning 2.0.0”. 2023-01-16. Online: https://semver.org/ [accessed 2023-09-30], https://raw.githubusercontent.com/semver/semver/38a25311c9933d0fd8b9b017866b8be9a405f4ec/semver.md [accessed 2023-09-30]
[3][C23] Scott Chacon et al.: “git-scm.org”. 2012–2023. Online: https://git-scm.com/ [accessed 2023-09-30]
[4][T23b] Linus Torvalds et al.: “git/pack-write.c”. 2023-09-21. Online: https://git.kernel.org/pub/scm/git/git.git/tree/pack-write.c?id=bcb6cae2966cc407ca1afc77413b3ef11103c175 [accessed 2023-09-27]