Jonathan. Frech’s WebBlog

Giddily Gitting it wrong (#278)

Jonathan Frech,

With GitHub soon closing its doors to me with their ongoing 2FA fad¹ and the winter semester starting tomorrow, it’s about time to yank the Joy Assembler repo out of US-controlled grasp and refactor the project into a 2023-presentable state. Fortunately, I have not made use of GitHub’s non-Git offerings (wiki, issue tracker, CI/CD, …), making the hostage rescue a simple git clone --mirror.

Skimming the Git log, however, I noticed to my dismay inconsistencies in my author and committer names and e-mail addresses. git log --pretty=format:'%an <%ae> | %cn <%ce>' | sort | uniq even reports the committer “GitHub <noreply@github.com>” for the initial commit. I find that a tad odd since it implies I only authored my project’s first commit, with GitHub immortalising its brand as the committer: meddling with the very first commit, after all, invalidates all object IDs.
Following a bettered SemVer⁠² un­der­stand­ing, I also want to re-tag v1.0 to v0.1.0. So fully obliterating compatibility poses no hindrance for me.
Thus, seeing October as a fresh start, I want to both move repo hosting and sanitise my repo history’s metadata. All without losing project history, which I find important to keep: bug introductions, design considerations and personal development in how to code make a repo’s history worthwhile. Defining the HEAD tree as the new initial commit, consequently, is not an option. But neither is mucking around with Git’s default CLI, as losing a commit or accidentally updating the committer date to today are too likely outcomes.
What to do?⸺Write a full Git database im­ple­men­ta­tion from scratch (at least enough to read in my repo of interest) and use said API to write a one-off tool to per­form the above described facelift.

-=-

Noteworthily, I have had the wish for a pure Go implementation to manipulate Git repositories for a while now; on each HTTP request, my current webgit.jfrech.com backend invokes the Git command to clone the entire repo into a temporary directory, reads the one file requested and destroys everything again. Being able to attach an "io/fs".FS directly to a repo’s commit’s tree would be a far more elegant solution.
Another motivation is over 1 TiB of TAR’d bare and non-bare Git repositories which accrued over the past years for fear of someone up to mischief stealing my credentials, my VPS provider’s data centres burning down or a careless git push --force (of course followed by an aggressive garbage collection to truly lose data). Verifying every single repo against an up-to-date state and reviewing all dangling objects is not something I look forward to doing manually.
Benefiting from the oh so often touted wonders of open-source (and sometimes even half-free to free) third-party libraries is not an option for me since i) what I truly relish is Go’s ethos of bringing the UNIX way of computing into the networked world. Alas, this design philosophy seems to me to be respected by few, followed by fewer. And if you have a clumsy Java-like Oracle-approved library written in Go, you don’t have a Go library. Relatedly, ii) I have ever more trouble trusting feature-bloated software, an unfortunate fate of many popular solutions (popularity correlating with being functional, usable and well-tested, all desired properties); I find Git itself to be too fat to truly trust. Speaking of which, iii) on top of everything noted, it is increasingly hard to trust software at all. And if I have to read the entire library to verify it and then pin one particular version, I can just as well cut out the middleman and write it myself.

So I started out last Tuesday parsing my bare Git re­pos­i­to­ry. Five days of staring at garbled binary rep­re­sent­ations later, I managed to fully parse an objects/pack/pack-*.pack file, my im­ple­men­ta­tion achieving bit-for-bit equality with Git v2.41.0’s git verify-pack -v. A bumpy ride it was:

Most disheartening was Git’s awful documentation regarding the packfile binary format (overly superficial and, in my opinion, plain wrong in various key points). With git-scm.com³, in parts a mirror of the official Git documentation, dominating search results, Git’s source convoluted and coverage of packfile format intrinsics sporadic, the packfile revealed itself to me only bit by bit.

Most eye-opening for me was the following.

The following is a list of references I got use out of. Invaluable was fur­ther­more a sufficiently large pack­file to test hypotheses against.

I am also convinced that Git v2.41.0’s git verify-pack -s -v is bugged as its output is identical when removing the -v flag de­spite the manpage saying the flag had an ef­fect.
I find it furthermore unfortunate how Go’s "compress/zlib".NewReader type-system-circumventingly imposes the io.ByteReader interface when one does not want input bytes after the compressed data to mysteriously vanish. Due to a very subtle interface contract postulation that a non-nil error implies no byte was consumed, an io.ByteReader cannot be constructed from an io.Reader without this construction itself retaining possible overread. Since I do not want to force the use of *bufio.Reader in my API’s contract, I feel forced to use interface{ io.Reader; io.ByteReader }, which meshes poorly with the rest of io (e. g. there is no “io.TeeByteReader”).

With all this complexity I sometimes wonder if all of this is worth it. Surely, with network bandwidth and disk space increasing as well as transparent compression (e. g. in SSH or ZFS) and transparent deduplication (e. g. in ZFS) becoming more prevalent, the advantages of implementing a delta DSL on top of zlib must warrant their cost in complexity less and less.
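For a taste of said delta DSL, here is a minimal sketch of applying one already inflated delta to its base object. It assumes the instruction encoding as Git’s pack format documentation describes it (two size varints, then copy and insert instructions); the names applyDelta and deltaSize are hypothetical and not my library’s API.

```go
package main

import (
	"errors"
	"fmt"
)

var errTruncated = errors.New("delta: truncated")

// deltaSize reads one size varint: seven bits per byte, least
// significant group first, high bit set on all but the last byte.
func deltaSize(d []byte) (size uint64, rest []byte, err error) {
	for shift := uint(0); ; shift += 7 {
		if len(d) == 0 || shift > 63 {
			return 0, nil, errors.New("delta: bad size header")
		}
		b := d[0]
		d = d[1:]
		size |= uint64(b&0x7f) << shift
		if b&0x80 == 0 {
			return size, d, nil
		}
	}
}

// applyDelta reconstructs the target object from base and delta.
func applyDelta(base, delta []byte) ([]byte, error) {
	baseSize, delta, err := deltaSize(delta)
	if err != nil {
		return nil, err
	}
	if baseSize != uint64(len(base)) {
		return nil, errors.New("delta: base size mismatch")
	}
	targetSize, delta, err := deltaSize(delta)
	if err != nil {
		return nil, err
	}

	var out []byte
	for len(delta) > 0 {
		cmd := delta[0]
		delta = delta[1:]
		switch {
		case cmd&0x80 != 0:
			// Copy from base: bits 0-3 announce offset bytes,
			// bits 4-6 announce size bytes, both little-endian.
			var off, n uint64
			for i := uint(0); i < 7; i++ {
				if cmd&(1<<i) == 0 {
					continue
				}
				if len(delta) == 0 {
					return nil, errTruncated
				}
				if i < 4 {
					off |= uint64(delta[0]) << (8 * i)
				} else {
					n |= uint64(delta[0]) << (8 * (i - 4))
				}
				delta = delta[1:]
			}
			if n == 0 {
				n = 0x10000 // a size of zero encodes 65536
			}
			if off+n > uint64(len(base)) {
				return nil, errors.New("delta: copy out of range")
			}
			out = append(out, base[off:off+n]...)
		case cmd != 0:
			// Insert the next cmd bytes literally.
			n := int(cmd)
			if len(delta) < n {
				return nil, errTruncated
			}
			out = append(out, delta[:n]...)
			delta = delta[n:]
		default:
			return nil, errors.New("delta: reserved instruction 0")
		}
	}
	if uint64(len(out)) != targetSize {
		return nil, errors.New("delta: target size mismatch")
	}
	return out, nil
}

func main() {
	base := []byte("hello world")
	// Copy "hello", insert ", Go", copy " world".
	delta := []byte{0x0b, 0x0f, 0x90, 0x05, 0x04, ',', ' ', 'G', 'o', 0x91, 0x05, 0x06}
	out, err := applyDelta(base, delta)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s\n", out)
}
```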

My library is available as pkg.jfrech.com/go/git@v0.1.0, though I make no compatibility promises regarding v0.2.0, whose release is planned. For probably many repos, git.ErrIncompleteImplementation will pop up in abundance.


[1][G23] Git­Hub: “Raising the bar for soft­ware security: Git­Hub 2FA begins March 13”. 2023-03-09. Online: https://github.blog/2023-03-09-raising-the-bar-for-software-security-github-2fa-begins-march-13/ [ac­cess­ed 2023-09-30]
[2][P23] Tom Preston-Werner et al.: “Se­man­tic Versioning 2.0.0”. 2023-01-16. Online: https://semver.org/ [ac­cess­ed 2023-09-30], https://raw.githubusercontent.com/semver/semver/38a25311c9933d0fd8b9b017866b8be9a405f4ec/semver.md [ac­cess­ed 2023-09-30]
[3][C23] Scott Chacon et al.: “git-scm.com”. 2012–2023. Online: https://git-scm.com/ [accessed 2023-09-30]
[4][T23b] Linus Torvalds et al.: “git/pack-write.c”. 2023-09-21. Online: https://git.kernel.org/pub/scm/git/git.git/tree/pack-write.c?id=bcb6cae2966cc407ca1afc77413b3ef11103c175 [ac­cess­ed 2023-09-27]