Git repository history visualized in tig

Adventures in Monorepo Rewriting

August 3rd, 2024

#git
#software-development

TL;DR

I used git-filter-repo to reorganize three packages, in two different monorepos, into a new monorepo while maintaining the commit history in the new repository.

The Setup

Way back when I first started working on github.com/32bitkid/sci.js, I realized pretty early that I needed a BitReader even to get started working on the asset decompression and parsing. I didn't really see an existing one that I liked, so I wrote one. I knew relatively early on that a monorepo was probably going to be a good idea, both for some general separation of concerns and for the ability to rapidly prototype ideas. I didn't really have a better home for my BitReader, so I just threw it in with the rest of the SCI0 stuff…

Time Goes On…

Later, while working on sci.js, I wanted a fast, ring-buffer, TypedArray-backed deque. Again, I looked around and didn't see anything that really fit what I wanted from the interface. So, again, I just built one and threw it in with the rest of the SCI0 stuff. But it definitely was a head-tilting moment. This was the second time I had shoved a more general-purpose data structure into a domain-specific monorepo. I decided it was okay for now and refocused on the scope of the overall project.

But… there was something brewing.

Third Time's the Charm!

Fast-forward a few months, and I was working on something totally unrelated: a quadtree and ennetree. For a hot minute, I thought that maybe a dense, resizable float64 vector might outperform the native array. I thought about using the Vector class from mnemonist, but I wanted to be able to snag quadruples of floats from the backing TypedArray with subarray(), something mnemonist doesn't support¹.

Long story short, I wrote it and wired it in. When I ran the benchmarks, it didn't perform any better. It wasn't terribly worse either, but it was way more complex, with a lot more moving parts. So, I took it out of quadtree/ennetree. However, it was still useful. It wasn't flawed; it just didn't provide the performance benefits that I was hoping for. But now I had a bit of a problem…

The Problem

I had this BitReader class and Deque class in one monorepo, and a Vector class in another monorepo. None of them really belonged in their parent repositories. They actually had more in common with each other than with the repos they were stored in.

But I also didn't want to lose all the commit history and context that I'd built up in those repos. I really wanted to snatch those three packages from their respective repos, and plop them in a new repo, where I could develop and document them in tandem. But how to do this?

The Process

We probably could go through and try to manually rewrite history with git rebase -i, but given that these were all from monorepos, there was a lot of cross-package development mixed into the commits. I wasn't, perhaps, as disciplined as I should have been when crafting my commits. So, I pretty much knew that wasn't going to be an option. Luckily, there is a tool just for stuff like this: git-filter-repo.

Checking out the documentation, the --path option, which limits the rewritten history to the paths you name, sounded like exactly what I was looking for!

Let's say that we have two monorepos, repo1 and repo2, and I want to make a new repo3 that contains both libs/package-a from repo1 and libs/package-b from repo2.

Step 1: Rewrite Repo1

Clone the repo1 repository².

cd ~/src
git clone repo1 repo1-package-a

cd ~/src/repo1-package-a
git checkout -b only-package-a
git filter-repo --path libs/package-a

Your branch should now contain only the commits that touch libs/package-a.
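If you want to double-check that claim, here is a minimal sketch using plain git. The paths_outside helper is a hypothetical name of my own, and the throwaway repository (including apps/other) is made up for illustration. Because the demo repo has not been filtered, it reports one leftover path; run against a filtered clone, the same check should report none.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: list every path touched anywhere in the history of
# the current repository that does NOT live under the given prefix.
paths_outside() {
  prefix="$1"
  git log --all --pretty=format: --name-only \
    | sort -u \
    | awk -v p="$prefix/" 'NF && index($0, p) != 1'
}

# Throwaway, unfiltered repo so the sketch runs without git-filter-repo.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

mkdir -p libs/package-a apps/other
echo a > libs/package-a/index.js
echo b > apps/other/main.js
git add .
git commit -q -m "Add package-a and an unrelated app"

# An empty result would mean the history only touches libs/package-a.
leftovers=$(paths_outside libs/package-a)
echo "paths outside libs/package-a: ${leftovers:-none}"
```

In a real run you would skip the throwaway setup and just call the helper inside ~/src/repo1-package-a after filtering.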

Step 2: Rewrite Repo2

Repeat for the repo2 repository².

cd ~/src
git clone repo2 repo2-package-b

cd ~/src/repo2-package-b
git checkout -b only-package-b
git filter-repo --path libs/package-b

Step 3: The Merge

We should now have two repositories with only the commits that we care about for package-a and package-b. Now we need to create a new repo and merge these disparate histories together. So let's make a new folder and initialize a git repository.

mkdir ~/src/repo3
cd ~/src/repo3
git init

Let's just create an initial commit for this new repo by adding a README.md:

echo "Monorepo for package-a and package-b" > README.md
git add README.md
git commit -m 'Initial commit'

Now, we can add those two other repos with the history that we want as remotes.

git remote add repo1-package-a ../repo1-package-a
git remote add repo2-package-b ../repo2-package-b
git remote update

Next we need to merge in those two branches from the repositories we created earlier. Because our current branch and the other branches we want to merge come from entirely separate commit histories, we need to use the --allow-unrelated-histories option to merge.

git merge --allow-unrelated-histories repo1-package-a/only-package-a
git merge --allow-unrelated-histories repo2-package-b/only-package-b

And blammo! We should have our two packages now in the same repo, with their respective histories intact³.
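To see the whole merge step end to end without touching real repositories, here is a small self-contained sketch. The make_repo helper, the commit messages, the combined branch name, and the temp-directory layout are all my own stand-ins, not part of the original process; each stand-in repo has a single commit instead of a filtered history. I also pass --no-edit so the merges don't open an editor.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Hypothetical helper: create a one-commit repo at $1 containing file $2,
# with commit message $3, on a branch named $4.
make_repo() {
  mkdir -p "$1/$(dirname "$2")"
  git -C "$1" init -q
  git -C "$1" config user.name "Example"
  git -C "$1" config user.email "example@example.com"
  echo "placeholder" > "$1/$2"
  git -C "$1" add .
  git -C "$1" commit -q -m "$3"
  git -C "$1" checkout -q -b "$4"
}

# Stand-ins for the two filtered clones and the new repository.
make_repo "$work/repo1-package-a" libs/package-a/index.js "Add package-a" only-package-a
make_repo "$work/repo2-package-b" libs/package-b/index.js "Add package-b" only-package-b
make_repo "$work/repo3" README.md "Initial commit" combined

cd "$work/repo3"
git remote add repo1-package-a ../repo1-package-a
git remote add repo2-package-b ../repo2-package-b
git remote update >/dev/null 2>&1

# Merge the two unrelated histories into the new repo.
git merge -q --no-edit --allow-unrelated-histories repo1-package-a/only-package-a
git merge -q --no-edit --allow-unrelated-histories repo2-package-b/only-package-b

# Each package's history should still be reachable by path.
history_a=$(git log --format=%s -- libs/package-a)
history_b=$(git log --format=%s -- libs/package-b)
total=$(git rev-list --count HEAD)
echo "package-a: $history_a"
echo "package-b: $history_b"
echo "commits on HEAD: $total"
```

The path-limited git log at the end is the same check I would run in the real repo3 to confirm that each package brought its commits along.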

Step 4: The Cleanup

Now, we can remove those temp remotes:

git remote remove repo1-package-a
git remote remove repo2-package-b

Then delete the temporary repo1-package-a and repo2-package-b clones. Finally, you'll probably want to remove the package folders from repo1 and repo2, respectively.
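Removing a migrated package from its old home can be an ordinary commit, with no history rewriting needed on that side. A rough sketch, assuming a throwaway stand-in for repo1 (the libs/package-c sibling and the commit messages are invented for the demo):

```shell
#!/bin/sh
set -eu

# Throwaway stand-in for repo1; in practice only the `git rm` and the
# following `git commit` would be run, inside the real repository.
repo1=$(mktemp -d)
cd "$repo1"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

mkdir -p libs/package-a libs/package-c
echo a > libs/package-a/index.js
echo c > libs/package-c/index.js
git add .
git commit -q -m "Add packages"

# Drop the migrated package as a normal commit; older commits keep it.
git rm -r -q libs/package-a
git commit -q -m "Remove package-a (moved to repo3)"

remaining=$(git ls-files)
commits=$(git rev-list --count HEAD)
echo "tracked files: $remaining"
echo "commits: $commits"
```

This leaves the old repository's history untouched, so the package is still visible in earlier commits there.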

Conclusion

You probably won't ever need to do this, especially if you spend a little more time on planning than I did. But if you do, just remember git is a really powerful tool! You can more-or-less coerce it into doing almost anything you want, given enough effort. Even in times like this, where git rebase -i is insufficient, a myriad of alternatives and other approaches exist. Sometimes, you just have to figure out how to express what you want.

You can find my result of this process at github.com/32bitkid/4bitlabs.bits.

¹ Total aside: it's kinda obvious why mnemonist doesn't support doing this. Handing out array pointers into a resizable backing array, which may no longer be valid once the vector is mutated, is demonstrably a BadIdea™. I ended up trashing this idea from my own implementation before publishing it.
² You don't need to clone the repo for this to work; you can do it in the main repo on another branch. But whenever I do this kind of relatively hardcore history rewriting, I find that it's often easier and safer to do it on a clone of the repo.
³ Depending on how your repos are set up, and what your process is, you might need to clean up tags for unrelated histories. But if you don't use tags, then you can ignore this.