Cool discovery but the article could have been about 1/10 as long and still communicated effectively. At least they didn't post it as a video, so it was easy to skim to the important details.
I'm confused why you wouldn't simply snapshot the block-level device if the protocol of the information on top is going to cause this much headache. Quiescing git operations for block level activity is probably not trivial, but it sounds like an easier problem to solve to me.
This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.
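To make the WAL point concrete, here is a minimal sketch, assuming the better-sqlite3 package and made-up paths rather than anything from the comment above:

    // Enable WAL so a block-level snapshot taken mid-write still yields a
    // consistent database: committed data lives in app.db plus app.db-wal,
    // and SQLite replays the WAL on the next open.
    import Database from "better-sqlite3";

    const db = new Database("/data/app.db");   // hypothetical path
    db.pragma("journal_mode = WAL");
    db.pragma("synchronous = NORMAL");          // common pairing with WAL

With that in place, a VM or ZFS snapshot of /data every X minutes is effectively crash-consistent: the last few transactions may be lost with synchronous = NORMAL, but the database file itself stays intact, with no need to quiesce the application.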
This feels like some of the problems I’ve ended up tackling. The people too close to the problem don’t see it so someone has to reach across the aisle and make something happen in code they normally wouldn’t touch.
Even with the quadratic complexity I suspect this problem has been brewing for a while. This commit has been biding its time for 16 years waiting to ruin someone’s day (let’s not confuse the utility of Make it Work early in a project with justifying retaining that code in perpetuity. Make it Right, Make it Fast.) Backups have probably been going over six hours for years and steadily climbing.
“We” feels a lot like one or two people jumping on a grenade.
There seems to be a lesson here about the balance between premature vs. anticipatory optimization. We’re generally warned against premature optimization but perhaps, as a rule of thumb, we should look for optimizations in frequently-called functions that are obvious and not onerous to implement.
I used a hash to begin with, applying a simplistic digest function to the message headers I was comparing, which got me a 4-byte hash key.
That worked, but was kind of slow.
Finally, the idea came to me to not apply _any_ digesting, and just use the concatenated headers of the messages as hash keys. About 2k bytes per hash key!
The result: About 20x perf improvement if memory serves.
How is that possible? The reason is that the code all runs in a Javascript machine; and applying the digest was not a built-in function, it was looping over the headers and doing the arithmetic. Thousands upon thousands of JS abstract machine steps. The use of the large hash key may be inefficient, but - it's just one JS object / dictionary operation, and one of the most heavily-optimized in any implementation.
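As a rough illustration of the two approaches described above (the names and the digest function are invented here, this is not the actual add-on code):

    // Approach 1: hand-rolled 4-byte digest, computed step by step in JS.
    function digest(headers: string): number {
      let h = 0;
      for (let i = 0; i < headers.length; i++) {
        h = (h * 31 + headers.charCodeAt(i)) >>> 0; // thousands of interpreter/JIT steps
      }
      return h;
    }

    // Approach 2: skip the digest and use the whole ~2 KB concatenated
    // header string as the key; the engine's native string hashing does
    // the work in one heavily optimized operation.
    const seen = new Set<string>();

    function isDuplicate(headers: string): boolean {
      if (seen.has(headers)) return true;
      seen.add(headers);
      return false;
    }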
A very good example that writing code in C doesn't help for performance, when the algorithms or data structures aren't properly taken into consideration.
> Ultimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially.
I have yet to finish the article, but this means they improved the complexity to something like O(log N), right? I hate when people confuse quadratic improvements with exponential ones.
A lot of comments are complaining that going from O(n^2) to O(log n) is not an exponential improvement, but it is indeed one. In fact, O(n) is exponentially more than O(log n).
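For anyone who wants the arithmetic behind that claim (my framing, not the commenter's), write m = log_2 n; then

    n = 2^{m}, \qquad n^{2} = 2^{2m},

so the quadratic cost is exponential in the logarithmic one, and replacing an n^2 term with log n removes an exponentially large factor.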
One of the many reasons I moved to self hosting. I use ZFS to back up every 15 minutes, ... could do it even more frequently but that seems a little pointless.
Also moved away from Gitlab because it's so damn slow.
> What this means for GitLab customers — [a bunch of stuff about how customers can now back up more frequently and more robustly]
Realtalk: They should rewrite this post's headline to be in a positive tense instead of leading with a negative word. I'm glad I read the post, because it is a cool and good fix, but I saw “Decreasing […] repo backup” and my first thought was that it was an announcement of some service downgrade like some sort of cost-cutting measure.
Here comes an unpopular nitpick: "... we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially."
Uh no you didn't. Not possible. At most a polynomial reduction is possible else complexity theory needs a re-write.
(OK, yes, k could be doing some heavy lifting here, but I doubt it.)
If you are going to quote a maths formula then please don't use "exponentially" to mean "lots".
I stopped reading there: I don't want to have to read each word and wonder if they actually meant it, or it's just bad hype.
…and they say leetcode-style problems are out of date.
While perhaps implementing low level sorts is silly, knowing which data structures to use and when, and what underlying big-o performance they have, is critical, as demonstrated here.
What a poor write-up, neither defining the problem nor the solution with any clarity. Did anyone come away with any information that can be used for anything?
TLDR if you only add objects to an array that are already not contained in said array you won't have to iterate back through it to remove the duplicates you created.
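A sketch of the pattern the TLDR describes, in TypeScript rather than git's C, and not the actual git change:

    // Before: append blindly, then scrub duplicates with a nested scan.
    // The indexOf() inside filter() rescans the array, so this is O(n^2).
    function addAllQuadratic(items: string[]): string[] {
      const out: string[] = [];
      for (const item of items) {
        out.push(item); // may push duplicates
      }
      return out.filter((x, i) => out.indexOf(x) === i); // O(n^2) dedup pass
    }

    // After: check membership before appending, so no dedup pass is needed.
    // The Set answers "seen already?" in O(1) expected time, giving O(n) overall.
    function addAllLinear(items: string[]): string[] {
      const seen = new Set<string>();
      const out: string[] = [];
      for (const item of items) {
        if (!seen.has(item)) {
          seen.add(item);
          out.push(item);
        }
      }
      return out;
    }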
Wild that a pattern like this would be part of git-core, but I guess we all overlook stuff on a regular basis.
Well done, next maybe make the web ui usable because right now there’s absolutely no distinction between the UI itself and the user content, which IMHO combined with action buttons in the middle of the page makes for a really poor ux.
Are there any reimplementations of git, by professional programmers using real tools? The source in question — object.c — is "banging rocks together" material.
https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
I don’t write exotic algorithms, but it’s always astounding how small n needs to be to become observably problematic.
Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?
If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.
https://addons.thunderbird.net/en-US/thunderbird/addon/remov...
The most troublesome cases of inefficiency in production are almost always O(n^2).
Quadratic complexity sits in an awkward sweet spot: Fast enough for medium-sized n to pass first QA, but doomed to fail eventually as n grows.