How we decreased GitLab repo backup times from 48 hours to 41 minutes

(about.gitlab.com)

Comments

divbzero 6 June 2025
The performance improvement that GitLab contributed to Git is slated to be released with v2.50.0:

https://github.com/git/git/commit/bb74c0abbc31da35be52999569...

LgWoodenBadger 6 June 2025
IME, it has always turned out to be the correct decision to eliminate any n^2 operation in anything I’ve written.

I don’t write exotic algorithms, but it’s always astounding how small n needs to be before it becomes observably problematic.

SoftTalker 6 June 2025
Cool discovery but the article could have been about 1/10 as long and still communicated effectively. At least they didn't post it as a video, so it was easy to skim to the important details.
Arcuru 6 June 2025
48 hours is a crazy amount of time to spend just to compress a git folder that's only a couple GB. 41 minutes still seems like quite a long time.

Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?

hiddew 6 June 2025
"fixed it with an algorithmic change, reducing backup times exponentially"

If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.

bob1029 6 June 2025
I'm confused why you wouldn't simply snapshot the block-level device if the protocol of the information on top is going to cause this much headache. Quiescing git operations for block level activity is probably not trivial, but it sounds like an easier problem to solve to me.

This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.
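
Turning on WAL is a one-liner; a rough sketch with the sqlite3 C API (the database path is just a placeholder):

```c
/* Rough sketch: switch a database to WAL mode. In WAL mode writes land in
 * the -wal file first, so the main database file stays consistent and a
 * point-in-time snapshot of db + -wal is recoverable. "app.db" is a
 * placeholder path. */
#include <sqlite3.h>
#include <stdio.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("app.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    char *err = NULL;
    if (sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "pragma failed: %s\n", err);
        sqlite3_free(err);
        sqlite3_close(db);
        return 1;
    }

    sqlite3_close(db);
    return 0;
}
```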

hinkley 7 June 2025
This feels like some of the problems I’ve ended up tackling. The people too close to the problem don’t see it, so someone has to reach across the aisle and make something happen in code they normally wouldn’t touch.

Even with the quadratic complexity I suspect this problem has been brewing for a while. This commit has been biding its time for 16 years, waiting to ruin someone’s day (let’s not confuse the utility of Make it Work early in a project with a justification for retaining that code in perpetuity: Make it Right, Make it Fast). Backups have probably been going over six hours for years and steadily climbing.

“We” feels a lot like one or two people jumping on a grenade.

divbzero 6 June 2025
There seems to be a lesson here about the balance between premature and anticipatory optimization. We’re generally warned against premature optimization, but perhaps, as a rule of thumb, we should look for optimizations in frequently-called functions when they are obvious and not onerous to implement.
einpoklum 6 June 2025
I had a somewhat similar experience when writing a "remove duplicates" extension for Thunderbird:

https://addons.thunderbird.net/en-US/thunderbird/addon/remov...

I used a hash to begin with, applying a simplistic digest function to the message headers I was comparing, which got me a 4-byte hash key.

That worked, but was kind of slow.

Finally, the idea came to me to not apply _any_ digesting, and to use the concatenated headers of the messages as hash keys. About 2k bytes per hash key!

The result: About 20x perf improvement if memory serves.

How is that possible? The reason is that the code all runs in a JavaScript machine, and applying the digest was not a built-in function: it meant looping over the headers and doing the arithmetic, thousands upon thousands of JS abstract machine steps. The use of the large hash key may be inefficient, but it's just one JS object / dictionary operation, and one of the most heavily-optimized in any implementation.

gre 6 June 2025
looks like the commit is here: https://github.com/git/git/commit/a52d459e72b890c192485002ec...
pjmlp 6 June 2025
A very good example that writing code in C doesn't help performance when the algorithms or data structures aren't properly taken into consideration.
ashishb 6 June 2025
O(n^2) is fast enough to end up in production and slow enough to cause problems at scale.

The most troublesome cases of inefficiency in production are almost always O(n^2).

edflsafoiewq 6 June 2025
So they replaced a nested for loop for checking for duplicates with a set data structure. Surprised such a common thing was in git.
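Not the actual git patch, but the shape of the change looks roughly like this: the quadratic version rescans everything kept so far for every new element, while the second version asks a (toy, fixed-size) hash set whether we've already seen it.

```c
/* Illustrative sketch only, not git's code: deduplicating a list of
 * object IDs. dedup_quadratic() does O(n) scans per element; dedup_hashed()
 * does one expected-O(1) hash-set lookup per element. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define N 5
#define TABLE_SIZE 64   /* power of two, comfortably larger than N */

/* O(n^2): for each id, scan everything already kept. */
static size_t dedup_quadratic(const uint32_t *in, size_t n, uint32_t *out) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        bool dup = false;
        for (size_t j = 0; j < kept; j++)
            if (out[j] == in[i]) { dup = true; break; }
        if (!dup)
            out[kept++] = in[i];
    }
    return kept;
}

/* Expected O(n): a toy open-addressing hash set remembers what we've seen. */
static size_t dedup_hashed(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t table[TABLE_SIZE];
    bool used[TABLE_SIZE] = {false};
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t h = (in[i] * 2654435761u) & (TABLE_SIZE - 1);
        while (used[h] && table[h] != in[i])
            h = (h + 1) & (TABLE_SIZE - 1);
        if (used[h])
            continue;            /* already seen: skip */
        used[h] = true;
        table[h] = in[i];
        out[kept++] = in[i];
    }
    return kept;
}

int main(void) {
    uint32_t ids[N] = {7, 3, 7, 9, 3};
    uint32_t out[N];
    printf("quadratic kept %zu\n", dedup_quadratic(ids, N, out));
    printf("hashed    kept %zu\n", dedup_hashed(ids, N, out));
    return 0;
}
```

The real fix presumably uses git's own hash-map helpers rather than a toy table like this, but the before/after asymptotics are the same.
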
SkiFire13 7 June 2025
> Ultimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially.

I have yet to finish the article, but this means they improved the complexity to something like O(log N), right? I hate when people confuse a quadratic improvement with an exponential one.

judofyr 6 June 2025
See also: https://www.tumblr.com/accidentallyquadratic

Quadratic complexity sits in an awkward sweet spot: Fast enough for medium-sized n to pass first QA, but doomed to fail eventually as n grows.

kristianp 6 June 2025
It would have been nice if the post included some details, such as before and after code.
abhashanand1501 8 June 2025
Lots of comments are complaining that going from O(n^2) to O(log n) is not an exponential improvement, but it is indeed an exponential improvement. In fact, even O(n) is exponentially more than O(log n): n = 2^(log2 n), so the old bound is an exponential function of the new one.
tomxor 6 June 2025
One of the many reasons I moved to self hosting. I use ZFS to back up every 15 minutes, ... could do it even more frequently but that seems a little pointless.

Also moved away from Gitlab because it's so damn slow.

Lammy 6 June 2025
> What this means for GitLab customers — [a bunch of stuff about how customers can now back up more frequently and more robustly]

Realtalk: They should rewrite this post's headline to lead with a positive framing instead of a negative word. I'm glad I read the post, because it is a cool and good fix, but I saw “Decreasing […] repo backup” and my first thought was that it was an announcement of some service downgrade, like a cost-cutting measure.

ianks 7 June 2025
Reminds me of the timeless advice I got for acing leet code interviews: “remember to use a hash map.” Tends to work out pretty well.
SomaticPirate 6 June 2025
How was the flame graph created? (Not very familiar with C and the performance tools around it)
KETpXDDzR 7 June 2025
With git it should be straightforward to implement an incremental, deduplicated backup solution, since all objects are stored with their hashes as filenames.
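Roughly the idea (a POSIX-only sketch over loose objects; the paths are placeholders, and a real backup also has to deal with packfiles, refs, and objects being written while you copy):

```c
/* Rough sketch: copy loose objects out of .git/objects into a backup
 * directory, skipping anything already there. Paths are placeholders. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void copy_file(const char *src, const char *dst) {
    FILE *in = fopen(src, "rb");
    FILE *out = in ? fopen(dst, "wb") : NULL;
    char buf[8192];
    size_t n;
    if (in && out)
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);
    if (in) fclose(in);
    if (out) fclose(out);
}

int main(void) {
    const char *src_root = ".git/objects";     /* placeholder */
    const char *dst_root = "/backup/objects";  /* placeholder */
    DIR *fanout = opendir(src_root);
    if (!fanout) return 1;

    struct dirent *d;
    while ((d = readdir(fanout)) != NULL) {
        /* loose objects live under two-hex-char fan-out directories */
        if (d->d_name[0] == '.' || strlen(d->d_name) != 2) continue;

        char src_dir[4096], dst_dir[4096];
        snprintf(src_dir, sizeof src_dir, "%s/%s", src_root, d->d_name);
        snprintf(dst_dir, sizeof dst_dir, "%s/%s", dst_root, d->d_name);
        mkdir(dst_dir, 0755);  /* fine if it already exists */

        DIR *objs = opendir(src_dir);
        if (!objs) continue;
        struct dirent *o;
        while ((o = readdir(objs)) != NULL) {
            if (o->d_name[0] == '.') continue;
            char src[4200], dst[4200];
            snprintf(src, sizeof src, "%s/%s", src_dir, o->d_name);
            snprintf(dst, sizeof dst, "%s/%s", dst_dir, o->d_name);
            struct stat st;
            if (stat(dst, &st) == 0) continue;  /* already backed up */
            copy_file(src, dst);
        }
        closedir(objs);
    }
    closedir(fanout);
    return 0;
}
```

Because the filenames are content hashes, "already exists" really does mean "identical content", which is what makes the incremental part cheap.
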
katella 7 June 2025
I don't know much about git backups, but why would there be race conditions if you just create the backup off the local repo instead of the remote?
DeepYogurt 7 June 2025
I miss a good tech blog post. Cheers for this
DamonHD 6 June 2025
Here comes an unpopular nitpick: "... we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially."

Uh, no you didn't. Not possible. At most a polynomial reduction is possible, else complexity theory needs a rewrite.

(OK, yes, the constant k could be doing some heavy lifting here, but I doubt it.)

If you are going to quote a maths formula then please don't use "exponentially" to mean "lots".

I stopped reading there: I don't want to have to read each word and wonder whether they actually meant it or it's just bad hype.

gbacon 8 June 2025
Algorithmic improvements beat micro-optimizations.
sneak 6 June 2025
…and they say leetcode-style problems are out of date.

While perhaps implementing low-level sorts is silly, knowing which data structures to use and when, and what underlying big-O performance they have, is critical, as demonstrated here.

iamcreasy 6 June 2025
Cool! Please post another flame graph after applying the patch.
jas39 7 June 2025
What a poor write-up, neither defining the problem nor the solution with any clarity. Did anyone come away with any information that can be used for anything?
niffydroid 6 June 2025
Why use this method over just having a script do a clone or pull elsewhere? Git is about distribution, right?
James_K 8 June 2025
Perhaps nested for loops should be some kind of compiler warning.
fHr 7 June 2025
absolute chads, nice!
tcdent 6 June 2025
TLDR: if you only add objects to an array when they are not already contained in said array, you won't have to iterate back through it to remove the duplicates you created.

wild that a pattern like this would be part of git-core, but I guess we all overlook stuff on a regular basis

prymitive 6 June 2025
Well done. Next, maybe make the web UI usable: right now there’s absolutely no distinction between the UI itself and the user content, which, combined with action buttons in the middle of the page, IMHO makes for a really poor UX.
jeffbee 6 June 2025
Are there any reimplementations of git, by professional programmers using real tools? The source in question — object.c — is "banging rocks together" material.