Cool discovery but the article could have been about 1/10 as long and still communicated effectively. At least they didn't post it as a video, so it was easy to skim to the important details.
I'm confused why you wouldn't simply snapshot the block-level device if the protocol of the information on top is going to cause this much headache. Quiescing git operations for block level activity is probably not trivial, but it sounds like an easier problem to solve to me.
This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.
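To make the WAL point concrete, here is a minimal sketch, assuming the better-sqlite3 package and made-up paths rather than anything from the comment above:

    // Enable WAL so a block-level snapshot taken mid-write still yields a
    // consistent database: committed data lives in app.db plus app.db-wal,
    // and SQLite replays the WAL on the next open.
    import Database from "better-sqlite3";

    const db = new Database("/data/app.db");   // hypothetical path
    db.pragma("journal_mode = WAL");
    db.pragma("synchronous = NORMAL");          // common pairing with WAL

With that in place, a VM or ZFS snapshot of /data every X minutes is effectively crash-consistent: the last few transactions may be lost with synchronous = NORMAL, but the database file itself stays intact, with no need to quiesce the application.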
This feels like some of the problems I’ve ended up tackling. The people too close to the problem don’t see it so someone has to reach across the aisle and make something happen in code they normally wouldn’t touch.
Even with the quadratic complexity I suspect this problem has been brewing for a while. This commit has been biding its time for 16 years waiting to ruin someone’s day (let’s not confuse the utility of Make it Work early in a project with justifying retaining that code in perpetuity. Make it Right, Make it Fast.) Backups have probably been going over six hours for years and steadily climbing.
“We” feels a lot like one or two people jumping on a grenade.
There seems to be a lesson here about the balance between premature vs. anticipatory optimization. We’re generally warned against premature optimization but perhaps, as a rule of thumb, we should look for optimizations in frequently-called functions that are obvious and not onerous to implement.
I used a hash to begin with, applying a simplistic digest function to the message headers I was comparing, which got me a 4-byte hash key.
That worked, but was kind of slow.
Finally, the idea came to me to not apply _any_ digesting, and just use the concatenated headers of the messages as hash keys. About 2k bytes per hash key!
The result: About 20x perf improvement if memory serves.
How is that possible? The reason is that the code all runs in a Javascript machine; and applying the digest was not a built-in function, it was looping over the headers and doing the arithmetic. Thousands upon thousands of JS abstract machine steps. The use of the large hash key may be inefficient, but - it's just one JS object / dictionary operation, and one of the most heavily-optimized in any implementation.
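As a rough illustration of the two approaches described above (the names and the digest function are invented here, this is not the actual add-on code):

    // Approach 1: hand-rolled 4-byte digest, computed step by step in JS.
    function digest(headers: string): number {
      let h = 0;
      for (let i = 0; i < headers.length; i++) {
        h = (h * 31 + headers.charCodeAt(i)) >>> 0; // thousands of interpreter/JIT steps
      }
      return h;
    }

    // Approach 2: skip the digest and use the whole ~2 KB concatenated
    // header string as the key; the engine's native string hashing does
    // the work in one heavily optimized operation.
    const seen = new Set<string>();

    function isDuplicate(headers: string): boolean {
      if (seen.has(headers)) return true;
      seen.add(headers);
      return false;
    }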
A very good example that writing code in C doesn't help for performance, when the algorithms or data structures aren't properly taken into consideration.
> Ultimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially.
I have yet to finish the article, but this means they improved the complexity to something like O(log N), right? I hate when people confuse quadratic improvements with exponential ones.
A lot of comments are complaining that going from O(n^2) to O(log n) is not an exponential improvement, but it is indeed one. In fact, O(n) is exponentially more than O(log n).
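For anyone who wants the arithmetic behind that claim (my framing, not the commenter's), write m = log_2 n; then

    n = 2^{m}, \qquad n^{2} = 2^{2m},

so the quadratic cost is exponential in the logarithmic one, and replacing an n^2 term with log n removes an exponentially large factor.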
One of the many reasons I moved to self hosting. I use ZFS to back up every 15 minutes, ... could do it even more frequently but that seems a little pointless.
Also moved away from Gitlab because it's so damn slow.
> What this means for GitLab customers — [a bunch of stuff about how customers can now back up more frequently and more robustly]
Realtalk: They should rewrite this post's headline to be in a positive tense instead of leading with a negative word. I'm glad I read the post, because it is a cool and good fix, but I saw “Decreasing […] repo backup” and my first thought was that it was an announcement of some service downgrade like some sort of cost-cutting measure.
Here comes an unpopular nitpick: "... we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially."
Uh no you didn't. Not possible. At most a polynomial reduction is possible else complexity theory needs a re-write.
(OK, yes, k could be doing some heavy lifting here, but I doubt it.)
If you are going to quote a maths formula then please don't use "exponentially" to mean "lots".
I stopped reading there: I don't want to have to read each word and wonder if they actually meant it, or it's just bad hype.
…and they say leetcode-style problems are out of date.
While perhaps implementing low level sorts is silly, knowing which data structures to use and when, and what underlying big-o performance they have, is critical, as demonstrated here.
What a poor write-up, neither defining the problem nor the solution with any clarity. Did anyone come away with any information that can be used for anything?
TLDR if you only add objects to an array that are already not contained in said array you won't have to iterate back through it to remove the duplicates you created.
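A sketch of the pattern the TLDR describes, in TypeScript rather than git's C, and not the actual git change:

    // Before: append blindly, then scrub duplicates with a nested scan.
    // The indexOf() inside filter() rescans the array, so this is O(n^2).
    function addAllQuadratic(items: string[]): string[] {
      const out: string[] = [];
      for (const item of items) {
        out.push(item); // may push duplicates
      }
      return out.filter((x, i) => out.indexOf(x) === i); // O(n^2) dedup pass
    }

    // After: check membership before appending, so no dedup pass is needed.
    // The Set answers "seen already?" in O(1) expected time, giving O(n) overall.
    function addAllLinear(items: string[]): string[] {
      const seen = new Set<string>();
      const out: string[] = [];
      for (const item of items) {
        if (!seen.has(item)) {
          seen.add(item);
          out.push(item);
        }
      }
      return out;
    }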
Wild that a pattern like this would be part of git-core, but I guess we all overlook stuff on a regular basis.
Well done, next maybe make the web ui usable because right now there’s absolutely no distinction between the UI itself and the user content, which IMHO combined with action buttons in the middle of the page makes for a really poor ux.
Are there any reimplementations of git, by professional programmers using real tools? The source in question — object.c — is "banging rocks together" material.
https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
I don’t write exotic algorithms, but it’s always astounding how small n needs to be to become observably problematic.
Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?
If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.
https://addons.thunderbird.net/en-US/thunderbird/addon/remov...
The most troublesome cases of inefficiency in production are almost always O(n^2).
Quadratic complexity sits in an awkward sweet spot: Fast enough for medium-sized n to pass first QA, but doomed to fail eventually as n grows.