I was at PGConf.dev last week, and this was one of the most packed talks: a great insight into using Postgres at a conference that is otherwise fairly inward-facing, with talks about the development of Postgres itself (of all the Postgres conferences each year, pgconf.dev is very much the internals-focused one).
What you have to remember is that for many teams, when their product takes off, they are not equipped with deep internal knowledge of how to scale a particular part of their stack. This was an awesome story from a small team having to tackle those challenges and learning as they went. So, while there are some "can't you just" and "what's interesting about this?" comments here, given the narrative of the growth rate and the very high profile of the product, it was the perfect user talk for an internal-development-focused conference.
The key insight, and main message, of the talk was that if your workload is not too write-heavy, you can scale Postgres to very high read throughput with read replicas and only a single master! That is exactly the message that needs to be spelled out, as it covers the vast majority of apps.
As an observation: in the Q&A at the end of the talk, the questions, primarily from core Postgres developers, were focused on learning about the use case, not on suggesting the team was doing anything wrong (not quite how this thread could go). A genuinely awesome group of very friendly and welcoming people in the Postgres community.
Self-hosting Postgres is appealing from a flexibility standpoint (you aren't locked out of superuser or advanced features), but it sounds a little nerve-wracking.
I'm now hoping all the cloud providers read this article and start exposing a feature to disable an index in the query planner before you drop it for real; that should really become standard procedure, just in case.
But if you're a large scale company to the point of wanting to own and customize your stack, it can definitely make sense to self-host.
> Concerning schema changes: they desire PostgreSQL to record a history of schema change events, such as adding or removing columns and other DDL operations.
You can do this today using `EVENT TRIGGER`s. Check out things like Aquameta[0] (if I remember correctly) to see how it's done.
[0] https://github.com/aquametalabs/aquameta
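For anyone curious, a minimal sketch of the event-trigger approach (the table and function names here are made up for illustration, assuming roughly PostgreSQL 11+ for `EXECUTE FUNCTION`):

```sql
-- Audit table for DDL history
CREATE TABLE ddl_history (
    happened_at timestamptz NOT NULL DEFAULT now(),
    username    text        NOT NULL DEFAULT current_user,
    command_tag text        NOT NULL,
    object_id   text
);

-- pg_event_trigger_ddl_commands() reports each DDL command
-- executed by the statement that fired the trigger.
CREATE OR REPLACE FUNCTION log_ddl() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        INSERT INTO ddl_history (command_tag, object_id)
        VALUES (r.command_tag, r.object_identity);
    END LOOP;
END;
$$;

-- Fire at the end of every DDL command (CREATE/ALTER TABLE, etc.)
CREATE EVENT TRIGGER track_ddl ON ddl_command_end
    EXECUTE FUNCTION log_ddl();
```

Note that `sql_drop` is a separate event if you also want to capture dropped objects via `pg_event_trigger_dropped_objects()`.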
What I find odd is that there's no mention of all the other engines that must be in the mix powering different types of queries: I have no doubt they're using a little of everything, from scaled key-value stores to search, vector search, caches... They must be doing somersaults to avoid over-saturating this already over-saturated Postgres environment, yet only Postgres is discussed here.
Not super interesting: this is fairly basic stuff that you'll encounter at orders of magnitude smaller scale than OpenAI. Creating indexes CONCURRENTLY, avoiding table rewrites, smoothing out traffic, transaction timeouts, read replicas... it's pretty much table stakes, even at 10,000x smaller scale.
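For readers who haven't hit this yet, the table-stakes practices above look roughly like this (hypothetical table and index names):

```sql
-- Build the index without blocking writes. Note that
-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
SET statement_timeout = 0;    -- index builds can legitimately be long
SET lock_timeout = '5s';      -- but bail out fast if the brief locks can't be taken
CREATE INDEX CONCURRENTLY IF NOT EXISTS users_email_idx
    ON users (email);

-- Keep application transactions from piling up behind DDL:
-- SET idle_in_transaction_session_timeout = '30s';
```

A failed concurrent build leaves an `INVALID` index behind, which must be dropped (again with `DROP INDEX CONCURRENTLY`) before retrying.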
Their requests to Postgres devs aren't anything new either, everyone has wished for it for years.
The title is kind of misleading: they're not scaling it to the "next level", they're clearly struggling with this single-master setup and trying to keep it afloat while migrating off ("no new workloads allowed"). The main "next scale" point is that they say they can "scale gracefully under massive read loads" - nothing new, that's the whole point of read replicas and horizontal scaling.
Re: "Lao Feng Q&A":
> PostgreSQL actually does have a feature to disable indexes. You can simply set the indisvalid field to false in the pg_index system catalog [...] It’s not black magic.
No. It's not documented for this use, so it's not a feature. It's fooling around with internals without any guarantee of what it will do (it might do what you want today, and might not in the next release). Plus, as they point out, managed Postgres providers don't let you fiddle with this stuff (for good reason, since it is not a feature).
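For context, the catalog tweak being discussed is something like the following, shown only to illustrate that it is an internals hack (unsupported, no stability guarantees, requires superuser, and blocked on managed providers; hypothetical index name):

```sql
-- UNSUPPORTED: directly editing the pg_index system catalog.
-- Marks the index invalid so the planner stops considering it.
UPDATE pg_index
SET    indisvalid = false
WHERE  indexrelid = 'users_email_idx'::regclass;
```

Exactly because this touches internal catalogs, its behavior is whatever the current release happens to do, which is the parent comment's point.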
> there’s a simpler solution [to avoiding accidental deletion of used indexes]: just confirm via monitoring views that the index is not being used on either primary or replicas
That doesn't quite solve the same problems. It's quite common for an index to be in use but not _needed_: another index would also work (e.g. you introduced a new index covering an extra column that this query doesn't use). Being able to disable an index would let you check that the query plan actually uses the other index, rather than hoping and praying.
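In the absence of a real disable-index feature, one common workaround is to drop the index inside a transaction, inspect the plan, and roll back (hypothetical names, and with a big caveat):

```sql
-- Caveat: DROP INDEX takes an ACCESS EXCLUSIVE lock on the table
-- for the duration of the transaction, blocking all other access.
-- Only do this off-peak, or on a restored copy of the database.
BEGIN;
DROP INDEX users_email_idx;
EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';
ROLLBACK;  -- the index comes back untouched
```

The locking caveat is precisely why a planner-level disable switch would be so much nicer.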
I wonder how much their performance could improve if they put the write instance on dedicated servers (with local, very fast SSDs) and used managed services only for the read replicas.
> The presentation also specifically mentioned that using ORM can easily lead to inefficient queries and should be used cautiously.
Every ORM is bad, especially the "any DB" ORMs, because they trick you into thinking about your data patterns in terms of writing application code instead of writing code for the database. And most of the time their features and APIs are abstracted in a way that means you can only use the least-common-denominator features of all the database backends they support.
I've sworn off ORMs entirely. My application is a Postgres application first and foremost. I use PG-specific features extensively. Why would I sacrifice all the power that Postgres offers me just for some conveniences in Python, or Ruby, or whatever?
Nah. Just write good SQL for your database and the whole system will be very happy.
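As a small illustration (hypothetical schema) of the kind of PG-specific SQL that lowest-common-denominator ORM APIs tend to paper over, an atomic upsert with `ON CONFLICT`:

```sql
-- One round trip, no read-modify-write race, rows returned directly.
INSERT INTO accounts (email, login_count)
VALUES ('a@example.com', 1)
ON CONFLICT (email)
DO UPDATE SET login_count = accounts.login_count + 1
RETURNING id, login_count;
```

Doing the same through a generic ORM often means a SELECT, an application-side branch, and an INSERT or UPDATE, with a race window between them.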
OpenAI and these companies hire inexperienced people with zero operational experience, and this is how they run things. It would almost be funny if you couldn't see how unreliable the end result is.
Postgres is powerful but just not suited for this role. But if your only tool is a hammer...
OpenAI: Scaling PostgreSQL to the Next Level
(pixelstech.net) | 160 points by thunderbong 8 hours ago | 110 comments
Comments
Why are they not sharding by user/org yet? It is so simple and would fix the primary issue they are running into. All these workarounds they go through to avoid a straightforward fix.
Do you think that's a good idea? There seem to be many improvements to native logical replication since Postgres 17.
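For reference, native logical replication is set up with publications and subscriptions along these lines (hypothetical table names and connection string):

```sql
-- On the publisher (requires wal_level = logical):
CREATE PUBLICATION app_pub FOR TABLE users, orders;

-- On the subscriber:
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary dbname=app user=repl'
    PUBLICATION app_pub;
```

The subscriber does an initial table copy and then streams changes; schema (DDL) changes are not replicated and must be applied on both sides.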
... You're one of the most well-funded companies in the world; you shouldn't be asking open source devs for features, you should be opening PRs.
https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
https://youtu.be/pUqVCK7Ggh0