While Postgres is indeed case insensitive, writing queries with keywords in all caps is usually an effort to increase legibility through visual pattern matching. It absolutely isn't needed, but if I'm debugging a query of yours, I will send it through my prettifier so that I can breeze through your definitions without getting hung up on minor syntax quirks.
It's like prettification in any other language - visual structures that we can quickly recognize (like consistent indentation levels) make us waste less time on comprehension of the obvious so we can focus on what's important.
The only thing I really object to is "actuallyUsingCaseInIdentifiers". I never want to see columns that require double quotes for me to inspect on the CLI.
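For anyone who hasn't been bitten by this yet, a minimal illustration (table names made up):

    -- Unquoted identifiers are folded to lowercase, so these all match:
    CREATE TABLE users (user_id int);
    SELECT USER_ID FROM Users;            -- fine

    -- A quoted mixed-case identifier must be quoted, with exact case, forever:
    CREATE TABLE accounts ("accountId" int);
    SELECT accountid FROM accounts;       -- ERROR: column "accountid" does not exist
    SELECT "accountId" FROM accounts;     -- works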
A lot of these aren't Postgres-specific (null weirdness, index column order, etc.).
For example, how nulls work - especially how they interact with indexes and unique constraints - is also non-intuitive in MySQL.
If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.
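A minimal sketch of that gotcha (names made up):

    CREATE TABLE users (
        email    text NOT NULL,
        username text,
        UNIQUE (email, username)
    );

    -- Both succeed: NULL is never equal to NULL, so the rows don't conflict.
    INSERT INTO users (email, username) VALUES ('a@example.com', NULL);
    INSERT INTO users (email, username) VALUES ('a@example.com', NULL);

    -- Postgres 15+ can opt out of this behavior:
    --   UNIQUE NULLS NOT DISTINCT (email, username)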
I didn't know this when I started, so I never vacuumed the reddit databases. Then one day I was forced to, and it took reddit down for almost a day while I waited for it to finish.
> Normalize your data unless you have a good reason not to
Ouch. You don't want to just say that and move on.
The author even linked to a page citing 10 different kinds of normalization (11 with the "non-normalized"). Most people don't even know what those are, and have no use for 7 of those. Do not send people on wild-goose chases after those higher normal forms.
this is really nice. i am glad the author put it together. i didn't know the pg docs were 3200 pages long! i have been using it for a while and try to learn as i go. i really do like the docs. and i also like to read articles on various particular subjects as i find a need to.
i think the author might find it helpful to readers to add a note to https://challahscript.com/what_i_wish_someone_told_me_about_... that if someone is selecting for b alone, then an index on columns (b, a) would work fine. i think this is kind of implied when they talk about selecting on a alone, but maybe it wouldn't hurt to be extra explicit.
(i didn't spend much time on the json/jsonb part since i rarely use that stuff)
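a quick sketch of that column-order point, with a made-up table t (columns a and b):

    CREATE INDEX t_b_a_idx ON t (b, a);

    SELECT * FROM t WHERE b = 42;             -- can use the index (b is the leading column)
    SELECT * FROM t WHERE b = 42 AND a = 7;   -- can use the index fully
    SELECT * FROM t WHERE a = 7;              -- generally can't use it efficiently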
From my experience with a lot of the hilarious SQL stuff I have seen in the wild: it would be a good start to read Codd's paper and try to understand what the relational model is. It's only 11 pages long, and doing that would reduce the suffering in this world.
Nice article! One thing I'd add is that almost all of it applies to other MVCC databases like MySQL too. While some details might be different, it too suffers from long transactions, holds metadata locks during ALTERs, etc. - all the good stuff :).
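(If you want to catch the long transactions before they bite, a query along these lines against pg_stat_activity is a reasonable start - the five-minute threshold is arbitrary:)

    -- Transactions that have been open for more than five minutes, oldest first
    SELECT pid, state, now() - xact_start AS xact_age, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > interval '5 minutes'
    ORDER BY xact_age DESC;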
I had an interesting problem with the pg stats. We were migrating and had a version column, i.e. (key, val, version).
We were migrating from version 1 to version 2, double writing into the same table. An index on (key, val, version) was being hit by our reader process using a where clause like key=k and version=1.
When we flipped the reader to read version 2, the latency jumped from 30ms to 11s. EXPLAIN showed a sequential scan even though the index could serve the query. I was able to use MATERIALIZED and reorder CTEs to get the planner to do the right thing, but it caused an outage.
We were autovacuuming as well. I ended up deleting the old version and rebuilding the index.
My theory is that because the read load was on 50% of the data, the stats were super skewed.
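For readers who haven't seen the MATERIALIZED escape hatch, here's the rough shape of that kind of fix, with made-up names; a manual ANALYZE is also worth trying when the stats are skewed:

    -- AS MATERIALIZED (Postgres 12+) forces the CTE to be evaluated on its
    -- own instead of being inlined, taking the bad plan choice away:
    WITH matching AS MATERIALIZED (
        SELECT key, val
        FROM events
        WHERE key = 'k' AND version = 2
    )
    SELECT * FROM matching;

    -- Recomputing statistics by hand can also fix bad row estimates:
    ANALYZE events;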
Only some of these are really Postgres specific (use "text" / "timestamptz"; make psql more useful; copy to CSV). Most of them apply to relational databases in general (learn how to normalise!; yes, NULL is weird; learn how indexes work!; locks and long-running transactions will bite you; avoid storing and querying JSON blobs of doom). Not that that detracts from the usefulness of this article - pretty much all of them are important things to know when working with Postgres.
> It’s possible that adding an index will do nothing
This is one of the more perplexing things to me, where the Postgres ideology is a bit too strong - or at least, the way it works is too hard for me to understand (and I've tried; I'm not going to claim I'm a genius, but I'm also not a moron). I hear there may finally be support for some kind of hints in upcoming versions, which would be very welcome. I've spent way too much time trying to divine the voodoo of why a slow query is not using indexes when it seems obvious that it should.
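(Not hints, but one escape hatch that exists today is the planner's cost toggles - disabling sequential scans for a session shows whether the planner is even able to use the index, and what it thinks each plan costs. The query and table here are hypothetical:)

    SET enable_seqscan = off;   -- session-local; makes seq scans look prohibitively expensive
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
    RESET enable_seqscan;

    -- If the index still isn't used, the planner genuinely can't use it;
    -- if it is used and is faster, the cost estimates (or the stats) are off.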
That nested JSON query operator chains such as json_col->'foo'->'bar'->>'baz' internally return (copy) entire sub-objects at each level and can be much slower than jsonb_path_query(json_col, '$.foo.bar.baz') for large JSONB data
... although I haven't had the chance to verify this myself
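(For anyone who wants to benchmark it, the two forms look like this - table and JSON structure made up:)

    -- Operator chain: each -> step produces an intermediate jsonb value
    SELECT json_col->'foo'->'bar'->>'baz' FROM docs;

    -- jsonpath: one traversal of the document (returns jsonb, not text)
    SELECT jsonb_path_query(json_col, '$.foo.bar.baz') FROM docs;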
Since we are on the topic, and since your article clearly mentions "Normalize your data unless you have a good reason not to," I had to ask. I am trying to build a news aggregator, and I have many websites, each with a slightly different format. Even though I use feedparser in Python, it still doesn't change how some of them put HTML text inside content, while others break it down into a separate media XML attribute while retaining only a basic textual summary inside a summary attribute.

Do you think it makes more sense to store a single RSS item as a column inside Postgres, or should it be stored after parsing? I can see upsides and downsides to both approaches. Store it as XML and you keep the ability to change your processing logic down the line for each stored item, but you lose the flexibility of querying metadata, and you also have to parse it on the fly every single time. Store it in multiple columns after processing, and it may require different logic for different websites, plus changing your overall Python processing logic requires a lot of thinking about how it might affect each source. What do you recommend?
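Not who you asked, but a common middle ground is to do both: keep the raw item so you can reprocess it later, plus a few parsed columns for querying. A sketch, with all names made up:

    CREATE TABLE feed_items (
        id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        source       text NOT NULL,         -- which site/feed this came from
        guid         text NOT NULL,         -- RSS guid, or a fallback like the link
        raw_item     xml  NOT NULL,         -- the original <item>, kept for reparsing
        title        text,                  -- parsed, normalized fields for querying
        summary      text,
        published_at timestamptz,
        UNIQUE (source, guid)
    );

You can always re-derive the parsed columns from raw_item when your per-site logic changes, but you can't recover anything you never stored.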
I'd say I tend to ignore the standard docs because they rarely have examples and rely on the arcane procedure of deciphering the super-command options with all the "[OR THIS|THAT]".
I assume _someone_ can read this pseudo programming, but it's not me.
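(FWIW, the notation is mostly two rules: square brackets mean optional, braces with pipes mean pick exactly one. An abridged CREATE INDEX synopsis, for example:)

    CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ]
        ON table_name [ USING method ]
        ( column_name [ ASC | DESC ] [, ...] )

    -- [ X ]      optional
    -- { A | B }  choose exactly one
    -- [ A | B ]  optionally choose one
    -- [, ...]    the previous element may repeat, comma-separated

    -- One valid expansion:
    CREATE UNIQUE INDEX CONCURRENTLY users_email_idx
        ON users USING btree (email ASC);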
Watch out with row/record values: IS NULL is true only when every field in the row is NULL, and IS NOT NULL is true only when every field is non-NULL - so a row with a mix of NULL and non-NULL fields passes neither test. You want to use IS [NOT] DISTINCT FROM NULL, full stop.
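Concretely (the first three match the documented row-comparison rules; the last is the recommended workaround):

    SELECT ROW(1, NULL) IS NULL;       -- false: not every field is null
    SELECT ROW(1, NULL) IS NOT NULL;   -- also false: not every field is non-null!
    SELECT ROW(NULL, NULL) IS NULL;    -- true

    SELECT ROW(1, NULL) IS DISTINCT FROM NULL;   -- true: the row itself isn't NULL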
>"You don't need to write SQL in uppercase"

>"What's an index?" section

From reading this, it's clear that the author never sat down and learned to use databases from the ground up. The author started using them and learned as he went, so his "tips" include things you'll be told in the first hour of any course.
This doesn't hold any value for someone who's been using databases for almost any length of time.
Because 'null' in the JSON spec is a literal value (a constant), not SQL's NULL. Nothing to see here.
https://datatracker.ietf.org/doc/html/rfc7159
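(A quick demonstration of the difference:)

    SELECT ('{"a": null}'::jsonb)->'a';             -- JSON null (a jsonb value), not SQL NULL
    SELECT (('{"a": null}'::jsonb)->'a') IS NULL;   -- false
    SELECT ('{"a": null}'::jsonb)->>'a';            -- SQL NULL: ->> converts JSON null
    SELECT ('{"a": null}'::jsonb)->'missing';       -- SQL NULL: the key is absent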
Great and super useful notes
The format of the Postgres Don't Do This page [0]:

Don't <thing>
Why not?
When should you?

[0] https://wiki.postgresql.org/wiki/Don%27t_Do_This
This bit me. It's highly tedious that case isn't preserved, honestly.
Oracle enters the chat... (last I used it, it considered an empty string '' the same as NULL)
At this point "journey" is a cringe word because of its excessive usage in blogspamverts. It tells me you are a gross aspiring influencer.
>"You don't need to write SQL in uppercase"
>"What's an index ?" section
From reading this, it's clear that the author never sat down and learned to use databases from the ground up. The author started using them and learned as he went, so his "tips" include things you'll be told in the first hour of any course.
This doesn't hold any value for someone who's been using databases for almost any length of time.
Then ask it to rewrite the query three different ways and explain the pros and cons of each.
Do the same for any existing queries in your system … drop in the schema, then the query, and ask for an analysis and suggested improvements.