ants_a's comments | Hacker News

For updating a single resource where the order of updates matters, the best throughput one can hope for is the inverse of the lock hold duration. Typical Postgres-using applications follow a pattern where a transaction involves multiple round trips between the application and the database, so that decisions can be made in code running on the application server.

But PostgreSQL does not require this pattern; it's possible to run arbitrarily complex transactions entirely server-side using more complex query patterns and/or stored procedures. In that case the locking time is mainly determined by time-to-durability, which, depending on infrastructure specifics, might be one or two orders of magnitude faster. Or, with fast networks and slow disks, it might not have a huge effect.
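
As a minimal sketch of the difference, assuming a hypothetical accounts table and psycopg2 on the application side:

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    # Pattern 1: decision logic lives on the application server, so the row
    # lock is held across two network round trips plus the commit.
    with conn, conn.cursor() as cur:
        cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (42,))
        (balance,) = cur.fetchone()
        if balance >= 100:                      # decision made in app code
            cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (42,))

    # Pattern 2: the decision is pushed into the statement itself, so the row
    # lock is held roughly for the duration of the commit (time-to-durability).
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE accounts SET balance = balance - 100 "
            "WHERE id = %s AND balance >= 100",
            (42,),
        )
        withdrew = cur.rowcount == 1            # outcome reported back to the app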

One can also use batching in PostgreSQL to update the resource multiple times per durability cycle. This requires some extra care from the application writer to avoid getting completely bogged down by deadlocks or serializability conflicts.
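
A rough sketch of what batching might look like, again with hypothetical names:

    # Apply many queued increments to the same hot row in one transaction,
    # so they all share a single WAL flush (one durability cycle).
    import psycopg2

    def flush_batch(conn, account_id, pending_amounts):
        # Summing client-side keeps it to one UPDATE; alternatively each item
        # could be its own statement inside the same transaction.
        total = sum(pending_amounts)
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (total, account_id),
            )
        # One commit -> one durability wait for the whole batch.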

What will absolutely kill you on PostgreSQL is high contention combined with repeatable read or higher isolation levels. PostgreSQL handles update conflicts with optimistic concurrency control, and high contention totally invalidates all of that optimism. So you need to be clever enough to achieve the necessary correctness guarantees with read committed and the funky semantics it has for update visibility, or use some external locking to get rid of the contention in the database. An option for pessimistic locking would be very helpful for these workloads.
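
One way to do that external locking today is an advisory lock taken at the start of the transaction; a hedged sketch, with made-up names:

    import psycopg2

    def adjust_balance(conn, account_id, delta):
        with conn, conn.cursor() as cur:
            # pg_advisory_xact_lock blocks until the lock is free and releases
            # automatically at commit; it keeps concurrent writers ordered
            # without the serialization failures of repeatable read.
            cur.execute("SELECT pg_advisory_xact_lock(%s)", (account_id,))
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (delta, account_id),
            )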

What would also help is a different kind of optimism that removes the durability requirement from the lock hold time, at the cost of readers having to wait for durability. Postgres can do tens of thousands of contended updates per second with this model. See the Eventual Durability paper for details.


I'm wondering if it would make sense to integrate the rim, motor, and wheel bearing into a single assembly to save weight and cost. That, combined with the weight and packaging benefits of not having half shafts and differentials, might make it worth it. Plus there can be additional benefits, like the extra maneuverability that ZF's Easy Turn and Hyundai's e-Corner have demonstrated.

30 kW sustained / 60 kW peak per wheel is easily enough even for large passenger vehicles. The sustained figure could take a 3-ton vehicle up a 10% grade at 120 km/h.
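
As a back-of-envelope sanity check (the vehicle parameters below are assumptions, and aerodynamic drag is left out):

    from math import atan, sin

    mass = 3000          # kg
    grade = 0.10         # 10% grade
    v = 120 / 3.6        # 120 km/h in m/s
    g = 9.81

    climb_power = mass * g * sin(atan(grade)) * v   # ~98 kW against gravity
    rolling_power = mass * g * 0.010 * v            # ~10 kW at Crr ~ 0.010
    print(climb_power / 1000, rolling_power / 1000)
    # ~108 kW before aero drag, against 4 x 30 kW = 120 kW sustained.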


Things like that do exist, though. There's an expensive Renault with motors like this, and there's also the MW Motors Luka.

MW Motors eventually made a version where the electric motors were moved from the wheel hubs to a more conventional arrangement, so presumably they felt it was some sort of problem. But they still make the original version, and I've never been in one, so I can't be sure.


In that snippet are links to Postgres docs and two blog posts, one being the blog post under discussion. None of those contain the information needed to make the presented claims about throughput.

To make those claims it's necessary to know what work is being done while the lock is held. This includes a bunch of resource cleanup, which should be cheap, and RecordTransactionCommit(), which will grab a lock to insert a WAL record, wait for it to get flushed to disk, and potentially also wait for it to be acknowledged by a synchronous replica. So the expected throughput is somewhere between hundreds and tens of thousands of notifies per second. But as far as I can tell, this conclusion is only available from the PostgreSQL source code plus some assumptions about typical storage and network performance.
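
To make the arithmetic behind that range explicit, a tiny sketch; the latency figures are assumptions about typical hardware, not measurements:

    # If a global lock is held for roughly one WAL flush (and possibly a
    # sync-replica round trip), throughput is bounded by the inverse of that latency.
    for name, latency_s in [
        ("local NVMe fsync", 0.0001),        # ~0.1 ms
        ("typical SSD fsync", 0.001),        # ~1 ms
        ("cross-AZ sync replication", 0.002),
        ("spinning disk fsync", 0.01),       # ~10 ms
    ]:
        print(f"{name}: ~{1 / latency_s:,.0f} serialized commits/notifies per second")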


> In that snippet are links to Postgres docs and two blog posts

Yes, that's what a snippet generally is. The generated document from my very basic research prompt is over 300k in length. There are also sources from the official mailing lists, graphile, and various community discussions.

I'm not going to post the entire output because it is completely beside the point. In my original post, I explicitly asked "What is the qualitative and quantitative nature of relevant workloads?" exactly because it's not clear from the blog post. If, for example, they only started hitting these issues at 10k simultaneous reads/writes, then it's reasonable to assume that many people who don't have such high workloads won't really care.

The ChatGPT snippet was included to show that that's what ChatGPT research told me. Nothing more. I basically typed a 2-line prompt and asked it to include the original article. Anyone who thinks that what I posted is authoritative in any way shouldn't be considering doing this type of work.


Triggers are not even particularly slow. They just hide the extra work that is being done and thus sometimes come back to bite programmers by adding a ton of work to statements that look like they should be quick.
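
A contrived illustration of how that happens, with made-up table and trigger names (the orders, order_audit, and customer_stats tables are assumed to exist):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE OR REPLACE FUNCTION audit_order() RETURNS trigger AS $$
            BEGIN
                -- the "hidden" extra work behind every INSERT on orders
                INSERT INTO order_audit VALUES (now(), NEW.customer_id, NEW.total);
                UPDATE customer_stats SET order_count = order_count + 1
                 WHERE customer_id = NEW.customer_id;
                RETURN NEW;
            END;
            $$ LANGUAGE plpgsql;

            CREATE TRIGGER order_audit_trg
                AFTER INSERT ON orders
                FOR EACH ROW EXECUTE FUNCTION audit_order();
        """)
        # Looks like one cheap write, but now does three:
        cur.execute("INSERT INTO orders (customer_id, total) VALUES (%s, %s)", (1, 9.99))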


So my display aspect ratio is 2.5dB. Or is it 5dB because it's not measuring power?


That thread is indeed about the same issue. I don't think anyone has done a more concise writeup on it.

The core of the issue is that on the primary, commit inserts a WAL record, waits for durability (local and/or replicated), and then grabs a lock (ProcArrayLock) to mark itself as no longer running. Taking a snapshot takes that same lock and builds a list of running transactions. The WAL insert and a transaction marking itself visible can happen in a different order for concurrent transactions. This causes a problem on the secondary, which has no knowledge of the apparent visibility order on the primary, so visibility order on the secondary is strictly the order of commit records in the WAL.

The obvious fix would be to make visibility happen in WAL order on the primary too. However, there is one feature that makes that complicated: clients can change the desired durability on a transaction-by-transaction basis. The settings range from confirming the transaction immediately after it is inserted into the WAL stream, through waiting for local durability, all the way up to waiting for it to be visible on synchronous replicas. If visibility happens in WAL order, then an async transaction either has to wait on every higher-durability transaction that comes before it in the WAL stream, or give up on read-your-writes. That's basically where the discussion got stuck, without achieving consensus on which breakage to accept. The same problem is also the main blocker for adopting a logical (or physical) clock based snapshot mechanism.
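
For reference, that per-transaction durability knob is the synchronous_commit setting; a minimal sketch of how clients pick different levels (the table names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    with conn, conn.cursor() as cur:
        # Confirm as soon as the commit record is in the WAL stream, no flush wait:
        cur.execute("SET LOCAL synchronous_commit = off")
        cur.execute("INSERT INTO metrics (v) VALUES (1)")

    with conn, conn.cursor() as cur:
        # Wait for the local WAL flush only:
        cur.execute("SET LOCAL synchronous_commit = local")
        cur.execute("INSERT INTO payments (v) VALUES (1)")

    with conn, conn.cursor() as cur:
        # Wait until synchronous replicas have applied it and readers there can see it:
        cur.execute("SET LOCAL synchronous_commit = remote_apply")
        cur.execute("INSERT INTO payments (v) VALUES (1)")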

By now I'm partial to the option of giving up on read-your-writes, with an opt-in setting to see non-durable transactions as an escape hatch for backwards compatibility. Re-purposing the SQL READ UNCOMMITTED isolation level for this sounds appealing, but I haven't checked if there is some language in the standard that would make that a bad idea.

A somewhat related idea is Eventual Durability, where write transactions become visible before they are durable, but read transactions wait for all observed transactions to become durable before committing.


Thanks to you both! I've updated the article to discuss this, and we've got an update on the AWS blog too. :-)

https://jepsen.io/analyses/amazon-rds-for-postgresql-17.4


It's interesting to wonder why this magic would be needed. Vanilla Postgres supports quorum commit, which can do this. You can also set up an equivalent multi-AZ cluster with Patroni, and (modulo bugs) it does the necessary coordination to make sure primaries are promoted in a way that does not lose transactions or make visible a transaction that is not durable.
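
For reference, vanilla quorum commit is just configuration; a minimal sketch using ALTER SYSTEM, with made-up standby names:

    import psycopg2

    conn = psycopg2.connect("dbname=postgres")
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    with conn.cursor() as cur:
        # With this, a commit (at the default synchronous_commit = on) waits
        # until any one of the two listed standbys has flushed the WAL.
        cur.execute(
            "ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (az2_standby, az3_standby)'"
        )
        cur.execute("SELECT pg_reload_conf()")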

There is still a Postgres deficiency that makes something similar to this pattern possible: non-replicated transactions where the client goes away mid-commit become visible immediately. So in the example, if T1 happens on a partitioned leader and disconnects during commit, T2 also happens on a partitioned node, and T3 and T4 happen later on a new leader, you would see the same result. However, this does not jibe with the statement that fault injection was not done in this test.

Edit: I did not notice the post pointing out that this pattern can be explained by inconsistent commit order on the replica and primary. Kind of embarrassing, given I've done a talk proposing how to fix exactly that.


Link the talk video



Yes and no. On the primary, durability order and visibility order are different. So an async transaction that starts committing later can become visible to readers before a sync transaction that comes before it.

A read-write sync transaction that reads the result of a non-durable transaction cannot commit ahead of the async transaction. But there is no such guarantee for read-only transactions. So a transaction could read a non-durable state, decide that no further action is needed, and have that decision be invalidated by a crash/failover.

To make things even worse, similar things can happen with synchronously replicated transactions. If a committing transaction waiting on replication is cancelled, it becomes visible immediately. A reasonably written application would retry the transaction, see that the result is there, and conclude that no action is necessary, even though the data is not replicated and would be lost if a failover then happened.
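
A sketch of the retry pattern that gets burned by this; all names are illustrative:

    import psycopg2

    def ensure_payment(payment_id):
        conn = psycopg2.connect("dbname=app")  # fresh connection after the failed commit
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1 FROM payments WHERE id = %s", (payment_id,))
            if cur.fetchone():
                # "Already recorded" -- but this may be reading the locally visible,
                # not-yet-replicated commit, which a failover would lose.
                return "already recorded"
            cur.execute("INSERT INTO payments (id) VALUES (%s)", (payment_id,))
            return "recorded"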


Yeah, these are the kinds of subtle issues I expected the article might be either glossing over or unaware of.


There is a paper exploring this concept: https://cs.uwaterloo.ca/~kdaudjee/ED.pdf

UI-wise it does not make sense to have this distinction, as the window to achieve durability is a small fraction of a second. But for concurrent modifications, the reduction in lock duration can mean an order of magnitude more throughput.


Consumers already have the choice of not buying the most powerful card.

Increasing die size to run cores in a more power-efficient regime is not going to work, because a) the chips are already as big as they can be made, and b) competition will still push companies to run the 250 cores uber fast and figure out some way to push enough power to them.

As long as there is customer demand for this, these things will get built; given the amount of bad press the melted connectors create, possibly with better-engineered power delivery systems.

