mbrt's comments | Hacker News

This builds on the same intuition I had, where data can be easily partitioned across objects. What seems to be missing is transactions across different objects though?

The flipside is that Cloudflare DO will be a lot faster.

Interesting that all these similar solutions are popping up now.

I think it would be interesting to combine a SQLite per-object approach with transactions on top of different objects.


You're right indeed :) but it depends on what you're comparing it with. In this case the comparison is against other managed cloud storage and databases, and in that context I think the claim holds.

Is it the cheapest possible storage in existence? No: if you take raw disks and put them in a rack you can do better, but I also feel that wouldn't be an entirely fair comparison.


S3 is one of the most expensive platforms out there, however. Look at Backblaze B2 for an example of just HOW expensive S3 is.

When I moved from S3 to DO, my bill went from hundreds to $20/mo. The only thing that changed was the hosting provider.


B2 is mostly S3-compatible, so if they add the same support for preconditions on writes as S3 and GCS, nothing prevents using it as a backend for GlassDB.


It's a good observation, because I did consider it and decided to keep it out of scope for the base layer.

But this is entirely possible. You can wrap GlassDB transactions and encode multiple keys into the same object at a higher level. Transactions across different objects will still preserve the same isolation.
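A rough sketch of what I mean, in Go (the Tx interface here is hypothetical and just for illustration, not the actual GlassDB API):

    // Pack several logical keys into a single object, so that a transaction
    // over a handful of objects effectively covers many keys at once.
    package sketch

    import (
        "context"
        "encoding/json"
    )

    // Tx is a stand-in for a transaction handle (hypothetical API).
    type Tx interface {
        Read(ctx context.Context, key string) ([]byte, error)
        Write(ctx context.Context, key string, value []byte) error
    }

    // setKeys updates multiple logical keys packed into one object
    // (assumes the object already exists; not-found handling is elided).
    func setKeys(ctx context.Context, tx Tx, obj string, kv map[string]string) error {
        raw, err := tx.Read(ctx, obj)
        if err != nil {
            return err
        }
        packed := map[string]string{}
        if len(raw) > 0 {
            if err := json.Unmarshal(raw, &packed); err != nil {
                return err
            }
        }
        for k, v := range kv {
            packed[k] = v
        }
        out, err := json.Marshal(packed)
        if err != nil {
            return err
        }
        return tx.Write(ctx, obj, out)
    }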

The current version is meant to be a base from which to build higher level APIs, somewhat like FoundationDB.


Nice, thanks for the reference!

BTW, the comparison was only to give an idea about isolation levels, it wasn't meant to be a feature-to-feature comparison.

Perhaps I didn't make it prominent enough, but at some point I say that many SQL databases have key-value stores at their core, and implement a SQL layer on top (e.g. https://www.cockroachlabs.com/docs/v22.1/architecture/overvi...).

Basically SQL can be a feature added later to a solid KV store as a base.
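As a toy illustration of that layering, here's how a SQL row could be flattened into KV pairs (the key encoding is made up and much simpler than what CockroachDB actually does):

    package main

    import "fmt"

    // rowToKV flattens one SQL row into key-value pairs, one entry per
    // column, with the key /<table>/<primary key>/<column>.
    func rowToKV(table, pk string, cols map[string]string) map[string]string {
        kv := make(map[string]string)
        for col, val := range cols {
            kv["/"+table+"/"+pk+"/"+col] = val
        }
        return kv
    }

    func main() {
        // INSERT INTO users (id, name, email) VALUES (42, 'ada', 'ada@example.com')
        for k, v := range rowToKV("users", "42", map[string]string{
            "name":  "ada",
            "email": "ada@example.com",
        }) {
            fmt.Println(k, "=>", v)
        }
    }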


Wow, coincidentally I posted GlassDB (https://news.ycombinator.com/item?id=42164058) a couple of days ago. Making S3 strongly consistent is not trivial, so I'm curious about how you achieved this.

If the caching layer can return success before writing through to S3, it means you built a strongly consistent, distributed in-memory database.

Or the consistency guarantee is actually weaker, or data is partitioned and cannot be quickly shared across clients.

I'm really curious to understand how this was implemented.


Hey, thanks for reaching out. The caching layer does return success before writing to S3 -- that's how we get good performance for all operations, including those which aren't possible to do in S3 efficiently (such as random writes, renames, or file appends). Because the caching layer is durable, we can safely asynchronously apply these changes to the S3 bucket. Most operations appear in the S3 bucket within a minute!


Very nice, I like the approach. I assume data is partitioned and each file is handled by an elected leader? If data is replicated, you still need a consensus algorithm on updates.

How are concurrent updates to the same file handled? Either only one client can open a file for writing at any one time, or you need fencing tokens.
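(By fencing tokens I mean something along these lines; a minimal sketch with made-up names:)

    package sketch

    import (
        "errors"
        "sync"
    )

    // fencedFile rejects writes carrying a token older than the newest one
    // it has seen, so a stale lock holder can't clobber newer data.
    type fencedFile struct {
        mu        sync.Mutex
        lastToken uint64
        data      []byte
    }

    func (f *fencedFile) Write(token uint64, data []byte) error {
        f.mu.Lock()
        defer f.mu.Unlock()
        if token < f.lastToken {
            return errors.New("stale fencing token: a newer writer holds the lock")
        }
        f.lastToken = token
        f.data = data
        return nil
    }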


Without getting too much into internals which could change at any time, yes. You have to replicate, partition, and serve consensus over data to achieve high-durability and availability.

For concurrent updates, the standard practice for remote file systems is to use file locking to coordinate concurrent writes. Otherwise, NFS doesn't have any guarantees about WRITE operation ordering. If you're talking about concurrent writes which occur from NFS and S3 simultaneously, this leads to undefined behavior. We think that this is okay if we do a good job at detecting and alerting the user if this occurs because we don't think that there are applications currently written to do this kind of simultaneous data editing (because Regatta didn't exist yet).


Thanks for the details!

Consistency at the individual file can be guaranteed this way, but I don't think this works across multiple files (as you need a global total order of operations). In any case, this is a pragmatic solution, and I like the tradeoffs. Comparing against NFS rather than Spanner seems the right way to look at it.


This is actually also interesting, in that I don’t think that the file system paradigm actually requires a global total ordering of operations (and, in fact, many file systems don’t provide this). I know that sounds like snapshots wouldn’t be valid, but I think that applications which really care about data consistency (such as databases) are built specifically to handle this (with things like write-ahead-logs).


Regatta is a write-through cache for an S3 bucket under its supervision? I guess external changes to that bucket are then a no-no?

Any plans to expand to other stores, like R2 (I ask since unlike S3, R2 egress is free)?


Hey there, that's sort of the correct way to think about it -- notably that our caching layer is high-durability, so we can keep recent writes in the cache safely. External changes to the bucket are okay! Lots of customers need to (for example) ingest data into S3, then process it on a file system, and that totally works. The only thing that isn't supported is editing the same file from both S3 and the file system simultaneously. We think this is a super rare case, and probably doesn't exist today (because there isn't anything that bridges S3 and file semantics yet).

We support all S3-compatible storage services today, including R2, GCS, and MinIO.


I actually asked about R2 to see if Regatta's pricing is any different as there's no egress fee. I should have been clearer.

btw, thanks a bunch for answering my Q & everyone else's too (except for parts where you couldn't talk about the implementation, understandably so). Appreciate it. Wishing the best.


I think it's now much easier to achieve than a year ago. The critical feature is conditional writes on new objects, because otherwise you can't safely create transaction logs in the presence of timeouts. This is not enough though.

My approach on S3 would be to make sure the ETag of an object changes whenever other transactions looking at it must be blocked. This makes it easier to use conditional reads (https://docs.aws.amazon.com/AmazonS3/latest/userguide/condit...) on COPY or GET operations.

For writes, I would use PUT on a temporary staging area and then a conditional COPY + DELETE afterwards. This is certainly slower than GCS, but I think it should work.

Locking without modifying the object is the part that needs some optimization though.
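To make that concrete, here is a rough sketch with the AWS SDK for Go v2 (bucket, keys and ETags are made up; error handling is mostly elided):

    package main

    import (
        "bytes"
        "context"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        client := s3.NewFromConfig(cfg)
        bucket := aws.String("my-bucket") // hypothetical bucket

        // 1. Conditional write on a new object (If-None-Match: "*"): create
        //    the transaction log entry only if nobody else did it first.
        _, err = client.PutObject(ctx, &s3.PutObjectInput{
            Bucket:      bucket,
            Key:         aws.String("txlog/tx-0001"),
            Body:        bytes.NewReader([]byte("begin")),
            IfNoneMatch: aws.String("*"),
        })
        if err != nil {
            // 412 Precondition Failed: another client won the race; abort or retry.
        }

        // 2. Conditional read (If-Match): succeeds only if the object still
        //    has the ETag we observed, i.e. no conflicting transaction touched it.
        _, err = client.GetObject(ctx, &s3.GetObjectInput{
            Bucket:  bucket,
            Key:     aws.String("data/obj-1"),
            IfMatch: aws.String("\"etag-we-read-before\""), // hypothetical ETag
        })
        _ = err

        // 3. Write path: PUT to a staging key, COPY it over the final key
        //    with a precondition on the staged object's ETag, then DELETE
        //    the staging copy.
        _, _ = client.PutObject(ctx, &s3.PutObjectInput{
            Bucket: bucket,
            Key:    aws.String("staging/obj-1"),
            Body:   bytes.NewReader([]byte("new value")),
        })
        _, _ = client.CopyObject(ctx, &s3.CopyObjectInput{
            Bucket:            bucket,
            Key:               aws.String("data/obj-1"),
            CopySource:        aws.String("my-bucket/staging/obj-1"),
            CopySourceIfMatch: aws.String("\"staged-etag\""), // hypothetical ETag
        })
        _, _ = client.DeleteObject(ctx, &s3.DeleteObjectInput{
            Bucket: bucket,
            Key:    aws.String("staging/obj-1"),
        })
    }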


And I see more possibilities now that https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3... is available. It will get easier and easier to build serverless data lakes, streaming, queues.


It's my understanding that the newer generation of data lakes still makes use of a tiny, strongly consistent metadata database to keep track of what is where. This is orders of magnitude smaller than what you'd have by putting everything in the same database, but it's still there. This is also the case in newer data streaming platforms (e.g. https://www.warpstream.com/blog/kafka-is-dead-long-live-kafk...).

I'm curious to hear if you have examples of any database using only object storage as a backend, because back when I started, I couldn't find any.


> I'm curious to hear if you have examples of any database using only object storage as a backend, because back when I started, I couldn't find any.

Take a look at Delta Lake

https://notes.eatonphil.com/2024-09-29-build-a-serverless-ac...


Wow, not sure how I missed this, but I see many similarities. They were also bitten by lack of conditional writes in S3:

> In Databricks service deployments, we use a separate lightweight coordination service to ensure that only one client can add a record with each log ID.

The key difference is that Delta Lake implements MVCC and relies on a total ordering of transaction IDs, something I didn't want to do in order to avoid forced synchronization points (multiple clients need to fight for IDs). This is certainly a trade-off: in my case you are forced to read the latest version or retry (but then you get strict serializability), while in Delta Lake you can rely on snapshot isolation, which might give you slightly stale but consistent data and minimizes retries on reads.

It also seems that you can't get transactions across different tables? Another interesting tradeoff.


Love your article by the way. Not an expert but off the top of my head:

https://docs.datomic.com/operation/architecture.html

(However they cheat with dynamo lol)

There's also some listed here

https://davidgomes.com/separation-of-storage-and-compute-and...


OK, thanks for the reference. Yeah, so indeed separating storage and compute is nothing new. Definitely not claiming I invented that :)

And as you mention, Datomic uses DynamoDB as well (so, not a pure s3 solution). What I'm proposing is to only use object storage for everything, pay the price in latency, but don't give up on throughput, cost and consistency. The differentiator is that this comes with strict serializability guarantees, so this is not an eventually consistent system (https://jepsen.io/consistency/models/strong-serializable).

No matter how sophisticated the caching is, if you want to retain strict serializability, writes must be confirmed by S3 and reads must validate in S3 before returning, which puts a lower bound on latency.

I focused a lot on throughput, which is the one we can really optimize.

Hopefully that's clear from the blog, though.



I just saw it! I asked a question (https://news.ycombinator.com/item?id=42180611) and it seems that durability and consistency are implemented at the caching layer.

Basically an in-memory database which uses S3 as cold storage. Definitely an interesting approach, but no transactions AFAICT.


> if you have examples of any database using only object storage as a backend

I think DuckDB is very close to this. It's a bit different, because it's mostly for read-heavy workloads.

https://duckdb.org/docs/extensions/httpfs/s3api

(BTW great article, excellent read!)


Author here. I understand your concerns, and in fact one of the last things I added is unit tests: https://github.com/mbrt/gmailctl/blob/master/README.md#tests

Getting things right is not easy, especially in complex languages like jsonnet. In my own experience though, every time you limit the expressivity of config languages you end up inventing even more complex workarounds (e.g. external templating and things like that).


I don't know of any about Rust, but indeed Fabien has a really good one about game engines.


Yes, I used Inkscape, but I'm not a designer, so I bet you can do something similar with a little bit of effort :) I'm not a fan of high-level tools... they usually give a correct result, but a crappy look and feel...

