mildbyte's Hacker News comments

Funnily enough, I did a Sudoku one too (albeit with Poetry) a few years ago: https://github.com/mildbyte/poetry-sudoku-solver


Another difference is that this solution uses parquet_fdw, which handles fast scans through Parquet files and filter pushdown via row group pruning, but doesn't vectorize the groupby / join operations above the table scan in the query tree (so you're still using the row-by-row PG query executor in the end).
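To make the distinction concrete, here's a minimal sketch of the parquet_fdw setup in question (table name, columns and file path are hypothetical):

```sql
-- Load the FDW and point a foreign table at a Parquet file
CREATE EXTENSION parquet_fdw;
CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;
CREATE FOREIGN TABLE trips (
    pickup_time  timestamp,
    total_amount double precision
) SERVER parquet_srv
OPTIONS (filename '/data/trips.parquet');

-- The scan and row-group pruning happen inside the FDW, but the
-- aggregation below still runs through the row-by-row PG executor
SELECT date_trunc('month', pickup_time), sum(total_amount)
FROM trips GROUP BY 1;
```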

pg_analytics uses DataFusion (dedicated analytical query engine) to run the entire query, which can achieve orders of magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).


This is definitely something I intend to fix.

My initial intent was to use DuckDB for fast vectorized query execution, but I wasn't able to create a planner / execution hook that uses DuckDB internally. Will definitely check out pg_analytics / DataFusion to see if the same can be integrated here as well. Thanks for the pointers.


Have you seen duckdb_fdw (https://github.com/alitrack/duckdb_fdw)? IIRC it's based on sqlite_fdw, but points the outbound queries at DuckDB instead of SQLite, and it does handle running aggregations inside DuckDB. Could be useful.
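If it follows the sqlite_fdw conventions it's built on, the setup would look roughly like this (database path and schema names are assumptions, not taken from the duckdb_fdw docs):

```sql
-- Hypothetical sketch: point the FDW at a DuckDB database file
CREATE EXTENSION duckdb_fdw;
CREATE SERVER duckdb_srv FOREIGN DATA WRAPPER duckdb_fdw
    OPTIONS (database '/data/analytics.duckdb');

-- sqlite_fdw supports schema import; assuming duckdb_fdw does too
IMPORT FOREIGN SCHEMA public FROM SERVER duckdb_srv INTO public;
```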


This is great, thanks so much! I'll see if I can integrate this and how it compares to parquet_fdw.


Damn, literally a day after I wrote up my experiments[0] with LLaVA 1.5 and computing image embeddings. Interesting to see the performance with the fine-tuned Mistral-7B variant being pretty close to the one with Vicuna-13B - using Mistral 7B is what BakLLaVA did back with LLaVA 1.5.

[0] https://mildbyte.xyz/blog/llama-cpp-python-llava-gpu-embeddi...


You can indeed run LiteFS by yourself, without Consul, as a sidecar / wrapper around your application. We do it in our project and have a Docker Compose example at [0]. In this case, you specify a specific known leader node. We haven't tried getting it running independently with Consul to do leader election / failover.

[0] https://github.com/splitgraph/seafowl/blob/main/examples/lit...


I mentioned it recently[0], but this looks like a very good topic to plug our new database, Seafowl, that we released last year [1]. It also uses Apache DataFusion (like IOx) and separates storage and compute (like Neon, Snowflake etc) but is designed for client-side Web apps to run analytical SQL queries over HTTP (using semantics that make the query results cacheable by browser caches and CDNs). This makes it really useful for things like interactive visualizations or dashboards.

We're currently doing a lot of work at Splitgraph to reposition the product around this "analytics at the edge" use case, with usage-based billing, eventually moving our query execution from PostgreSQL to Seafowl.

[0] https://news.ycombinator.com/item?id=34175545

[1] https://seafowl.io


You can use ClickHouse as a stateless query engine to process externally hosted datasets.

This way you just publish a MergeTree table to an S3 bucket or as a set of static files on an HTTP server, and then plug it into an empty ClickHouse server.
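A rough sketch of both approaches (bucket name, columns and the table UUID are placeholders):

```sql
-- Query Parquet files in S3 directly, with no local state
SELECT count(*), avg(total_amount)
FROM s3('https://my-bucket.s3.amazonaws.com/trips/*.parquet', 'Parquet');

-- Attach a MergeTree table published as static files over HTTP
-- (assumes a read-only 'web' disk is configured on the server;
-- the UUID must match the one the table was published with)
ATTACH TABLE trips UUID 'aaaaaaaa-0000-0000-0000-000000000000'
    (pickup_time DateTime, total_amount Float64)
    ENGINE = MergeTree ORDER BY pickup_time
    SETTINGS storage_policy = 'web';
```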

ClickHouse also wins in the comparison of serverless engines: https://datastudio.google.com/u/0/reporting/c870737c-e8b6-46...

Discussed here: https://github.com/ClickHouse/ClickHouse/issues/43589#issuec...


Your first link is not the correct one (click the timestamp of the comment to get the exact link)


Thanks for the catch! Fixed.


> why not just accept SQL and cut out all the unnecessary mapping?

You might be interested in what we're building: Seafowl, a database designed for running analytical SQL queries straight from the user's browser, with HTTP CDN-friendly caching [0]. It's a second iteration of the Splitgraph DDN [1] which we built on top of PostgreSQL (Seafowl is much faster for this use case, since it's based on Apache DataFusion + Parquet).

The tradeoff for allowing the client to run any SQL vs a limited API is that PostgREST-style queries have a fairly predictable and low overhead, but aren't as powerful as fully-fledged SQL with aggregations, joins, window functions and CTEs, which have their uses in interactive dashboards to reduce the amount of data that has to be processed on the client.
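As an illustration, here's the kind of dashboard query that's awkward to express behind a PostgREST-style API but trivial in full SQL (table and column names are hypothetical):

```sql
-- Month-over-month change per category: a CTE plus a window
-- function, computed server-side so the client receives a few
-- hundred aggregated rows instead of the raw data
WITH monthly AS (
    SELECT date_trunc('month', created_at) AS month,
           category,
           sum(amount) AS total
    FROM orders
    GROUP BY 1, 2
)
SELECT month, category, total,
       total - lag(total) OVER (PARTITION BY category ORDER BY month)
           AS mom_change
FROM monthly
ORDER BY month, category;
```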

There's also ROAPI [2], a read-only SQL API that you can deploy in front of a database / other data source (though when using a database as the data source, it only works for tables that fit in memory).

[0] https://seafowl.io/

[1] https://www.splitgraph.com/connect

[2] https://github.com/roapi/roapi


It's possible! Currently this is running GROUP BY queries using Socrata's query API on the original government data portal. We're adding the ability to import data from these sources into a columnar format in the future, either into Splitgraph itself or syncing the data out into Seafowl (https://seafowl.io/) which uses Parquet and is much faster.

Technically, the ability is already there (you can add a dataset to Splitgraph and select Socrata as a source if you know the dataset ID), but it's not as turnkey as landing on a dataset page and clicking a button. More to come!


And if you're looking for a similar experience (very fast analytical SQL queries) but over HTTP, for example, to power a public dashboard or a visualization, you can try ROAPI [0] or Seafowl [1], also built on top of DataFusion (disclaimer: working on Seafowl):

[0]: https://github.com/roapi/roapi

[1]: https://github.com/splitgraph/seafowl


That's what I thought about ROAPI as well, until I benchmarked it, and it ended up being very slow[0].

[0]: https://news.ycombinator.com/item?id=32970495


It could be the NDJSON parser (DF source: [0]) or could be a variety of other factors. Looking at the ROAPI release archive [1], it doesn't ship with the definitive `columnq` binary from your comment (EDIT: it does, I was looking in the wrong place! https://github.com/roapi/roapi/releases/tag/columnq-cli-v0.3...), so it could also have something to do with compilation-time flags.

FWIW, we use the Parquet format with DataFusion and get very good speeds, similar to DuckDB [2]: e.g. 1.5s to run a more complex aggregation query `SELECT date_trunc('month', tpep_pickup_datetime) AS month, COUNT(*) AS total_trips, SUM(total_amount) FROM tripdata GROUP BY 1 ORDER BY 1 ASC` on a 55M-row subset of NY Taxi trip data.

[0]: https://github.com/apache/arrow-datafusion/blob/master/dataf...

[1]: https://github.com/roapi/roapi/releases/tag/roapi-v0.8.0

[2]: https://observablehq.com/@seafowl/benchmarks


Yes, DataFusion itself is definitely fast, no denying that.


Just wanted to also give a shout out to Apache DataFusion[0] that IOx relies on a lot (and contributes to as well!).

It's a framework for writing query engines in Rust that takes care of a lot of heavy lifting around parsing SQL, type casting, constructing and transforming query plans and optimizing them. It's pluggable, making it easy to write custom data sources, optimizer rules, query nodes etc.

It has very good single-node performance (there's even a way to compile it with SIMD support), and Ballista [1] extends it into a distributed query engine.

Plenty of other projects besides IOx use it, including VegaFusion, ROAPI and Cube.js's pre-aggregation store. We're heavily using it to build Seafowl [2], an analytical database that's optimized for running SQL queries directly from the user's browser (caching, CDNs, low latency, some WASM support, all that fun stuff).

[0] https://github.com/apache/arrow-datafusion

[1] https://github.com/apache/arrow-ballista

[2] https://github.com/splitgraph/seafowl


DataFusion is great, and we're happy to be contributing to it. Also excited to see so many people around the world picking it up and contributing as well. With our development efforts on IOx, it's like a strong tailwind. But we also put a ton of effort into helping manage community efforts (thanks, alamb, our developer on IOx who is also on the Arrow PMC!).


Original author of DataFusion/Ballista here. Having alamb and others from InfluxData involved has been a huge help in driving the project forward and helping build an active community behind the project. It is genuinely hard to keep up with the momentum these days!


Hi, I just had a glance over the DataFusion project. Very interesting work, which I will definitely be keeping track of, but I've got a genuine question: do you sometimes find development in Rust a little bit challenging for large-scale, performance-sensitive work?

I say this because I've noticed more than a few PRs fixing (large) performance regressions which, to my understanding, were mostly introduced by unforeseen or unexpected Rust compiler subtleties leading to less-than-optimal code generation. One example was a naive, simple-looking abstraction that brought performance down by something like 50% on the TPC-H benchmarks. This really struck me, especially because such regressions seem quite hard to root-cause, and I would like to hear about your first-hand experience. Thanks a bunch!


Your initial experiments and decision to build on arrow-rs has been great for the project. Thank you and everyone involved.


> We're heavily using it to build Seafowl, an analytical database that's optimized for running SQL queries directly from the user's browser...

Interesting. Where does seafowl fit in when I compare it with, say, data-stack-in-a-box approach, for ex: meltano + dbt + duckdb + superset [0]? Is my thinking right that seafowl possibly replaces both duckdb (with IOx) and superset (if there's a web front-end)?

Incidentally, dagster had an article up just yesterday making a case for poor-man's datalake with dbt + dagster + duckdb [1]. What does splitgraph replace if I were to use it in a similar setup?

Thanks.

[0] https://archive.is/DxU1e

[1] https://archive.is/5ikU4


Great question! With Seafowl, the idea is different from what the modern data stack addresses. It's trying to simplify public-facing Web-based visualizations: apps that need to run analytical queries on large datasets and can be accessed by users all around the world. This is why we made the query API easily cacheable by CDNs and Seafowl itself easy to deploy at the edge, e.g. with Fly.io.

It's a fairly different use case from DuckDB (query execution for Web applications vs fast embedded analytical database for notebooks) and the rest of the modern data stack (which mostly is about analytics internal to a company). Just to clarify, we're not related to IOx directly (only via us both using Apache DataFusion).

If we had to place Seafowl _inside_ of the modern data stack, it'd be mostly a warehouse, but one that is optimized for being queried from the Internet, rather than by a limited set of internal users. Or, a potential use case could be extracting internal data from your warehouse to Seafowl in order to build public applications that use it.

We don't currently ship a Web front-end and so can't serve as a replacement to Superset: it's exposed to the developer as an HTTP API that can be queried directly from the end user's Web browser. But we have some ideas around a frontend component: some kind of a middleware, where the Web app can pre-declare the queries it will need to run at build time and we can compute some pre-aggregations to speed those up at runtime. Currently we recommend querying it with Observable [0] for an end-to-end query + visualization experience (or use a different viz library like d3/Vega).

Re: the second question about Splitgraph for a data lake, the intention behind Splitgraph is to orchestrate all those tools and there the use case is indeed the modern data stack in a box. It's kind of similar to dbt Labs's Sinter [1] which was supposed to be the end-to-end data platform before they focused on dbt and dbt Cloud instead: being able to run Airbyte ingestion, dbt transformations, be a data warehouse (using PostgreSQL and a columnar store extension), let users organize and discover data at the same time. There's a lot of baggage in Splitgraph though, as we moved through a few iterations of the product (first Git/Docker for data, then a platform for the modern data stack). Currently we're thinking about how to best integrate Splitgraph and Seafowl in order to build a managed pay-as-you-go Seafowl, kind of like Fauna [2] for analytics.

Hope this helps!

[0] https://observablehq.com/@seafowl/interactive-visualization-...

[1] https://www.getdbt.com/blog/whats-in-a-name/

[2] https://fauna.com/


We managed to get it working for analytical SQL queries [1] for our database. It's kind of questionable whether this is an intended use of Cloudflare, but it feels like it, since we use it for query results, which are essentially a static asset in our use case (dashboards and visualizations).

[1] https://www.splitgraph.com/blog/seafowl-sql-cdn

