Hacker News | new | past | comments | ask | show | jobs | submit | pauldix's comments | login

I believe you could do this effectively with COBS (COmpact Bit-Sliced Signature index): https://panthema.net/2019/1008-COBS-A-Compact-Bit-Sliced-Sig...

It's a pretty neat algorithm from a 2019 paper, built for the application "to index k-mers of DNA samples or q-grams from text documents". You take a collection of bloom filters, one built per document, and combine them into a single index that tells you which documents a query term maps to. Like an inverted index meets a bloom filter.
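To make the idea concrete, here's a minimal sketch of the bit-sliced signature scheme (my illustrative reconstruction of the concept, not the paper's actual implementation, which uses different sizing and compression): each document gets a fixed-size Bloom filter, the filters are stored column-wise as bit slices, and a lookup ANDs just k slices to get candidate documents.

```python
import hashlib

M = 64  # bits per document's Bloom filter (toy size)
K = 3   # hash functions

def positions(term):
    # K bit positions for a term via double hashing (illustrative choice)
    h = hashlib.sha256(term.encode()).digest()
    h1 = int.from_bytes(h[:8], "big")
    h2 = int.from_bytes(h[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]

def build_index(docs):
    # Column-wise storage: slices[p] packs bit p of every document's filter,
    # so document d's bit is (slices[p] >> d) & 1.
    slices = [0] * M
    for d, terms in enumerate(docs):
        for term in terms:
            for p in positions(term):
                slices[p] |= 1 << d
    return slices

def query(slices, term, n_docs):
    # AND the K relevant slices; remaining set bits are candidate documents
    # (false positives possible, false negatives not).
    acc = (1 << n_docs) - 1
    for p in positions(term):
        acc &= slices[p]
    return [d for d in range(n_docs) if acc >> d & 1]

docs = [{"apple", "banana"}, {"banana", "cherry"}, {"durian"}]
idx = build_index(docs)
print(query(idx, "banana", len(docs)))  # includes docs 0 and 1
```

The appeal for a database use case is that query cost scales with k (the number of slices touched), not with the number of documents.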

I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).

There's also code online here: https://github.com/bingmann/cobs


I've been following this team's work for a while and what they're doing is super interesting. The file format they created and contributed to the Linux Foundation, Vortex, is a very welcome innovation in the space: https://github.com/vortex-data/vortex

I'm excited to start doing some experimentation with Vortex to see how it can improve our products.

Great stuff, congrats to Will and team!


https://vortex.dev doesn't work in my Firefox:

Application error: a client-side exception has occurred while loading vortex.dev (see the browser console for more information).

Console: unable to create webgl context


Presumably you don't have WebGL enabled or supported - the main page is just a cute 3D landing page.

You may be interested in https://github.com/vortex-data/vortex which of course has an overview and links to their docs and benchmark pages.


Works for me. Mozilla/5.0 (X11; Linux x86_64; rv:142.0) Gecko/20100101 Firefox/142.0


If anyone ever writes a post about why that error keeps happening in browsers that should support it, I'd be incredibly grateful. I keep seeing it in our (unrelated to OP's company) Sentry logs with zero chance to reproduce it.


Handful of causes:

+ No hardware acceleration enabled.

+ Multiple graphics cards, and browser can't decide which to use.

+ Race conditions that can, rarely, mount a 3D context onto a canvas that already has a 2D context (often happens with Unity).


Privacy plugins which disable WebGL (fingerprinting)


I assume it's just people who do not have a graphics card


InfluxDB Founder & CTO here. We worked hard to support InfluxQL in 3.x, and it supports the v1 write API. Admittedly, moving will require a migration and we haven't yet built the tooling; we felt it was important to get the 3.0 release out first. Our plan is to have that available later this year.

The 2.x to 3.x move is, admittedly, much harder. This is because of the language Flux. We haven't been able to bring that over to 3.x in a way that makes it useful. We actually built a bridge for it in our cloud offering, but our experience is that the performance isn't good enough to be acceptable for customers wanting to upgrade. If they want to make the move, adopting SQL or InfluxQL is likely the only path.

We'll continue to develop 3.x and we'll build more migration tooling over time. I think we can build specialized tooling to help Flux users migrate over to 3.x with query translation tools, but there are more features we need to land in 3.x to enable that first.

We're committed to the technology stack (Apache Arrow & DataFusion) and the 3.x line. We have no plans for another major release. I'll be happy if we end up releasing 3.56.2 8 years from now.


My experience so far with Opus 4 is that it's very good. Based on a few days of using it for real work, I think it's better than Sonnet 3.5 or 3.7, which had been my daily drivers prior to Gemini 2.5 Pro switching me over just 3 weeks ago. It has solved some things that eluded Gemini 2.5 Pro.

Right now I'm swapping between Gemini and Opus depending on the task. Gemini's 1M token context window is really unbeatable.

But the quality of what Opus 4 produces is really good.

edit: forgot to mention that this is all for Rust based work on InfluxDB 3, a fairly large and complex codebase. YMMV


I've been having really good results from Jules, Google's Gemini-based agent coding platform[1]. In the beta you only get 5 tasks a day, but so far I have found it to be much more capable than regular API Gemini.

[1]https://jules.google/


Would you mind giving a little more info on what you're getting Jules to work on? I tried it out a couple times but I think I was asking for too large a task and it ended up being pretty bad, all things considered.

I tried to get it to add some new REST endpoints that follow the same pattern as the other 100 we have, 5 CRUD endpoints. It failed pretty badly, which may just be an indictment on our codebase...


I let Jules write a PR in my codebase with very specific scaffolding, and it absolutely blew it. It took me more time to understand the ways it failed to grasp the codebase and wrote code for a fundamentally different (incorrectly understood) project. I love Gemini 2.5, but I absolutely agree with the gp (pauldix) on their quality / scope point.


> Gemini's 1M token context window is really unbeatable.

How does that work in practice? Swallowing a full 1M context window would take in the order of minutes, no? Is it possible to do this for, say, an entire codebase and then cache the results?


In my experience with Gemini, it definitely does not take a few minutes. I think that's a big difference between Claude and Gemini. I don't know exactly what Google is doing under the hood there; I don't think it's just quantization, but it's definitely much faster than Claude.

Caching a code base is tricky: whenever you modify it you invalidate parts of the cache, and because generation is conditioned on everything that came before, any changed token invalidates the cached state for every token after it.
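As a tiny illustration of that invalidation behavior (my sketch of how autoregressive prefix caching works in general, not any provider's actual implementation), only the tokens before the first edit remain reusable:

```python
# An autoregressive cache is valid only up to the first changed token;
# everything after an edit must be recomputed.
def reusable_prefix(cached_tokens, new_tokens):
    """Length of the cached prefix that survives an edit."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

old = ["def", "f", "(", "x", ")", ":", "return", "x"]
new = ["def", "f", "(", "x", ",", "y", ")", ":", "return", "x"]
print(reusable_prefix(old, new))  # 4: the edit at token 4 invalidates the rest
```

This is why an edit near the top of a large pasted codebase is far more expensive to re-serve than one near the bottom.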


Right now this is just in the AI Studio web UI. I have a few command line/scripts to put together a file or two and drop those in. So far I've put in about 450k of stuff there and then over a very long conversation and iterations on a bunch of things built up another 350k of tokens into that window.

Then start over again to clean things out. It's not flawless, but it is surprising what it'll remember from a while back in the conversation.

I've been meaning to pick up some of the more automated tooling and editors, but for the phase of the project I'm in right now, it's unnecessary and the web UI or the Claude app are good enough for what I'm doing.


I’m curious about this as well, especially since all coding assistants I’ve used truncate long before 1M tokens.


We're very excited about this release, over 4 years in the making. Over that time we adopted, contributed to, and helped lead parts of what we're calling the FDAP stack: Apache Arrow Flight, DataFusion, Arrow, and Parquet.

We wrote and contributed the Rust object store crate used in this stack and by many others to the ASF.

This release is based on a "diskless" architecture that uses object storage for all durability. With DataFusion, it has a columnar, vectorized, standards-compliant SQL query engine. We also built support for InfluxQL on top of it.

The other big thing we brought in is an embedded Python VM using PyO3 and Python Build Standalone. This makes it possible to do data collection, ETL, monitoring, alerting, and all kinds of tasks inside the database at the point of collection.
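To give a flavor of the kind of in-database task the embedded Python VM enables, here is a purely illustrative sketch; the function name, row shape, and calling convention below are hypothetical, not InfluxDB 3's actual plugin interface:

```python
# Hypothetical monitoring hook: inspect freshly written rows at the point
# of collection and flag any that cross an alert threshold.
def process_write_batch(rows):
    """rows: list of dicts for one write batch. Returns alert records."""
    alerts = []
    for row in rows:
        # Flag overheating hosts (threshold is an arbitrary example value)
        if row.get("temp_c", 0) > 90:
            alerts.append({"host": row.get("host"), "temp_c": row["temp_c"]})
    return alerts

batch = [{"host": "a", "temp_c": 71.0}, {"host": "b", "temp_c": 95.5}]
print(process_write_batch(batch))  # [{'host': 'b', 'temp_c': 95.5}]
```

The point is that filtering, enrichment, and alerting logic can run next to the data as it arrives, instead of in a separate ETL service.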

Happy to answer any questions about the big project, what's next or anything time series related.


Blog post author and InfluxDB creator and CTO here. Happy to answer any questions here or provide more technical detail.


Our intention with InfluxDB Core is that it's useful to a large audience, just not the group of people seeking a historical TSDB. It's a collector, a processor, and a recent-data TSDB. If you're familiar with the TICK stack from our 1.x line, it's like Telegraf (the data collector), Kapacitor (the processor and monitoring agent), and an InfluxDB that is better on the most recent data.

The InfluxDB part of it is more narrowly scoped than previous versions, but the Telegraf and Kapacitor parts are much more feature rich than those previous products.


I talk a little bit more about this in a comment on a different submission of this post: https://news.ycombinator.com/item?id=42704526

Can you say more about your use case?


Post author, cofounder and creator of InfluxDB here. Happy to answer questions in this thread.

I'm guessing there will be questions about the 72 hour limit. There are two things we're looking at:

First, we're considering a free tier for at-home and hobbyist usage of Enterprise, which doesn't have this limitation. This would be kind of like what Tailscale does, giving a free usage plan for their commercial software.

Second, for Core, the open source build, we're working on an update that will let it query any 72 hour window of historical data. Right now it doesn't evict data, it all still exists on disk or object storage as Parquet files, but we remove the metadata information from RAM to keep things optimized for the most recent 72 hours.

When the update is done, you'll be able to write and query for any period of time. But an individual query will be limited to a 72 hour time range. This is a service protection mechanism because of how the data is organized.

A file gets created for every 10 minute block of time for each table. So 72 hours is 432 files, which is a lot of GET requests to S3 for a single query. We don't want to increase the range because of that. Multiple queries combining a longer range, or accessing the data from third-party clients is all still possible.
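The arithmetic behind that limit can be sketched in a few lines (illustrative only; the 10-minute block size is from the comment above, and real query cost would also depend on file sizes and request parallelism):

```python
# One Parquet file per 10-minute block per table, so the number of S3 GET
# requests for a query grows linearly with the query's time range.
FILE_WINDOW_MIN = 10

def files_for_query(hours: float) -> int:
    """Files (and thus roughly GET requests) a single-table query touches."""
    return int(hours * 60 // FILE_WINDOW_MIN)

print(files_for_query(72))  # 432 files for the maximum 72-hour window
```

A week-long query would touch over a thousand files per table, which is why the range cap acts as a service-protection mechanism until compaction collapses files into larger blocks.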

In Enterprise, our commercial product, we have a compactor that collapses these files into larger time blocks that also creates an index that the query engine can use.

Doing it this way was a deliberate choice so that we could have a permissively licensed open source project separate from the commercial product. If we put the compactor into the open, we'd have to put it under a source-available license to limit usage so that we can still sell the database.

Our hope is that there's still an audience of users that will find Core useful on its own, even without any commercial relationship with us. It's not a full historical TSDB, but it's not intended to be. It's meant to be a recent data engine that can collect, process, monitor, ship, and store data paired with a fast analytical query engine against the recent buffer (or recently persisted buffer).

Happy to answer any followup questions about this or the release generally.


I haven't used influxdb in a project yet, but I'm a fan of its capabilities!

The core-enterprise dichotomy seems more or less the same as what scylla had until recently. Does influxdata have different considerations from scylla that will allow influxdb to remain open source in the long term?


We're open core and have been since 2016. We've deliberately limited the scope of what the open source project is supposed to do: it should be great at the use case of collecting, processing, storing, and querying recently buffered data.

The commercial offering is the historical time series DB along with a bunch of other features around high availability, read replication, fine grained security, and the compaction engine which enables longer range queries and row level deletes.

I think Scylla had most of their DB in the open and then a small slice of Enterprise functionality (although I'm not super familiar with their product line).

Ideally, we'd have many open source users and even our commercial customers would use the open source in addition to the commercial offering.

But ultimately, it's about finding a sustainable business model that keeps more software coming. We have a preference for permissive open source over source available. In my view, we may as well do freemium rather than source available.

With this version of InfluxDB, we've been able to invest heavily into Apache projects that lie at the core of it: Arrow, DataFusion, Parquet, and the object store crate, which we developed and donated to the ASF.

We'd like to continue that work because we think that a highly performant, modular, vectorized query engine (i.e. DataFusion) should be a free commodity that's widely available and widely contributed to.


It's a curious way to differentiate between the open source and paid versions, but I guess you have to pick something.

The 72 hour thing is new with 3.0, right? What were the main differences in the 2.0 version between open source and paid versions?


2.0 was single server. Our paid offering of that is a usage based cloud platform that’s highly available and managed.


What is the minimum resident RAM size per individual active unique series? Or what's a typical RSS RAM size for 10 or 100 million unique active series? How does unlimited cardinality avoid RAM exhaustion in this version?


Core doesn't index the metadata so it uses less RAM for higher cardinality data. However, if you have 100M series and you're writing to all of them at the same time, you're going to need some amount of RAM just to buffer it all up and then ship it off to storage as Parquet.

The Enterprise product has a compactor that creates indexes as it goes, but those indexes are lighter weight than those in v1 and v2. Also, users can specify which columns they want to appear in those indexes, so they can leave out high cardinality ones if they want to save on RAM.

In v3 you can brute force the query against high cardinality data, unlike v1 & v2, which would eat up a ton of RAM to do so.


Excellent! Keep up the great work.


We think that Core will fill some of the use cases of previous OSS versions of InfluxDB, but not all. We also expect that Core will be useful in many more places where previous OSS versions of InfluxDB were not.

So Core isn't intended to be a full historical TSDB. It's more like a data collector, processing engine, data shipper and recent data buffer/DB.

For a full historical TSDB, that's the product we sell. Keeping the two separate gives us the ability to have real open source vs. combining them and requiring a different license that lets us do freemium.

We'll likely have a freemium tier for the commercial product (Enterprise), but that's separate from the open source project.

