Show HN: Cartwright – Automating detection of geographic and temporal features

BenoitP · on Nov 18, 2022

This is nice!

I believe we are just ripe for a new semantic web cycle. This is what comes after AI has had a long enough summer, right?

> Cartwright can easily detect things like country, day, latitude

Can it do so in a normalized way? It'd be great to have it produce an "official type" like this one:

https://www.wikidata.org/wiki/Q34027

Then one can imagine graph-based feature augmentation with other datasets, and learning a SPARQL query. Linking the lat-long pair in a real-estate dataset to geographical features[1] might give an edge in predicting the local price map.

[1] From the Wikidata SPARQL examples: https://query.wikidata.org/#%23Museums%20in%20Brittany%0ASEL...
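
To make the lat-long linking idea concrete, here is a minimal sketch that builds a Wikidata SPARQL query for items near a point. The function name, the default radius, and the sample coordinates are assumptions for illustration; nothing here comes from Cartwright. It only constructs the query text, which could then be pasted into query.wikidata.org.

```python
def nearby_features_query(lat: float, lon: float,
                          radius_km: float = 1.0, limit: int = 50) -> str:
    """Build a SPARQL query for Wikidata items with coordinates near a point.

    Hypothetical helper for illustration; not part of Cartwright.
    """
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("lat/lon out of range")
    # WKT points are written longitude first, then latitude.
    return f"""
SELECT ?place ?placeLabel ?location ?dist WHERE {{
  SERVICE wikibase:around {{
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:center "Point({lon} {lat})"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "{radius_km}" .
    bd:serviceParam wikibase:distance ?dist .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
ORDER BY ?dist
LIMIT {limit}
""".strip()


# Sample coordinates chosen arbitrarily for the example.
query = nearby_features_query(48.11, -1.68)
print(query)
```

The returned rows (nearby items and their distances) could then be turned into features such as "distance to the nearest museum" for each row of the real-estate dataset.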

brandonmrose · on Nov 18, 2022

That's actually a great idea. We use schemas for the categories it detects, but could tie them to Wikidata entries, which would be far more generalizable. Will look into this!

We're also aiming to explore using large language models to support/improve this.
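
A rough sketch of what tying categories to Wikidata entries could look like. Everything here is hypothetical: the `TypedColumn` class, the `annotate` function, and the mapping are illustrative names, not Cartwright's schema. The one QID shown is the one linked earlier in the thread, and pairing it with the "latitude" category is an assumption to verify.

```python
from dataclasses import dataclass
from typing import Dict, Optional

WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"

# Hypothetical mapping from a detected category to a Wikidata QID.
# Q34027 is the ID linked in the thread; its pairing with "latitude"
# is an assumption and should be checked before use.
CATEGORY_TO_QID: Dict[str, str] = {"latitude": "Q34027"}


@dataclass(frozen=True)
class TypedColumn:
    name: str
    category: str
    wikidata_uri: Optional[str]  # None when no mapping is known


def annotate(name: str, category: str,
             mapping: Dict[str, str] = CATEGORY_TO_QID) -> TypedColumn:
    """Attach a Wikidata entity URI to a detected column category."""
    qid = mapping.get(category)
    uri = WIKIDATA_ENTITY_PREFIX + qid if qid else None
    return TypedColumn(name, category, uri)


print(annotate("lat", "latitude"))
print(annotate("notes", "free_text"))
```

Keeping the mapping as plain data means new categories can be linked without touching the detection logic.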

_jvqe · on Nov 18, 2022

Yay! When they add the datasets for tundra melting and unaccounted methane emissions, deep-learning can finally figure it out with quality datasets :|

https://www.scientificamerican.com/article/melting-tundra-re...

2=1+1 in 2022 :)

Thanks data!

brandonmrose · on Nov 18, 2022

That's definitely the goal! Look at arbitrary datasets and try to understand what's in them. It's particularly important for climate data where aligning different datasets geospatially and temporally can be a challenge.

_jvqe · on Nov 18, 2022

We already know why it happened. No amount of compute can answer the obvious.

It's easy to adapt using basic science :)

No word yet on the offset the Nord Stream blowout caused, in addition to the tundra melts that occurred <air quotes> naturally </air quotes>.

antman · on Nov 18, 2022

Seems to identify very specific types of fields. Are there any other libraries to detect other types of fields in a dataset?

marshHawk4 · on Nov 18, 2022

There are more general ways of doing something like this with LLMs, and we are looking into LLMs to improve our results. Cartwright is very specific, but it is easy to add new categories to detect if you have particular ones in mind. See the contribution doc here: https://github.com/jataware/cartwright/blob/main/docs/contri...

melony · on Nov 18, 2022

Do you use the column title as input too?

marshHawk4 · on Nov 18, 2022

Yes, the column title is used as well. It is not used in the same way as the column values, but it helps with categorization. We have a list of column titles that we categorize regardless of the values, and the results indicate when a category came from the title rather than from the values.
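
As a rough illustration of that two-path idea (title match first, value inspection otherwise, with the result recording which path fired), here is a minimal sketch. It is not Cartwright's implementation or API: the header list, the `categorize` function, and the simple latitude range check are all assumptions made for the example.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical list of column titles that are categorized regardless of values.
HEADER_HINTS = {
    "lat": "latitude",
    "latitude": "latitude",
    "lon": "longitude",
    "lng": "longitude",
    "longitude": "longitude",
    "country": "country",
}


def looks_like_latitude(values: Iterable[object]) -> bool:
    """Very rough value check: every entry parses as a number in [-90, 90]."""
    try:
        nums = [float(v) for v in values]
    except (TypeError, ValueError):
        return False
    return bool(nums) and all(-90 <= n <= 90 for n in nums)


def categorize(header: str, values: Iterable[object]) -> Optional[Tuple[str, str]]:
    """Return (category, source), where source is 'header' or 'values'."""
    key = header.strip().lower()
    if key in HEADER_HINTS:
        return HEADER_HINTS[key], "header"
    if looks_like_latitude(values):
        return "latitude", "values"
    return None


print(categorize("Lat", ["n/a", "48.1"]))
print(categorize("col_3", ["48.1", "-12.5"]))
print(categorize("notes", ["hello"]))
```

Returning the source alongside the category lets downstream code treat a title-only match with more caution than one backed by the values.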