Show HN: Cartwright – Automating detection of geographic and temporal features (github.com/jataware)
49 points by brandonmrose on Nov 18, 2022 | 9 comments
Cartwright is a data profiler that identifies and categorizes spatial and temporal features. Cartwright uses deep learning, natural language processing, and a variety of heuristics to determine whether a column in a dataset contains spatial or temporal information and, if so, what is specifically contained.

Cartwright was built to automate complex data pipelines for heterogeneous climate and geopolitical data that are generally oriented around geospatial and temporal features (think maps and time series). The challenge that Cartwright solves is automatically detecting those features so they can be parsed and normalized. This problem turns out to be quite tricky, but Cartwright makes it simple.
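
To give a feel for the "variety of heuristics" part, here is a minimal sketch of value-based detection for a single column. This is not Cartwright's code or API: the function names, the 95% threshold, and the list of date formats are all assumptions made for illustration, and the deep learning side is not shown.

```python
from datetime import datetime

# Hypothetical candidate formats; a real profiler would try many more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def looks_like_latitude(values, threshold=0.95):
    """Return True if nearly all values parse as floats within [-90, 90]."""
    if not values:
        return False
    hits = 0
    for v in values:
        try:
            x = float(v)
        except (TypeError, ValueError):
            continue  # non-numeric cell counts as a miss
        if -90.0 <= x <= 90.0:
            hits += 1
    return hits / len(values) >= threshold


def detect_date_format(values, threshold=0.95):
    """Return the first candidate format that parses nearly all values, else None."""
    if not values:
        return None
    for fmt in DATE_FORMATS:
        hits = 0
        for v in values:
            try:
                datetime.strptime(str(v), fmt)
                hits += 1
            except ValueError:
                pass  # value does not match this format
        if hits / len(values) >= threshold:
            return fmt
    return None
```

A range check like this cannot tell latitude from any other small number (a column of temperatures passes too), which is part of why the problem is tricky and why heuristics alone are not enough.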



This is nice!

I believe we are just ripe for a new semantic web cycle. This is what comes after AI has had a long enough summer, right?

> Cartwright can easily detect things like country, day, latitude

Can it do so in a normalized way? It'd be great to have it produce an "official type" like this one:

https://www.wikidata.org/wiki/Q34027

Then one can imagine a graph-based feature augmentation with other datasets, and learn a SPARQL query. Linking the lat-long pair in a real-estate dataset to geographical features[1] might give an edge in predicting the local price map.

[1] from the SPARQL wikidata examples: https://query.wikidata.org/#%23Museums%20in%20Brittany%0ASEL...
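
As a rough sketch of the linking idea, the snippet below builds a Wikidata SPARQL query for items located near a lat-long pair. It only constructs the query text and does not send it. The function name and its defaults are made up for illustration; the query relies on the Wikidata Query Service's `wikibase:around` service and the coordinate location property `P625`.

```python
def nearby_features_query(lat, lon, radius_km=1.0, limit=50):
    """Build a SPARQL query for Wikidata items within radius_km of (lat, lon)."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    # WKT points are written longitude first, then latitude.
    return f"""
SELECT ?place ?placeLabel ?location WHERE {{
  SERVICE wikibase:around {{
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:center "Point({lon} {lat})"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "{radius_km}" .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
""".strip()
```

The resulting text can be pasted into the query service at query.wikidata.org, and the returned items could then be joined back onto the rows of the original dataset as extra features.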


That's actually a great idea. We use schemas for the categories it detects but could tie them to Wikidata entries, which would be far more generalizable. Will look into this!

We're also aiming to explore using large language models to support/improve this.


Yay! When they add the datasets for tundra melting and unaccounted methane emissions, deep learning can finally figure it out with quality datasets :|

https://www.scientificamerican.com/article/melting-tundra-re...

2=1+1 in 2022 :)

Thanks data!


That's definitely the goal! Look at arbitrary datasets and try to understand what's in them. It's particularly important for climate data where aligning different datasets geospatially and temporally can be a challenge.


We already know why it happened. No amount of compute can answer the obvious.

It's easy to adapt using basic science :)

No word yet on the offset the Nord Stream blowout caused, in addition to the tundra melts that occurred <air quotes> naturally </air quotes>.


Seems to identify very specific types of fields. Are there any other libraries to detect other types of fields in a dataset?


There are more general ways of doing something like this with LLMs, and we are looking into LLMs to improve our results. Cartwright is very specific, but it is easy to add new categories to detect if you have particular ones in mind. See the contribution doc here: https://github.com/jataware/cartwright/blob/main/docs/contri...


Do you use the column title as input too?


Yes, the column title is used as well. It is not used in the same way as the column values, but it helps with categorization. We have a list of column titles that we categorize regardless of the values, and the results indicate when a category was assigned this way.
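
A minimal sketch of that two-path idea, for anyone curious. This is not Cartwright's actual code or output format: the header list, the function names, and the `matched_on` field are all hypothetical, and the value detector is a stand-in.

```python
# Hypothetical list of titles that are categorized regardless of the values.
KNOWN_HEADERS = {
    "lat": "latitude",
    "latitude": "latitude",
    "lon": "longitude",
    "longitude": "longitude",
    "country": "country",
    "date": "date",
}


def detect_from_values(values):
    """Stand-in value detector: flags columns of four-digit years, else None."""
    if values and all(str(v).isdigit() and len(str(v)) == 4 for v in values):
        return "year"
    return None


def classify_column(name, values, value_detector=detect_from_values):
    """Categorize a column and record whether the title or the values decided it."""
    key = name.strip().lower()
    if key in KNOWN_HEADERS:
        # Title match wins without inspecting the values.
        return {"category": KNOWN_HEADERS[key], "matched_on": "header"}
    guess = value_detector(values)
    if guess is not None:
        return {"category": guess, "matched_on": "values"}
    return {"category": None, "matched_on": None}
```

Keeping `matched_on` in the result lets a downstream step treat header-only matches with more caution, since the values were never checked.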



