neuspo uses machine learning and graph networks to summarize content, aggregate similar text into events and run classifiers to keep only relevant, fact-driven content.
The Translation pipeline translates text between languages. It supports 100+ languages with built-in automatic source language detection. The pipeline detects the language of each input text row, loads a model for the source-target combination and translates the text to the target language.
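A minimal sketch of calling the pipeline (assumes the txtai pipeline extras are installed; the sample inputs are just illustrative):

    from txtai.pipeline import Translation

    # Instantiate the pipeline; translation models are loaded per source-target pair as needed
    translate = Translation()

    # Source language is detected automatically, only the target language is passed
    print(translate("This is a sentence to translate", "fr"))

    # Lists are also accepted, each row is detected and translated independently
    print(translate(["Esto es una prueba", "C'est un test"], "en"))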
One reason is that the most popular models were developed using either the TensorFlow or PyTorch Python APIs. The pre-trained models took an immense amount of compute resources to build. Additionally, those who built the models weren't necessarily developers and Python is a low barrier-to-entry language.
There are a number of models that are now available via APIs and can be used from any language.
> The pre-trained models took an immense amount of compute resources to build.
Oh definitely, but nobody is serving models from the same machine + process that they used to train them, right? And solutions like ONNX exist (although TF and PyTorch’s support is inconsistent at best).
> Additionally, those who built the models weren't necessarily developers and Python is a low barrier-to-entry language.
It just feels like an engineering anti-pattern to build “down” to this level, instead of skilling people up or standardising on a common model serialisation and serving format. Model serving tools exist, and they’re often written in faster/more optimised languages, so at that point, why bother with Python after actual model training at all?
True, if a team doesn't want to use Python, the way the models were trained shouldn't be the reason to use it. ONNX is a good option; txtai has a notebook that shows how to export models for use in Rust/JavaScript/Java - https://github.com/neuml/txtai/blob/master/examples/18_Expor...
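As a hedged sketch of what that export looks like (following the notebook above; the model name is just an example), txtai's HFOnnx pipeline converts a Hugging Face model into a single ONNX file:

    from txtai.pipeline import HFOnnx

    onnx = HFOnnx()

    # Export a text classification model to model.onnx; quantize to shrink the file.
    # The exported file can then be loaded by ONNX runtimes for Rust/JavaScript/Java.
    onnx("distilbert-base-uncased-finetuned-sst-2-english",
         task="text-classification", output="model.onnx", quantize=True)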
Seems like a lot of tooling is being created in languages besides Python; it may just take some time to get there.
At a high level, txtai uses a similar clause embedded in SQL statements. For example, "SELECT id, text, score FROM txtai WHERE similar('feel good story') AND text LIKE '%good%'". This statement is parsed and the similar clause runs against the approximate nearest neighbor index. The result ids are then loaded into a temporary table and the SQL statement is dynamically rewritten to change the similar clause into "id IN <temporary table>".
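Roughly, with content storage enabled, that looks like this (a sketch with example data, not the exact setup behind neuspo):

    from txtai.embeddings import Embeddings

    # content=True stores text in a relational database alongside the ANN index
    embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True})

    data = ["US tops 5 million confirmed virus cases",
            "Maine man wins $1M from $25 lottery ticket, says he feels good",
            "Make huge profits without work, earn up to $100,000 a day"]

    # Index as (id, text, tags) tuples
    embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

    # similar() runs against the ANN index, LIKE is evaluated by the database
    results = embeddings.search(
        "SELECT id, text, score FROM txtai WHERE similar('feel good story') AND text LIKE '%good%'")
    print(results)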
txtai uses transformers to transform data (text, images, audio) into embeddings. Those embeddings are then loaded into an approximate nearest neighbor index for search. On top of that, content is loaded into a relational database to support SQL-based filtering. It's trying to get the best of both worlds: vector/similarity search alongside standard structured search using SQL syntax.
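The vector search half of that, at its simplest, looks something like this (example model and data, not a production setup):

    from txtai.embeddings import Embeddings

    # Transformer model used to vectorize text into embeddings
    embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

    data = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed"]

    # Embeddings are computed and loaded into an approximate nearest neighbor index
    embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

    # Returns (id, score) tuples ranked by semantic similarity
    print(embeddings.search("climate change", 1))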
Components can be split up; for example, there could be a server that vectorizes text into embeddings and another server that hosts the indexes.
There is also a pipeline and workflow framework (https://neuml.github.io/txtai/workflow/). This component has modules that assist with splitting data, transforming, summarizing, translating and parsing tabular content. Workflows can be used purely for transformations or as a driver to load data.
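For example, a small workflow that chains two pipelines together might look like this (a sketch; the pipelines and input text are just illustrative):

    from txtai.pipeline import Summary, Translation
    from txtai.workflow import Task, Workflow

    # Chain two pipelines: summarize each document, then translate the summary to French
    summary = Summary()
    translate = Translation()

    workflow = Workflow([
        Task(lambda rows: summary(rows)),
        Task(lambda rows: translate(rows, "fr"))
    ])

    # Workflows stream batches of elements through each task in order
    for output in workflow(["txtai executes machine-learning workflows to transform data and build semantic search applications."]):
        print(output)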
I've primarily focused on single query times, which have averaged around 5ms - 25ms depending on the size of the index. Queries can be batched so query times wouldn't increase linearly. You could batch quite a few and still get subsecond response times.
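For example, batching with batchsearch rather than looping over single searches (a rough sketch; timings will vary by hardware, model and index size):

    import time
    from txtai.embeddings import Embeddings

    embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
    embeddings.index([(uid, text, None) for uid, text in enumerate(["first document", "second document"])])

    queries = ["feel good story", "climate change", "public health", "sports"] * 25

    # Single call; query vectorization and ANN lookups are batched internally
    start = time.time()
    results = embeddings.batchsearch(queries, 1)
    print(f"{len(results)} queries in {time.time() - start:.3f}s")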
All three libraries are approximate nearest neighbor indexes, but I know at least for Faiss that it can be configured to effectively run exact queries.
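Illustrating that with Faiss directly (outside of txtai): a flat index is an exhaustive, exact search, while an IVF index is approximate unless every cluster is probed.

    import numpy as np
    import faiss

    d = 384  # embedding dimensions
    xb = np.random.random((10000, d)).astype("float32")  # indexed vectors
    xq = np.random.random((5, d)).astype("float32")      # query vectors

    # Flat index: exhaustive search, exact results
    exact = faiss.IndexFlatIP(d)
    exact.add(xb)
    scores, ids = exact.search(xq, 3)

    # IVF index: approximate, accuracy depends on how many clusters (nprobe) are scanned
    quantizer = faiss.IndexFlatIP(d)
    approx = faiss.IndexIVFFlat(quantizer, d, 100, faiss.METRIC_INNER_PRODUCT)
    approx.train(xb)
    approx.add(xb)
    approx.nprobe = 100  # probing all 100 clusters makes the search effectively exhaustive
    scores, ids = approx.search(xq, 3)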
Read more: https://medium.com/neuml/neuspo-d42a6e33031
neuspo is powered by txtai (https://github.com/neuml/txtai)