Efficient retrieval of similar vectors or long strings

Cygorger · March 10, 2017, 7:45am

We are currently using a bag-of-features approach to index millions of images. The idea is to translate each image into a bag of feature tokens; there can be hundreds of tokens in a bag. We map these feature tokens to unique integers, so each image ends up being translated into a string, something like '1 3 5 45 ... 565 .. 9176'. These are all fixed-length strings of 300 integers, and the integers range from 1 to 10000.
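To make the encoding concrete, here is a minimal sketch in Python. The helper names (`build_vocabulary`, `encode_image_tokens`) and the example token names are hypothetical, and sorting the IDs in ascending order is an assumption based on the example string above, not something the pipeline is known to do.

```python
def build_vocabulary(all_tokens):
    """Assign each distinct feature token a unique integer ID, starting at 1."""
    vocab = {}
    for token in all_tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
    return vocab


def encode_image_tokens(tokens, vocab):
    """Turn a bag of feature tokens into a space-separated string of IDs.

    Sorting is an assumption made for readability; the overlap between two
    documents does not depend on the order of the IDs.
    """
    ids = sorted(vocab[t] for t in tokens)
    return " ".join(str(i) for i in ids)


# Hypothetical feature tokens, for illustration only.
vocab = build_vocabulary(["edge_a", "blob_b", "corner_c", "edge_d"])
doc = encode_image_tokens(["corner_c", "edge_a"], vocab)
```

In the real setup each document would hold 300 such IDs drawn from the range 1 to 10000.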

We now want to take one of these strings as a query and retrieve other strings that are very similar to it. In this case, similar means the strings that have the most integers in common with the query.
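The similarity described here can be sketched as a count of shared integers. This is only an illustration of the scoring, assuming each document is treated as a set (so repeated integers count once); the function names are hypothetical and this is not how the search engine scores matches internally.

```python
def parse_ids(doc):
    """Parse a space-separated string of integers into a set of IDs."""
    return set(int(x) for x in doc.split())


def overlap(query, candidate):
    """Number of distinct integers the two documents have in common."""
    return len(parse_ids(query) & parse_ids(candidate))


def rank_by_overlap(query, docs, top_k=10):
    """Return the top_k (score, doc) pairs, highest overlap first."""
    scored = [(overlap(query, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


docs = ["1 2 3", "7 8 9", "1 3 5"]
ranked = rank_by_overlap("1 3 5 45", docs, top_k=2)
```

A linear scan like this only pins down what "most integers in common" means; it visits every document and is not a way to meet a sub-second target on tens of millions of documents.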

Our current index has about 50 million documents, each of which is a fixed-length string as described above. We are currently just doing dumb default tokenization at index time. This gives us a search latency of about 5 to 6 seconds. How can we do a better job and reduce this latency to under a second?

