We are currently using a bag-of-features approach to index millions of images. The idea is to translate each image into a bag of feature tokens; there can be a few hundred tokens in each bag. We map each feature token to a unique integer, so every image ends up translated into a string like '1 3 5 45 ... 565 ... 9176'. These are all fixed-length strings of 300 integers, and each integer ranges from 1 to 10000.
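For concreteness, a minimal sketch of the encoding step described above (the function name, toy vocabulary, and token names are hypothetical, not from our actual pipeline):

```python
def encode_image(feature_tokens, vocab):
    """Map an image's feature tokens to their integer IDs (1-10000)
    and join them into the space-separated string we index."""
    ids = sorted(vocab[tok] for tok in feature_tokens)
    return " ".join(str(i) for i in ids)

# Toy 3-token vocabulary; the real one has 10000 entries and real
# documents contain 300 tokens, not 3.
vocab = {"edge_17": 1, "corner_a": 3, "blob_9": 5}
print(encode_image(["blob_9", "edge_17", "corner_a"], vocab))  # → 1 3 5
```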
We now want to take one such string and retrieve the other strings that are most similar to it, where similar means sharing the most integers in common.
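In other words, the score we care about is plain set overlap between two documents, something like:

```python
def overlap(doc_a, doc_b):
    """Number of integers two document strings have in common --
    our notion of similarity for this problem."""
    return len(set(doc_a.split()) & set(doc_b.split()))

print(overlap("1 3 5 45", "3 45 9176"))  # → 2 (shared: 3 and 45)
```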
Our current index has about 50 million documents, each one a fixed-length string as described above. We are currently just doing dumb default tokenization while indexing, which gives us a search latency of about 5 to 6 seconds. How can we do a better job and get this latency under a second?
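To show what "dumb default tokenization" means in practice, here is roughly what our query path looks like. We have not named the engine in this post; the sketch below assumes an Elasticsearch-style match query (the field name `features` and the request shape are illustrative), where all 300 integer tokens get OR'd together against 50 million documents:

```python
# Truncated example query document; real ones contain 300 integers.
query_doc = "1 3 5 45 565 9176"

# Hypothetical Elasticsearch-style request body: the query string is
# whitespace-tokenized and every term becomes an OR clause, which is
# part of why the default setup is slow at our scale.
request = {
    "query": {
        "match": {
            "features": {
                "query": query_doc,
                # "minimum_should_match": "30%",  # knob we have not tuned yet
            }
        }
    },
    "size": 10,
}

print(len(query_doc.split()))  # → 6 terms OR'd together in this toy example
```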