Disclaimer: Newbie here, forgive me if I'm butchering some terms/concepts.
I am currently working on a system for doing visual similarity search and am considering using ES to do the search and indexing. A popular method for visual search is the bag of "visual" words approach. The idea is to map the problem of visual search to text search so that text search techniques suddenly apply (inverted index, tf-idf, etc.). Without going into details, I currently have a function that takes an image and outputs a bag of visual words. As a fictional example:
[word1: 10, word20: 3, word32: 4, word200: 11]
It can be thought of as a sparse vector or a frequency histogram of words. The number of possible words (aka the dictionary) is fixed (e.g. 1000). It's pretty much the same internal representation that text search engines use, except that the words come from a made-up language (word1, word2, word3, etc.!).
In order to search an image, you build its bag of visual words on the fly, e.g.:
[word1: 8, word10: 2, word32: 4, word200: 14]
And compare it through a similarity measure (e.g. tf-idf + cosine similarity) with each of the indexed vectors. Matching images are those that reach a certain similarity threshold (usually chosen empirically). The search can be sped up with an inverted index, same as with text search, so that the ranking is only done on a set of candidate images, etc. One notable difference between text and image search is that with image search, queries are quite large: a query typically contains as many "visual" words as an indexed image does.
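For concreteness, here's a toy sketch of that comparison in Python (my own code, not anything from ES; idf weighting is left out to keep it short):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {word: count} vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

indexed = {"word1": 10, "word20": 3, "word32": 4, "word200": 11}
query = {"word1": 8, "word10": 2, "word32": 4, "word200": 14}

# An image matches if its similarity to the query exceeds an empirical threshold.
score = cosine_similarity(query, indexed)
```

In practice each raw count would be replaced by its tf-idf weight before taking the cosine, but the thresholding logic stays the same.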
Now, as far as I know, I can't pass a "bag of words" directly for Elasticsearch to index, but I can pass it a document that ES will transform into a bag of words internally. Since what I have is already a bag of words, I could synthesize a document that I know will map to the same bag of words internally in ES, e.g.:
[word1: 8, word10: 2, word32: 4, word200: 14] would become the following document:
word1 word1 word1 word1 word1 word1 word1 word1 word10 word10 word32 word32 word32 word32 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200
And I would just ask ES to index that document. I suppose I could do the same with queries.
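The synthesis step itself is trivial; a quick sketch (`bag_to_document` is my own name):

```python
def bag_to_document(bag):
    """Expand a {word: count} bag into a whitespace-separated pseudo-document."""
    return " ".join(word for word, count in bag.items() for _ in range(count))

doc = bag_to_document({"word1": 8, "word10": 2, "word32": 4, "word200": 14})
```

Since there's no meaningful order to the visual words, the order in which the tokens are emitted shouldn't matter.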
Now, this strategy might work but:
- I would need to make sure ES isn't doing unexpected stuff with the analyzer/tokenizer.
- I would need to minimize the number of bytes used (e.g. drop the "word" prefix!)
- Would it be possible to pass my bag of words representation directly and have a custom tokenizer on ES that parses it and outputs the right number of tokens? E.g. insert the document as
[1: 8, 10: 2] and have a custom ES tokenizer output the following token stream:
1 1 1 1 1 1 1 1 10 10.
- Is there a way to make sure that ES doesn't store the raw document?
- There is typically no order to the visual words... is there a way to disable word positions/offsets to save some space?
- Is it possible to modify the ranking function so that it only does something really simple, so that the score becomes more predictable and it becomes easier to set a threshold? (e.g. no field length norm)
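Regarding the custom tokenizer idea above: my understanding is that an actual ES tokenizer would have to be written as a Java plugin, but the parsing logic itself is simple. A Python sketch of what it would emit (`parse_bag` is my own name):

```python
import re

def parse_bag(text):
    """Turn '[1: 8, 10: 2]' into the token stream a custom tokenizer would emit."""
    tokens = []
    for word, count in re.findall(r"(\w+)\s*:\s*(\d+)", text):
        tokens.extend([word] * int(count))
    return tokens
```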
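On the last three points (no raw document, no positions/offsets, simpler scoring), my understanding — to be double-checked — is that the index mapping can address them directly. A sketch, shown here as a Python dict; `visual_words` is a field name I made up:

```python
# Sketch of an index mapping for the last three points (assumes a reasonably
# recent Elasticsearch; "visual_words" is my own field name, not something ES defines).
mapping = {
    "mappings": {
        "_source": {"enabled": False},  # don't store the raw synthesized document
        "properties": {
            "visual_words": {
                "type": "text",
                "index_options": "freqs",  # keep term frequencies, drop positions/offsets
                "norms": False,            # disable the field-length norm in scoring
            }
        },
    }
}
```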