Using ES for bag of "visual" words image search?

Disclaimer: Newbie here, forgive me if I'm butchering some terms/concepts.

Hi all,

I am currently working on a system for visual similarity search and am considering using ES to do the indexing and search. A popular method for visual search is the bag of "visual" words approach. The idea is to map the problem of visual search onto text search so that standard text search techniques (inverted index, tf-idf, etc.) apply. Without going into details, I currently have a function that takes an image and outputs a bag of visual words. As a fictional example:

[word1: 10, word20: 3, word32: 4, word200: 11]

It can be thought of as a sparse vector or a frequency histogram of words. The number of possible words (the dictionary) is fixed (e.g. 1000). It's pretty much the same internal representation that text search engines use, except that the words come from a made-up language (word1, word2, word3, etc.!).

To search for an image, you build its bag of visual words on the fly, e.g.:

[word1: 8, word10: 2, word32: 4, word200: 14]

You then compare it through a similarity measure (e.g. tf-idf + cosine similarity) with each of the indexed vectors. Matching images are those that reach a certain similarity threshold (usually chosen empirically). The search can be sped up with an inverted index, just as with text search, so that the ranking is only done on a set of candidate images. One notable difference between text and image search is that image queries are quite large: a query typically contains as many "visual" words as an indexed image does.
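For concreteness, here is a minimal sketch of that comparison step in plain Python (raw term-frequency cosine only; a real implementation would weight the counts by idf first):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse bags, e.g. {"word1": 8, "word32": 4}."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = {"word1": 8, "word10": 2, "word32": 4, "word200": 14}
indexed = {"word1": 10, "word20": 3, "word32": 4, "word200": 11}
print(cosine_similarity(query, indexed))  # ~0.95 for these two bags
```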

Now, as far as I know, I can't pass a "bag of words" directly to Elasticsearch for indexing, but I can pass it a document which ES will transform into a bag of words internally. Since what I have is already a bag of words, I could synthesize a document that I know will map to the same bag of words inside ES, e.g.:

[word1: 8, word10: 2, word32: 4, word200: 14] would become the following document:

word1 word1 word1 word1 word1 word1 word1 word1
word10 word10 
word32 word32 word32 word32
word200 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200 word200

And I would just ask ES to index that document. I suppose I could do the same with queries.
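A minimal sketch of that round-trip, assuming the official Python client (the index, type, and field names are made up):

```python
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch()

def bag_to_document(bag):
    """Expand {"word1": 8, "word10": 2, ...} into "word1 word1 ... word10 word10"."""
    return " ".join(" ".join([word] * count) for word, count in bag.items())

bag = {"word1": 8, "word10": 2, "word32": 4, "word200": 14}
es.index(index="images", doc_type="image",  # doc_type only applies to older ES versions
         body={"visual_words": bag_to_document(bag)})
```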

Now, this strategy might work, but:

  • I would need to make sure ES isn't doing anything unexpected in the analyzer/tokenizer (a mapping sketch addressing several of these points follows this list).
  • I would want to minimize the number of bytes sent over the wire (e.g. drop the "word" prefix!).
  • Would it be possible to pass my bag-of-words representation directly and have a custom tokenizer on the ES side that parses it and outputs the right number of tokens? E.g. insert the document as [1: 8, 10: 2] and have a custom ES tokenizer output the following token stream: 1 1 1 1 1 1 1 1 10 10.
  • Is there a way to make sure that ES doesn't store the raw document?
  • There is typically no order to the visual words... is there a way to disable word positions/offsets to save some space?
  • Is it possible to replace the ranking function with something really simple, so that the score becomes more predictable and it is easier to set a threshold? (e.g. no field-length norm)
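As a concrete target, something like the following mapping sketch might cover the analyzer, storage, positions, and norms points at once (a guess, not tested; exact syntax varies by ES version, e.g. "string" on 2.x vs. "text" on 5.x):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index="images", body={
    "mappings": {
        "image": {
            "_source": {"enabled": False},     # don't store the raw document
            "properties": {
                "visual_words": {
                    "type": "text",            # "string" on ES 2.x
                    "analyzer": "whitespace",  # no stemming/lowercasing surprises
                    "index_options": "freqs",  # keep term frequencies, drop positions/offsets
                    "norms": False,            # no field-length normalization
                    # a per-field "similarity" could also be set here to simplify scoring
                }
            }
        }
    }
})
```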

Other thoughts?

Hi Oli,

I don't have a great "production ready" answer, but you might be interested in this talk I've delivered several times. In it I give this demo, which tokenizes images by position and RGB to find similar images. It's all rather naive and intended more for educational purposes.

Sujit Pal, however, came to one of my talks and wrote a series of blog articles on image similarity/search, including this one and this one.

I personally don't mind hacking term frequency to mean some arbitrary feature strength. Creating dumb fields like "blue blue blue blue" to mean "blue" with strength "4" can work, with all the usual caveats (disable _source, field storage, analysis, etc.). Others may think I'm crazy 🙂


I wonder if a custom TokenFilter would be useful in these cases: the input could be of the form 'blue_4' and the output could be four 'blue' tokens.

Yes, that's what I was thinking of in the 3rd bullet point... It could potentially help save some bandwidth, but I don't know how hard it is to write a custom tokenizer.

Edit: Oh, it looks like this is already possible with delimited_payload_filter. See this example: http://sujitpal.blogspot.com/2016/05/elasticsearch-based-image-search-using.html
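If I read the docs right, the analyzer side would look something like the sketch below (untested; with the filter's default "|" delimiter, an input like "1|8 10|2" should produce tokens "1" and "10" carrying payloads 8.0 and 2.0):

```python
# Request body for index creation (settings only; names are made up).
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "payload_analyzer": {
                    "tokenizer": "whitespace",
                    # splits each token on the delimiter and stores the trailing
                    # number as a payload (defaults: delimiter "|", encoding "float")
                    "filter": ["delimited_payload_filter"],
                }
            }
        }
    }
}
```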

Edit2: If I understand correctly, the above is only useful when implementing a custom scoring function that uses payloads, so it's not exactly what I had in mind. @Mark_Harwood, would this require a pull request on Elasticsearch or could it be implemented as a plugin?

A plugin should work. I had a quick search for custom token filter examples and found this project, which should be a useful starting point: https://github.com/francesconero/elasticsearch-concatenate-token-filter

