Using ES for visual search

Hi all, I am pretty new to Elasticsearch and wondering how I can index a set of arrays (each array contains 128 float numbers); then with a new array, how can I find it's nearest neighbor with ES and if such how do I define the distance function?

Long story if you have time, what I am trying to do is to use ES for image retrieval purpose using the bag-of-word model. Let's say I have about 1M images and extract SIFT feature from each image (average 1000 features for each image), from there I build a dictionary of "visual words" (using K-mean clustering) of, say, 2M words. Each "visual word" is an array of 128 float values.

So the 1st task is to index the whole dictionary into Elastic Search. So far what I think I could do is to represent each "visual word" with a JSON like this:

{images: [list of images that the word appears - will explain later], features: [array of 128 floats]} 

in which the features field could be a straight array like: features: [1.2, 0.1456, ...], or nested key-value like features: [{f1: 1.2}, {f2: 0.1456}, ...]

The next task is for each of my 1M images, each of which comes with a set of SIFT features, I need to "quantize" each of the image's features to a "visual word" in the dictionary (finding nearest neighbor). After that is done, I imagine that the "images" field of each "visual word" in my index will be updated like this: For word 1, images:[img10, img35, img278], for word 2, images: [img12345, img9765], etc. (you know, like an index in a book).

The query task is similar, after querying each feature of a new image (i.e., finding the nearest neighbor in ES), I need to rank the best match images from my 1M images.

If you have read to this point, may be you could tell me if what I propose to do is feasible with Elasticsearch. Any advice is very much appreciated. Thanks

This article my colleague wrote is Solr-centric. But the same ideas could
apply to Elasticsearch. Not a full answer, but thought I'd share

1 Like

If you want a visual search solution without using ES, check out JustVisual - Their API will allow you to upload your 1M images and then conduct a visual search against your index

1 Like

Thanks, still not what I am looking for. I want to do it myself with Elasticsearch. I'm wondering if my approach is wrong, that the nearest neighbor search should be done outside of ES

Theoretically this is possible. I'm trying to understand your problem though. Are you saying there are 2M words per image? If this is the case, then a word is definitely NOT a token.

But let's say you describe an image as a set of features and you don't care about the feature strength. You just decide yes or no whether an image has that feature. Then absolutely this is possible. You can turn ANYTHING into tokens. This is actually what Chapter 4 of our book Relevant Search at great length.

Now but what this will get you is a cosine similarity. You can see an extremely naive image search here which indexes features of images. In this case just RGB values of a specific sector. If you search with a different image, to find similar images, the result is a giant OR query over the pixels of the image. The OR query scores via summation. It's like a big vector dot product with several fudge factors. The images with the most matches in each dimension get the highest score.
Note above That's a pretty niave implementation. Here's a smarter one.

So is this nearest neighbor? You're finding the image with the most number of similar features across all the dimensions of the image you're querying. It's not quite nearest neighbor. Yet with Lucene-based search, you can plug just about anything. You can modify scoring to your hearts content. You could script Elasticsearch or even implement your own Lucene scorers to rank anyway you see fit.

Anyway, I don't fully understand your problem. But maybe this helps. The point is that search can be a highly distributed feature similarity system. The key is expressing your features as tokens. You can do that by fudging them like I do in my dumb image search. Or it can be done through manipulating the analysis process.

1 Like

Sorry for my lousy explanation. Let me try that again

Basically I need to index, say, 1M (million) images. Each images is described by a number of features, each feature is a vector of 128 float numbers. Different images could have different number of features, but they could share some features. For example, image 1 might have 500 features, image 2 has 600 features, and there're 200 features in image 1 that are also in image 2. Thus, even though each image has more than 500 features, the total number of features collected from 1M images is not 500M but only, say, 2M. So each image is like a book and each feature is like a word.

Now, given a query image, I can also describe it by a number of features, each feature is also a vector of 128 float numbers, except that each feature might not be exactly the same as one of the 2M features mentioned above. Thus, when I search for each (new) feature from my index in ES, I don't search for the exact feature but the nearest neighbor instead.

My question is: is that nearest neighbor search feasible with Elasticsearch, and if such, how do I define the distance between two features (two vectors) in the query. What I mean is how to write the query to represent my distance, not the formula of the distance (could be just use L2 norm).