I'm using Elasticsearch to index images with the bag-of-visual-words approach (the words are stored in a "visual_words" field in each image document). Unfortunately, the inverted index doesn't seem to speed up my searches much. In my test, I indexed 25000 images, and a typical query returns >24000 results. My guess is that a few of my dictionary words are present in almost all documents. Is there anything else that would cause Elasticsearch to return >24000 results out of 25000? Also, is there a way to get a list of terms ordered by their document frequency, so I can check whether my hypothesis is right and whether this applies to only a few words or to many of them?
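To make the hypothesis concrete, here is a toy sketch in plain Python. It does not call Elasticsearch at all; the data and the helper names `document_frequencies` and `match_any` are made up for illustration. It assumes a query matches a document as soon as any one query word matches (OR semantics), which as far as I understand is the default behaviour of a match query.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for words in docs.values():
        df.update(set(words))  # set() so a term counts once per document
    return df

def match_any(docs, query_words):
    """Return ids of documents sharing at least one word with the query (OR semantics)."""
    query = set(query_words)
    return sorted(doc_id for doc_id, words in docs.items() if query & set(words))

# Made-up toy data: the visual word "w1" appears in every image.
docs = {
    "img1": ["w1", "w7", "w7"],
    "img2": ["w1", "w3"],
    "img3": ["w1", "w9"],
    "img4": ["w1", "w3", "w5"],
}

df = document_frequencies(docs)
# Terms ordered by document frequency, highest first (ties broken alphabetically).
ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)

# A query containing the ubiquitous word matches every document ...
print(match_any(docs, ["w1", "w9"]))
# ... while the same query without it matches only one.
print(match_any(docs, ["w9"]))
```

If my real index behaves like this, a single very common visual word in the query would be enough to pull in nearly all 25000 images, which is why I'd like to see the actual document frequencies.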
Thanks!