Getting a list of terms ranked by their document frequency?

I'm using Elasticsearch to index images using the bag of visual words approach (the words are stored in a "visual_words" field in the image documents). Unfortunately, the inverted index doesn't seem to be speeding up my searches much. In my test, I indexed 25000 images, and a typical query returns >24000 results. I'm guessing this is because a few of my dictionary words are present in almost all documents. Is there anything else that would cause Elasticsearch to return >24000 results out of 25000? Also, is there a way to get a list of terms ordered by their document frequency, so I can check whether my hypothesis is right and whether this is the case for only a few words or for lots of them?
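One way to check this is a terms aggregation on the field, since its buckets come back ordered by doc_count (the number of documents containing each term) in descending order by default. Below is a minimal sketch, assuming "visual_words" is mapped as a keyword field (or has fielddata enabled); the helper names, the aggregation name "top_terms", and the URL in the usage note are hypothetical.

```python
import json
from urllib import request


def build_doc_freq_query(field="visual_words", size=50):
    """Build a search body with a terms aggregation on `field`.

    Buckets are returned ordered by doc_count, descending, by default.
    """
    return {
        "size": 0,  # skip the hits, only the aggregation is needed
        "aggs": {
            "top_terms": {  # hypothetical aggregation name
                "terms": {"field": field, "size": size}
            }
        },
    }


def terms_by_doc_freq(response):
    """Extract (term, doc_count) pairs from a search response dict."""
    buckets = response["aggregations"]["top_terms"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


def fetch(url, body):
    """POST the JSON body to an Elasticsearch _search endpoint."""
    req = request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Usage would look something like `terms_by_doc_freq(fetch("http://localhost:9200/images/_search", build_doc_freq_query()))`, where the host and the "images" index name are placeholders. Note that doc_count values from a terms aggregation can be approximate when the index has several shards.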


So I did some math and the issue really is the size of my dictionary:

Here's my math (in Python):

It might be wrong but seems to match the result I'm getting experimentally.
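For anyone wanting to reproduce this kind of estimate, here is a rough sketch (not the snippet from the post). It assumes a simplified model in which every image contains the same number of distinct visual words, drawn uniformly at random from the dictionary; real visual-word distributions are skewed, so treat it as a ballpark only. All function and parameter names are hypothetical.

```python
from math import comb


def match_probability(dict_size, words_per_doc, words_per_query):
    """Probability that a document shares at least one word with a query.

    Assumes the document's words are a uniform random subset of the
    dictionary. The document misses the query entirely only if all of
    its words fall among the dict_size - words_per_query other words.
    """
    no_overlap = comb(dict_size - words_per_query, words_per_doc)
    total = comb(dict_size, words_per_doc)
    return 1 - no_overlap / total


def expected_hits(num_docs, dict_size, words_per_doc, words_per_query):
    """Expected number of documents matching an OR query on the query words."""
    return num_docs * match_probability(dict_size, words_per_doc, words_per_query)
```

Under this model, the expected hit count approaches the full collection size once the number of words per image is a sizable fraction of the dictionary, which is consistent with a small dictionary being the culprit.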
