Getting list of terms ranked by their document frequency?

I'm using Elasticsearch to index images using the bag-of-visual-words approach (the words are stored in a "visual_words" field of each image document). Unfortunately, the inverted index doesn't seem to speed up my searches much. In my test I indexed 25,000 images, and a typical query returns more than 24,000 results. I'm guessing this is because a few of my dictionary words are present in almost all documents. Is there anything else that would cause Elasticsearch to return more than 24,000 results out of 25,000? Also, is there a way to get a list of terms ordered by their document frequency, so I can check whether my hypothesis is right and whether this affects only a few words or many of them?

Thanks!
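
One possible way to list terms by document frequency is a terms aggregation on the field, since each bucket's doc_count is the number of documents containing that term and buckets come back ordered by doc_count, highest first, by default. The sketch below only builds the request body and reads a response. It assumes "visual_words" is mapped as a keyword field (or otherwise aggregatable), and the aggregation name "term_doc_freq" and the helper names are made up for illustration. On a multi-shard index the counts can be approximate.

```python
def build_doc_freq_query(field, size=100):
    """Build a _search body that asks for the `size` most frequent terms.

    "term_doc_freq" is an arbitrary aggregation name chosen for this sketch.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "term_doc_freq": {
                "terms": {"field": field, "size": size}
            }
        },
    }


def ranked_terms(response):
    """Extract (term, doc_count) pairs from a terms-aggregation response."""
    buckets = response["aggregations"]["term_doc_freq"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# The body would be sent as JSON to the index's _search endpoint.
body = build_doc_freq_query("visual_words", size=50)
```

Comparing each doc_count with the total number of documents (25000 in my test) should show whether only a few words or most of them appear in nearly every image.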

So I did some math and the issue really is the size of my dictionary:

Here's my math (in Python): https://gist.github.com/0c8589a67fcaaa32e4ccdbde1bfd6d3d

It might be wrong, but it seems to match the results I'm getting experimentally.
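
For anyone who can't open the gist: what follows is not the linked code, only a minimal sketch of the kind of estimate involved. It assumes, purely for illustration, that each document and each query holds a fixed number of distinct words drawn uniformly at random from the dictionary. The function and parameter names are hypothetical.

```python
def overlap_probability(dictionary_size, doc_words, query_words):
    """Probability that a query and a document share at least one word.

    Assumption (not from the gist): both hold distinct words drawn
    uniformly at random from a dictionary of `dictionary_size` words.
    """
    # Pigeonhole: if the two sets cannot both fit, they must overlap.
    if doc_words + query_words > dictionary_size:
        return 1.0
    # Draw the query words one by one; each must miss the document's words.
    p_no_overlap = 1.0
    for i in range(query_words):
        p_no_overlap *= (dictionary_size - doc_words - i) / (dictionary_size - i)
    return 1.0 - p_no_overlap


def expected_hits(num_docs, dictionary_size, doc_words, query_words):
    """Expected number of documents matching on at least one word."""
    return num_docs * overlap_probability(dictionary_size, doc_words, query_words)
```

Under this model, a small dictionary relative to the number of words per image pushes the overlap probability toward 1, so almost every document matches. Real visual-word frequencies are not uniform, so this is only a rough guide.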
