I would like to get a term frequency report for a relatively large index.
Here is the background. I have defined something I call a grouping, which is simply a result set. Say my index has a million documents; each grouping would then contain around 4,000 to 5,000 documents. Within such a result set, I would like to mine the interesting keywords and perhaps build a report from them for analysis.
I am still in the exploration phase, so I would like to see the most commonly used terms and their total term frequency (TTF), not just for single words but for sequences of 1, 2, and 3 words. An example of a 3-word sequence is "Advanced Encryption Standards". I will very likely encounter noise among the 1-word items, but my assumption is that I can filter it out by defining stopwords.
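To make the goal concrete, here is a minimal sketch in plain Python of the counting I have in mind, run over the text of the documents in one grouping. It is only an illustration and not tied to Solr or ES: the function names, the tokenizer regex, and the stopword list are all my own assumptions, and it would not scale to the whole index as written.

```python
import re
from collections import Counter

# Hypothetical stopword list, only for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(docs, max_n=3, stopwords=STOPWORDS):
    """Count every 1..max_n word sequence across all docs.

    Stopwords are dropped only for 1-word items, since that is
    where I expect most of the noise.
    """
    counts = Counter()
    for text in docs:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if n == 1 and gram[0] in stopwords:
                    continue
                counts[" ".join(gram)] += 1
    return counts


# Example: a tiny stand-in for one result set.
docs = [
    "Advanced Encryption Standards are widely used.",
    "The Advanced Encryption Standards document",
]
report = ngram_counts(docs)
top_terms = report.most_common(10)
```

What I am looking for is the equivalent of `ngram_counts` computed by the search engine itself over a result set, rather than by pulling every document out and counting client-side.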
I looked at Term Vectors, but that is not what I want: it focuses on a single document rather than on a result set (or the entire index). Also, I don't have any input keywords, since my objective is to discover them.
I have experience with Solr and ES, but this problem is relatively new to me. I went through various documentation but could not narrow it down (maybe I did not spend enough time!). Can someone please point me to the right place to look?
Any pointers are greatly appreciated!