We'd like to present a list of words from our indexed documents (basically articles taken from RSS feed descriptions). This list would show the words that occur most often across all the documents.
I'm not sure how to do it. I tried a facet and did get such a list, but it was polluted with all the "small" words you wouldn't want to see. My documents are in French, but I guess it's the same in every language: I'd like to keep only the relevant words and get rid of 'the', 'them', 'of' and so on...
Does it have something to do with the way we indexed our documents? I didn't use any specific analyzer or tokenizer. Does that mean the default behaviour is that every word is indexed for search, even the stop words? And if it was wrong from the start, is there any workaround to produce such a list? It would be a drag to have to re-index everything right now.
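To make the kind of workaround I have in mind concrete: I could filter the facet output on the client side, roughly as in the sketch below. This is only an illustration. The function name `top_relevant_terms`, the short stop-word list and the term counts are all made up for the example, and it assumes the facet result has already been turned into a plain term-to-count mapping.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only;
# a usable French list would be much longer.
STOP_WORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "à"}


def top_relevant_terms(term_counts, stop_words=STOP_WORDS, limit=10):
    """Return the most frequent (term, count) pairs, skipping stop words."""
    filtered = Counter({
        term: count
        for term, count in term_counts.items()
        if term.lower() not in stop_words
    })
    # most_common sorts by count, highest first.
    return filtered.most_common(limit)


# Made-up counts standing in for what a facet might return.
facet_counts = {
    "de": 950,
    "la": 870,
    "et": 640,
    "élection": 120,
    "économie": 95,
    "football": 60,
}

print(top_relevant_terms(facet_counts, limit=3))
# [('élection', 120), ('économie', 95), ('football', 60)]
```

The drawback is that the stop words take up the top slots of the facet, so I would have to request many more entries than I actually want to display. I'd rather exclude them properly if that is possible.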
I've read a lot on this forum, but I'm still very confused about analyzers and tokenizers.
Any help on this matter would be much appreciated.