Use of facets, analyzers, tokenizers

ssabat2000 · June 22, 2012, 2:19pm

Hi,

We'd like to present a list of words from our indexed documents (basically articles from RSS feed descriptions). This list would show the words that occur most often across all the documents.
I'm not sure how to do it. I tried a facet and got such a list, but it was polluted with all the "small" words you wouldn't want to see. My language is French, but I guess it's the same in every language: I'd like to keep only relevant words and get rid of 'the', 'them', 'of' and so on.
Does it have something to do with the way we indexed our documents? I didn't use any specific analyzer or tokenizer. Does that mean the default behaviour is to index every word for search, even the stop words? And if it was wrong from the start, is there any workaround to produce such a list? It would be a drag to have to re-index everything right now.
I've read a lot on this forum, but I'm still very confused about analyzers and tokenizers.
Any help on this matter would be much appreciated.

Sébastien

ssabat2000 · June 24, 2012, 5:59pm

Still, any help would be much appreciated. I'm really struggling with this.

Sébastien
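
For readers landing on this thread, here is a minimal sketch of the filtering asked about above: the most frequent words with stop words left out. It is not a confirmed answer from the thread. It assumes the terms facet accepts an `exclude` list, which would filter at query time without re-indexing. The field name `description`, the facet name `top_words`, and the very short French stop-word list are hypothetical placeholders. The `top_terms` helper only reproduces the idea locally in plain Python so the effect can be checked without a cluster.

```python
import json
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real French list is much longer.
FRENCH_STOP_WORDS = ["le", "la", "les", "de", "des", "du", "un", "une", "et", "en"]


def terms_facet_body(field, stop_words, size=20):
    """Build a search body asking for the most frequent terms of `field`.

    Assumption: the terms facet supports an `exclude` list of terms to skip.
    """
    return {
        "query": {"match_all": {}},
        "size": 0,  # only the facet counts are wanted, not the hits
        "facets": {
            "top_words": {
                "terms": {"field": field, "size": size, "exclude": list(stop_words)}
            }
        },
    }


def top_terms(texts, stop_words, size=20):
    """Local illustration: count lowercase word tokens, dropping stop words."""
    stop = set(stop_words)
    counts = Counter(
        token
        for text in texts
        for token in re.findall(r"\w+", text.lower())
        if token not in stop
    )
    return counts.most_common(size)


if __name__ == "__main__":
    # "description" is a placeholder field name, not taken from a real mapping.
    print(json.dumps(terms_facet_body("description", FRENCH_STOP_WORDS), indent=2))
    print(top_terms(["Le chat et le chien", "Le chat de la voisine"], FRENCH_STOP_WORDS, 3))
```

Excluding terms at query time leaves the index untouched, so the stop words stay searchable. Removing them at index time instead, for example with an analyzer that has a stop filter, would mean re-indexing, which is the cost the first post hopes to avoid.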
