Using a dictionary in es tokenization for filtering?

ApproximateIdentity · April 21, 2016, 4:24pm

I'm asking about functionality similar to what is documented here:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hunspell-tokenfilter.html

My question is: is it possible to use a dictionary, such as hunspell or a custom one, to filter out tokens, for example invalid English words (similar to the Python nltk library's nltk.is_english_word(word) method)? Even though the link I posted refers to a "filter", it doesn't seem to filter in the way I understand the term. Instead it does stemming, and it leaves in words that aren't in the dictionary.

Thanks for any help.

ApproximateIdentity · April 21, 2016, 7:33pm

I'll update for anyone who happens upon this question. Elasticsearch has a token filter for this, the keep words filter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keep-words-tokenfilter.html

You can just use any set of words (say, from open source dictionaries online) and put them in a file, one word per line. Then you point the filter's keep_words_path setting at that file and you're cooking with gas.
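For anyone who wants something concrete, here is a rough sketch of what the index settings could look like, written as a Python dict that gets dumped to JSON. The filter name english_words_only, the analyzer name dictionary_filtered, and the path analysis/english_words.txt are made up for illustration; only the "keep" filter type and its keep_words_path parameter come from the docs linked above. The small keep_only_known helper is not Elasticsearch code, just a local imitation of the behavior to show what "filtering" means here.

```python
import json


def build_keep_words_settings(words_path="analysis/english_words.txt"):
    """Build an index-settings body that uses the `keep` token filter.

    `words_path` is a hypothetical file location; the word list file
    itself (one word per line) has to exist on the Elasticsearch nodes.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; "keep" is the filter type.
                    "english_words_only": {
                        "type": "keep",
                        "keep_words_path": words_path,
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name. Lowercasing first means the
                    # word list only needs lowercase entries.
                    "dictionary_filtered": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "english_words_only"],
                    }
                },
            }
        }
    }


def keep_only_known(tokens, dictionary):
    """Local imitation of the filter: drop tokens not in the dictionary."""
    return [token for token in tokens if token in dictionary]


if __name__ == "__main__":
    # This JSON is what you would send as the body when creating the index.
    print(json.dumps(build_keep_words_settings(), indent=2))
    print(keep_only_known(["the", "qwzx", "cat"], {"the", "cat", "dog"}))
```

Unlike the hunspell filter, which stems the words it knows and passes everything else through, this drops any token that is not in the list.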
