Compound Word Token Filter Configuration File


(Marcos Cazulo) #1

Hi,

I am trying to switch from an inline word_list to a separate file containing the list of words for the compound word token filter.
Right now I have it working by defining the filter as follows:

"compound_word_splitter": {
  "type": "dictionary_decompounder",
  "min_word_size": 4,
  "min_subword_size": 3,
  "word_list": ["icecream", "smokehouse", "car"]
}

I would like to keep the word list in a separate file, as the docs describe:

"compound_word_splitter": {
  "type": "dictionary_decompounder",
  "min_word_size": 4,
  "min_subword_size": 3,
  "word_list_path": "analysis/theWords.txt"
}

I tried the following two formats for the word list file, and the filter does not work with either:

"icecream","smokehouse","car"
icecream,smokehouse,car

Does anyone have any idea what I am missing so that the list of words is recognized?
I have placed analysis/theWords.txt relative to the config file.

Thank you.


(Marcos Cazulo) #2

Well, after experimenting, it turns out that the words in the word list file need to be delimited by newlines (one word per line), similar to how it is specified for stop words.
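
For anyone who hits the same problem, here is a minimal sketch of how one of the comma-separated lines I tried could be converted into the one-word-per-line layout. The helper name to_newline_delimited is my own and is not part of Elasticsearch; the sketch assumes the words themselves contain no commas or double quotes.

```python
def to_newline_delimited(raw: str) -> str:
    """Turn a comma-separated word list (optionally double-quoted)
    into text with one word per line."""
    # Split on commas, then drop surrounding whitespace and double quotes.
    words = [w.strip().strip('"') for w in raw.split(",")]
    # Skip empty entries and end the text with a trailing newline.
    return "\n".join(w for w in words if w) + "\n"


if __name__ == "__main__":
    # Prints each word on its own line; this output is what the file
    # referenced by word_list_path would contain.
    print(to_newline_delimited('"icecream","smokehouse","car"'), end="")
```

With the words from my example, the resulting analysis/theWords.txt contains icecream, smokehouse, and car, each on its own line.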

It would be great if this detail were also mentioned in the documentation for the compound word token filter.


(Mark Walkom) #3

Thanks for the suggestion, I've raised https://github.com/elastic/elasticsearch/issues/13595 to get that fixed.

