Can an additional English stemmer interfere with languages in other scripts?


(Dong Hyun Kim) #1

Hello everyone,

I am currently testing the indexing of multi-language documents by combining the basic Elasticsearch analysis components. I want to add an English stemmer to the Korean, Japanese, and Chinese analyzers so that searches for English terms return broader results. (English words are very likely to appear in every document.)
Has anyone used a combination of [tokenizer and token filters for a non-English script] + [English stemmer]?
I ran some tests and found no side effects. The scripts seem clearly distinguishable because they sit in different Unicode blocks.
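
As a rough illustration of why an English stemmer should leave Korean, Japanese, and Chinese tokens alone, here is a minimal Python sketch. The `toy_english_stem` function and its three suffix rules are hypothetical stand-ins, not the algorithm behind the `english` stemmer; the only property it is meant to share is that its rules match Latin letters, so tokens from other Unicode blocks pass through unchanged.

```python
import unicodedata

# Hypothetical, heavily simplified suffix rules (assumption for illustration);
# the real "english" stemmer uses a much larger rule set.
SUFFIXES = ("ing", "ed", "s")

def toy_english_stem(token: str) -> str:
    """Strip one ASCII suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def script_of(token: str) -> set:
    """First word of each character's Unicode name, e.g. LATIN, HANGUL, CJK."""
    return {unicodedata.name(ch).split()[0] for ch in token}

for tok in ["searching", "indexed", "검색", "検索", "搜索"]:
    print(tok, sorted(script_of(tok)), "->", toy_english_stem(tok))
```

Tokens that mix scripts (for example a Latin word with a CJK suffix attached) are not covered by this sketch and would depend on how the tokenizer splits them.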

Please share your experiences.

Thank you.

The settings I am testing look like this:

"korean_english": {
  "filter": [
    "trim",
    "arirang_filter", (custom opensource)
    "decompounder",
    "delimiter",
    "lowercase",
    "english_stop",
    "english_stemmer"
  ],
  "tokenizer": "arirang_tokenizer" (custom opensource)
},

"japanese_filter": {
  "type": "custom",
  "tokenizer": "kuromoji_tokenizer", (custom opensource)
  "filter": [
    "kuromoji_baseform",
    "kuromoji_part_of_speech",
    "cjk_width",
    "stop",
    "english_stop",
    "delimiter",
    "kuromoji_stemmer",
    "english_stemmer",
    "lowercase"
  ]
},

"delimiter": {
  "type": "word_delimiter",
  "catenate_all": true,
  "type_table_path": "delimiterType.json",
  "type_table": true,
  "split_on_numerics": false
},

"english_stemmer": {
  "type": "stemmer",
  "language": "english"
},
"english_stop": {
  "type": "stop",
  "stopwords": "_english_"
},
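
One thing that may be worth checking in these settings is filter order. Token filters run in the order listed, and as far as I understand both the `_english_` stop list (case-sensitive unless `ignore_case` is set) and the English stemmer expect lowercase input. The Python sketch below uses a hypothetical helper, `filters_before_lowercase`, to list which of those filters sit ahead of `lowercase` in each chain. It only inspects the lists from this post and does not talk to Elasticsearch.

```python
def filters_before_lowercase(filters, case_sensitive=("english_stop", "english_stemmer")):
    """Return the case-sensitive filters that run before "lowercase" in the chain."""
    if "lowercase" not in filters:
        # No lowercase step at all: every case-sensitive filter is affected.
        return [f for f in filters if f in case_sensitive]
    cut = filters.index("lowercase")
    return [f for f in filters[:cut] if f in case_sensitive]

# Filter chains copied from the settings in this post.
korean_english = [
    "trim", "arirang_filter", "decompounder", "delimiter",
    "lowercase", "english_stop", "english_stemmer",
]
japanese_filter = [
    "kuromoji_baseform", "kuromoji_part_of_speech", "cjk_width", "stop",
    "english_stop", "delimiter", "kuromoji_stemmer", "english_stemmer", "lowercase",
]

print(filters_before_lowercase(korean_english))   # []
print(filters_before_lowercase(japanese_filter))  # ['english_stop', 'english_stemmer']
```

If that reading is right, capitalized English words such as "The" or "Searching" might be handled differently by the Japanese chain than by the Korean one, so moving `lowercase` ahead of `english_stop` there could be worth testing.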

