Analysis for multiple languages, filtering invalid characters

jschelle_2 · August 7, 2013, 9:24pm

Hi there,
I've set up analyzers for about 20 different languages using the Hunspell and ICU plugins, but I'm noticing an issue. Sometimes I have a document that contains, say, both English and Japanese characters. My current approach is to analyze it once as English into an en_doc type and once as Japanese into a ja_doc type. I had assumed that when I run it through the English analyzer it would ignore any Japanese characters and vice versa, but apparently I'm wrong. When the document goes through the English analyzer, the Japanese characters come through and the tokenization is all messed up (i.e. half words and such). Is there a way to filter out any character that is not valid for a particular language? I found UnicodeSet inside ICU, but I'm not sure how to use it in this way, and I'm sure I'm not the first to have this issue. Any suggestions? Am I going about this the wrong way?
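
For illustration, here is one way the filtering asked about above might be sketched. It is not from the original post. It assumes the `pattern_replace` char filter is available in the Elasticsearch version in use and that its `pattern` is a Java regex, in which script classes such as `\p{IsHan}` are valid. The names `strip_japanese`, `en_filtered` and `english_only_settings` are made up for this example, and the `strip_japanese` Python helper only approximates the filter locally with hard-coded code point ranges (Hiragana, Katakana and the basic CJK ideograph block), so it will miss rarer characters.

```python
import re

# Java regex intended for Elasticsearch's pattern_replace char filter
# (assumption: Java script classes are accepted there). Python's own
# `re` module does not understand \p{...}, so this stays a plain string.
JAPANESE_SCRIPTS = r"[\p{IsHan}\p{IsHiragana}\p{IsKatakana}]+"


def english_only_settings():
    """Build index settings for a hypothetical English analyzer that
    drops Japanese characters before tokenization."""
    return {
        "analysis": {
            "char_filter": {
                "strip_japanese": {  # hypothetical filter name
                    "type": "pattern_replace",
                    "pattern": JAPANESE_SCRIPTS,
                    "replacement": " ",
                }
            },
            "analyzer": {
                "en_filtered": {  # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["strip_japanese"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }


# Local approximation only: Hiragana and Katakana (U+3040-U+30FF) plus
# CJK Unified Ideographs (U+4E00-U+9FFF).
_JAPANESE_RANGES = re.compile("[\u3040-\u30ff\u4e00-\u9fff]+")


def strip_japanese(text):
    """Replace runs of Japanese characters with a space, then collapse
    whitespace, to preview what the English analyzer would receive."""
    return " ".join(_JAPANESE_RANGES.sub(" ", text).split())
```

A Hunspell token filter for English could be appended to the `filter` list of `en_filtered`, and the ja_doc side would need the mirror-image pattern (for example one that strips `\p{IsLatin}`). Replacing with a space rather than an empty string keeps the words on either side of a removed run from being glued together.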

