Analysis for multiple languages, filtering invalid characters

Hi there,
I've set up analyzers for about 20 different languages using the Hunspell and ICU plugins, but I'm noticing an issue. Sometimes I have a document that contains, say, both English and Japanese characters. My current approach is to analyze it twice: once as English into an en_doc type and once as Japanese into a ja_doc type. I had assumed that the English analyzer would ignore any Japanese characters and vice versa, but apparently I'm wrong. When the document goes through the English analyzer, the Japanese characters come through and the tokenization is all messed up (i.e. half words and such). Is there a way to filter out any character that is not valid for a particular language? I found UnicodeSet inside ICU, but I'm not sure how to use it for this, and I'm sure I'm not the first to have this issue. Any suggestions? Am I going about this the wrong way?
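
To make the question concrete, here is a rough Python sketch of the behaviour I'm after. It is only an illustration: the function names are made up, nothing here is Elasticsearch or ICU API, and the code point ranges are a simplification I'm assuming for the example (they cover Hiragana, Katakana and the main CJK ideograph block, not every character Japanese text can contain).

```python
# Hypothetical illustration only; not Elasticsearch or ICU code.
# Simplified code point ranges assumed to mean "Japanese" for this example.
JAPANESE_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def is_japanese(ch):
    """Return True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in JAPANESE_RANGES)


def strip_japanese(text):
    """Text to feed the English analyzer: Japanese characters become spaces."""
    cleaned = "".join(" " if is_japanese(ch) else ch for ch in text)
    # Collapse the runs of spaces left behind by the removed characters.
    return " ".join(cleaned.split())


def keep_only_japanese(text):
    """Text to feed the Japanese analyzer: everything else becomes spaces."""
    cleaned = "".join(ch if is_japanese(ch) else " " for ch in text)
    return " ".join(cleaned.split())


if __name__ == "__main__":
    doc = "hello 世界 world"
    print(strip_japanese(doc))
    print(keep_only_japanese(doc))
```

Replacing the unwanted characters with spaces rather than deleting them keeps the words on either side from being glued together. Presumably the same idea would have to run as some kind of character filter before the tokenizer, which is the part I don't know how to set up.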
