Combo analyzer - issue with English and Japanese text stored in the same field

I am trying to add multilingual support to Elasticsearch, and part of the
requirement is to allow the same field to store either English or Japanese
text. While researching, I stumbled upon the combo analyzer plugin, which
lets us apply two analyzers to a single field. The configuration is below;
I am using the standard analyzer for English and kuromoji for Japanese:

index :
    analysis :
        analyzer :
            my_combo :
                type : combo
                sub_analyzers : [standard, kuromoji]
                deduplication : true

Analyzing Japanese text directly with kuromoji yields correct results: curl -XGET
'localhost:9200/myindex/_analyze?analyzer=kuromoji&pretty=true' -d '最近どうですか'

But when analyzing with my_combo, the standard analyzer is also applied to
the Japanese text, which produces a token for each Japanese character (the
standard analyzer's behaviour) in addition to the tokens produced by
kuromoji:
curl -XGET 'localhost:9200/myindex/_analyze?analyzer=my_combo&pretty=true'
-d '最近どうですか'

Is there any way in which Elasticsearch can detect the Japanese language and
apply only the kuromoji analyzer to Japanese text? The other option I was
considering was to use the multi_field type and store Japanese text in a
different field altogether, but I was wondering if there is an easier way
built into Elasticsearch to handle such scenarios.
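
For reference, the multi_field mapping I was considering would look roughly
like this (field and sub-field names are just an example, not something I
have tested):

    "firstName" : {
        "type" : "multi_field",
        "fields" : {
            "firstName" : { "type" : "string", "analyzer" : "standard" },
            "ja" : { "type" : "string", "analyzer" : "kuromoji" }
        }
    }

The idea being that queries could then target firstName for English and
firstName.ja for Japanese.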

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,

I'm not sure if this is sufficient for you, but I would build my own
analyzer in the settings, based on the Japanese tokenizer and the token
filters you need, and base the tokenization on the JapaneseTokenizer. It
should tokenize the English text only on whitespace; for further processing
I'd add lowercase, word-delimiter etc. to the filter chain, and work with
only one analyzer.
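
Something along these lines, as a sketch (the analyzer name is made up, and
you'd pick whichever kuromoji filters you actually need):

index :
    analysis :
        analyzer :
            my_ja_en :
                type : custom
                tokenizer : kuromoji_tokenizer
                filter : [lowercase, kuromoji_baseform, kuromoji_stemmer]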

does this make sense?

simon

On Friday, February 1, 2013 7:07:42 PM UTC+1, hemant pahilwani wrote:


Hey... I tried out the suggested approach but ran into an issue with how
wildcard search works. I have been using the standard analyzer and its
default filters for tokenization, as I require regular prefix wildcard
search on English words. For example, if "takanori" is stored in the index,
then searching for tak* gives me the document with takanori. Sample query below:

{
  "query" : {
    "query_string" : {
      "query" : "tak*",
      "fields" : [ "firstName" ],
      "use_dis_max" : true,
      "analyze_wildcard" : true
    }
  }
}

Based on the suggestion above, I created a custom analyzer with the Japanese
tokenizer, added the standard filter to it, and mapped the firstName field
to use it:

        my_default_analyzer :
            type : custom
            tokenizer : kuromoji_tokenizer
            filter : [kuromoji_baseform, kuromoji_part_of_speech, kuromoji_readingform, kuromoji_stemmer, standard]

But running the same wildcard query no longer returns any results. It does
return results if I search for the whole string "takanori".

My suspicion is the token type that gets associated: "word" in the case of
my_default_analyzer, and "" in the case of the standard analyzer.

$ curl -XGET
'localhost:9200/myindex/_analyze?analyzer=my_default_analyzer&pretty=true'
-d 'Takanori'
{
"tokens" : [ {
"token" : "takanori",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
} ]
}
$ curl -XGET
'localhost:9200/myindex/_analyze?analyzer=standard&pretty=true' -d
'Takanori'
{
"tokens" : [ {
"token" : "takanori",
"start_offset" : 0,
"end_offset" : 8,
"type" : "",
"position" : 1
} ]
}

Any suggestions on how to make wildcard search work in this case? Or maybe
I am missing something in the configuration?


Maybe I don't fully understand, but "Takanori" is written in Rōmaji. The
Kuromoji analyzer is for Kanji.

Best regards,

Jörg


The same behavior is observed if I use an English word:

$curl -XGET
'localhost:9200/myindex/_analyze?analyzer=my_default_analyzer&pretty=true'
-d 'something'
{
"tokens" : [ {
"token" : "something",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
} ]

}

$ curl -XGET
'localhost:9200/myindex/_analyze?analyzer=standard&pretty=true' -d
'something'
{
"tokens" : [ {
"token" : "something",
"start_offset" : 0,
"end_offset" : 9,
"type" : "",
"position" : 1
} ]
}
