Combo analyzer - issue with English and Japanese text stored in the same field

I am trying to add multilingual support to Elasticsearch, and part of the
requirement is to allow the same field to store either English or Japanese
text. While researching, I stumbled upon the combo analyzer plugin, which
lets us apply two analyzers to a single field. The configuration is below;
I am using the standard analyzer for English and kuromoji for Japanese:

index :
    analysis :
        analyzer :
            my_combo :
                type : combo
                sub_analyzers : [standard, kuromoji]
                deduplication : true

Analyzing Japanese text directly with kuromoji yields correct results: curl -XGET
'localhost:9200/myindex/_analyze?analyzer=kuromoji&pretty=true' -d '最近どうですか'

But when analyzing with my_combo, the standard analyzer is also applied to
the Japanese text, which produces a token for each Japanese character (the
standard analyzer's behaviour) in addition to the tokens produced by
kuromoji:
curl -XGET 'localhost:9200/myindex/_analyze?analyzer=my_combo&pretty=true'
-d '最近どうですか'

Is there any way in which Elasticsearch can detect the Japanese language and
apply only the kuromoji analyzer to Japanese text? The other option I was
considering was to use the multi_field type and store Japanese text in a
different field altogether, but I was wondering if there is an easier way
built into Elasticsearch to handle such scenarios.
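
For reference, the multi_field mapping I was considering would look roughly
like this (field and sub-field names are just an example, not something I
have tested):

    "firstName" : {
        "type" : "multi_field",
        "fields" : {
            "firstName" : { "type" : "string", "analyzer" : "standard" },
            "ja" : { "type" : "string", "analyzer" : "kuromoji" }
        }
    }

The idea being that queries could then target firstName for English and
firstName.ja for Japanese.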

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,

I'm not sure if this is sufficient for you, but I would build my own
analyzer in the settings, based on the Japanese tokenizer and the token
filters you need, and base the tokenization on the JapaneseTokenizer. It
should tokenize the English text only on whitespace; for further processing
I'd add lowercase, word-delimiter etc. to the filter chain, and work with
only one analyzer.
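
Something along these lines, as a sketch (the analyzer name is made up, and
you'd pick whichever kuromoji filters you actually need):

index :
    analysis :
        analyzer :
            my_ja_en :
                type : custom
                tokenizer : kuromoji_tokenizer
                filter : [lowercase, kuromoji_baseform, kuromoji_stemmer]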

does this make sense?

simon

On Friday, February 1, 2013 7:07:42 PM UTC+1, hemant pahilwani wrote:


Hey... I tried out the suggested approach but ran into an issue with how
wildcard search works. I have been using the standard analyzer and its
default filters for tokenization, as I require regular prefix wildcard
search on English words. For example, if "takanori" is stored in the index,
then searching for tak* gives me the document with takanori. Sample query below:

{
  "query" : {
    "query_string" : {
      "query" : "tak*",
      "fields" : [ "firstName" ],
      "use_dis_max" : true,
      "analyze_wildcard" : true
    }
  }
}

Based on the suggestion above, I created a custom analyzer with the Japanese
tokenizer, added the standard filter to it, and mapped the firstName field
to use it:

        my_default_analyzer :
            type : custom
            tokenizer : kuromoji_tokenizer
            filter : [kuromoji_baseform, kuromoji_part_of_speech, kuromoji_readingform, kuromoji_stemmer, standard]

But running the same wildcard query no longer returns any results. It does
return results if I search for the whole string "takanori".

My suspicion is the token type that gets associated: "word" in the case of
my_default_analyzer, and "" in the case of the standard analyzer.

$ curl -XGET
'localhost:9200/myindex/_analyze?analyzer=my_default_analyzer&pretty=true'
-d 'Takanori'
{
"tokens" : [ {
"token" : "takanori",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
} ]
}
$ curl -XGET
'localhost:9200/myindex/_analyze?analyzer=standard&pretty=true' -d
'Takanori'
{
"tokens" : [ {
"token" : "takanori",
"start_offset" : 0,
"end_offset" : 8,
"type" : "",
"position" : 1
} ]
}

Any suggestions on how to make wildcard search work in this case? Or maybe
I am missing something in the configuration?


Maybe I don't fully understand, but "Takanori" is written in Rōmaji. The
Kuromoji analyzer is for Kanji.

Best regards,

Jörg


The same behavior is observed if I use an English word:

$curl -XGET
'localhost:9200/myindex/_analyze?analyzer=my_default_analyzer&pretty=true'
-d 'something'
{
"tokens" : [ {
"token" : "something",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
} ]

}

$ curl -XGET
'localhost:9200/myindex/_analyze?analyzer=standard&pretty=true' -d
'something'
{
"tokens" : [ {
"token" : "something",
"start_offset" : 0,
"end_offset" : 9,
"type" : "",
"position" : 1
} ]
}
