CJK and Thai analyzer customization

Hello,

We use Elasticsearch 0.20 to index short texts in many languages. For most
languages we have configured a custom analyzer in the index settings: a
whitespace tokenizer plus a pattern filter.
But there is a problem with Chinese, Japanese and Thai. The cjk and thai
analyzers in ES are not suitable for our needs: they contain the standard
tokenizer, which removes symbols and punctuation marks, and we want to
replace the standard tokenizer with a whitespace tokenizer.
Can you give me some advice?
How can the cjk and thai analyzers be customized in ES?
Is it possible to configure a custom analyzer built from CJKBigramFilter or
ThaiWordFilter in the index settings, or do we have to prepare a plugin, or
are there other possibilities?
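
For reference, a simplified sketch of what our per-language analyzers look
like today. The analyzer and filter names and the regex are only placeholders
(the real pattern filter differs per language), and I am assuming the
pattern_replace token filter here; as written it just normalizes non-breaking
spaces inside tokens:

{
  "analysis": {
    "analyzer": {
      "lang_cust": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": ["lang_pattern"]
      }
    },
    "filter": {
      "lang_pattern": {
        "type": "pattern_replace",
        "pattern": "\\u00A0",
        "replacement": " "
      }
    }
  }
}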

Thank you.

Lukas


If you look at the source of the CJKAnalyzer, you will notice that it is
basically a CJKTokenizer followed by a StopFilter. The heart of the analyzer
is the CJKTokenizer, not the standard tokenizer, so it simply cannot be
replaced. You can modify the source and build your own plugin. I assume that
most language analyzers are the same.

--
Ivan


Hello Ivan,

Thank you for your reply.
As of Lucene 3.6, CJKAnalyzer is composed of StandardTokenizer,
CJKWidthFilter, LowerCaseFilter, CJKBigramFilter and StopFilter.
I would like to replace the StandardTokenizer with a WhitespaceTokenizer and
remove the StopFilter.
In the index settings, something like:
{
  "analysis": {
    "analyzer": {
      "cjk_cust": {
        "filter": ["cjk_width", "lowercase", "cjk_bigram"],
        "type": "custom",
        "tokenizer": "whitespace"
      }
    }
  }
}
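
We would then point the relevant fields at this analyzer in the mapping,
roughly like this (the type and field names below are just examples):

{
  "mappings": {
    "doc": {
      "properties": {
        "text_zh": { "type": "string", "analyzer": "cjk_cust" },
        "text_ja": { "type": "string", "analyzer": "cjk_cust" }
      }
    }
  }
}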

Can this be achieved in a simpler way than developing a new plugin with a
custom analyzer for cjk and thai?

Lukas


Hi Lukas,

Sorry about the confusion. The CJKAnalyzer did in fact change with the 3.6
release; I still have 3.5 in my classpath. Interestingly, the old analyzer is
now deprecated.

Your solution looks correct. I would swap the positions of the lowercase and
cjk_width filters to be consistent with the original analyzer. You might want
to look at the pattern tokenizer if the whitespace tokenizer is too lenient
with word boundaries.
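
Something along these lines, assuming the built-in pattern tokenizer; the
tokenizer name and the regex are only placeholders (as written it splits on
whitespace runs, so it behaves like the whitespace tokenizer until you tighten
the regex to the boundaries you need):

{
  "analysis": {
    "analyzer": {
      "cjk_cust": {
        "type": "custom",
        "tokenizer": "cjk_break",
        "filter": ["lowercase", "cjk_width", "cjk_bigram"]
      }
    },
    "tokenizer": {
      "cjk_break": {
        "type": "pattern",
        "pattern": "\\s+"
      }
    }
  }
}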

Cheers,

Ivan
