We use Elasticsearch 0.20 to index short texts in many languages. For most languages we have configured a custom analyzer (whitespace tokenizer plus a pattern filter) in the index settings.
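For illustration, those settings look roughly like this; the index name, the analyzer name and the pattern_replace filter below are only placeholders for what we actually use:

# sketch of the per-language settings; all names and the regex are placeholders
# (here the pattern filter just strips zero-width spaces, purely as an example)
curl -XPUT 'localhost:9200/texts' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "cleanup": {
          "type": "pattern_replace",
          "pattern": "\\u200B",
          "replacement": ""
        }
      },
      "analyzer": {
        "lang_default": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["cleanup"]
        }
      }
    }
  }
}'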
But there is a problem with Chinese, Japanese and Thai: the cjk and thai analyzers in ES are not suitable for our needs, because they contain the standard tokenizer, which removes symbols and punctuation marks. We would like to replace the standard tokenizer with the whitespace tokenizer.
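One way to see the difference is the _analyze API: with the cjk analyzer the punctuation never shows up in the returned tokens, while a plain whitespace tokenization keeps it (the index name and the sample text below are only examples):

# compare the built-in cjk analyzer with the built-in whitespace analyzer
# ('texts' and the sample text are only examples)
curl -XGET 'localhost:9200/texts/_analyze?analyzer=cjk&pretty' -d '日本語のテキスト！ C++'
curl -XGET 'localhost:9200/texts/_analyze?analyzer=whitespace&pretty' -d '日本語のテキスト！ C++'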
Please, can you give me some advice? How can the cjk and thai analyzers be customized in ES? Is it possible to configure a custom analyzer built from CJKBigramFilter or ThaiWordFilter in the index settings, or do we have to prepare a plugin, or are there other possibilities?
If you look at the source of the CJKAnalyzer, you will notice that it is basically a CJKTokenizer followed by a StopFilter. The heart of the analyzer is the CJKTokenizer, not the standard tokenizer, so it simply cannot be replaced. You can modify the source and build your own plugin. I assume that most language analyzers are the same.
Thank you for the reply.
Since Lucene 3.6, CJKAnalyzer is composed of StandardTokenizer, CJKWidthFilter, LowerCaseFilter, CJKBigramFilter and StopFilter.
I would like to replace the StandardTokenizer with the WhitespaceTokenizer and remove the StopFilter.
In the index settings, something like:
{
  "analysis": {
    "analyzer": {
      "cjk_cust": {
        "filter": [
          "cjk_width", "lowercase", "cjk_bigram"
        ],
        "type": "custom",
        "tokenizer": "whitespace"
      }
    }
  }
}
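If these filter names are accepted, I suppose I can just pass the block above as the index settings and check the output with _analyze, roughly like this (the index name and the sample text are only examples):

# create an index with the cjk_cust analyzer from above, then inspect its tokens
# ('texts_cjk' and the sample text are only examples)
curl -XPUT 'localhost:9200/texts_cjk' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "cjk_cust": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["cjk_width", "lowercase", "cjk_bigram"]
        }
      }
    }
  }
}'
curl -XGET 'localhost:9200/texts_cjk/_analyze?analyzer=cjk_cust&pretty' -d '日本語のテキスト！'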
Can this be achieved in a simpler way than developing a new plugin with custom analyzers for cjk and thai?
Lukas
Sorry about the confusion. The CJKAnalyzer did in fact change with the 3.6 release; I still have 3.5 in my classpath. Interestingly, the old analyzer is now deprecated.
Your solution looks correct. I would swap the positions of the lowercase and cjk_width filters to be consistent with the original analyzer. You might want to look at the pattern tokenizer if the whitespace tokenizer is too lenient with word boundaries.
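For example, something along these lines, where the regex is only an illustration of a custom word-boundary definition:

# pattern tokenizer splitting on ASCII and ideographic whitespace, as an illustration;
# the regex is a placeholder you would tune to your own notion of a word boundary
curl -XPUT 'localhost:9200/texts_cjk2' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "boundary_split": {
          "type": "pattern",
          "pattern": "[\\s\\u3000]+"
        }
      },
      "analyzer": {
        "cjk_cust": {
          "type": "custom",
          "tokenizer": "boundary_split",
          "filter": ["lowercase", "cjk_width", "cjk_bigram"]
        }
      }
    }
  }
}'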