Asian characters and not words are tokenized - CJK Analysis and Tokenization Problems

Wolf_2 · March 10, 2011, 4:00am

I am using the CJK analyzer to index Japanese-language content into
Elasticsearch. However, the input is tokenized into individual characters
rather than words. The CJK analyzer does not have this problem in a
standalone Lucene demonstration program (p. 147, "Lucene in Action",
second edition, Manning 2010). I also do not have any problems with Sen
in an independent Java demonstration. The following curl commands
demonstrate my problem.

curl -XPUT 'http://localhost:9200/twitter_jp' -d '{
"index" : {
"numberOfShards" : 1,
"numberOfReplicas" : 1,
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"type" : "cjk"
}
}
}
}
}'

curl -XPUT 'http://localhost:9200/twitter_jp/tweet/1' -d '{
"user": "wolfkden",
"postDate": "Tue Mar 8 22:56:57 PST 2011",
"text": "トヨタ自動車は、ハイブリッド車（ＨＶ）「プリウス」のワゴン型試作車「プリウス・スペースコンセプト」を国内で初めて公開した。"
}'

curl -XPUT 'http://localhost:9200/twitter_jp/tweet/2' -d '{
"user": "wolfkden",
"postDate": "Tue Mar 8 22:56:57 PST 2011",
"text": "自動車メーカーで生産世界首位。国内販売で「プリウス」が好調も、世界的には大規模リコール問題での信頼回復が急務"
}'

curl -XGET 'http://localhost:9200/twitter_jp/tweet/_search' -d '{
"query" : {
"match_all" : { }
},
"facets" : {
"tag" : {
"terms" : {
"field" : "text",
"size" : 30
}
}
}
}'

The output provides counts for individual characters rather than
words:

"facets":{"tag":{"_type":"terms","missing":0,"terms":
[{"term":"車","count":2},{"term":"自","count":2},{"term":"国","count":2},
{"term":"動","count":2},{"term":"内","count":2},{"term":"ー","count":2},
{"term":"リ","count":2},{"term":"プ","count":2},{"term":"ス","count":2},
{"term":"コ","count":2},{"term":"ウ","count":2},{"term":"は","count":2},
{"term":"の","count":2},{"term":"で","count":2},{"term":"ｈｖ","count":1},
{"term":"首","count":1},{"term":"題","count":1},{"term":"頼","count":1},
{"term":"開","count":1},{"term":"販","count":1},{"term":"調","count":1},
{"term":"試","count":1},{"term":"規","count":1},{"term":"的","count":1},
{"term":"界","count":1},{"term":"産","count":1},{"term":"生","count":1},
{"term":"模","count":1},{"term":"急","count":1},{"term":"復","count":1},
{"term":"好","count":1},{"term":"大","count":1},{"term":"売","count":1},
{"term":"型","count":1},{"term":"回","count":1},{"term":"問","count":1},
{"term":"務","count":1},{"term":"初","count":1},{"term":"公","count":1},
{"term":"信","count":1}]}}}
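To illustrate why the facet shows single characters with these counts, here is a minimal Python sketch (not part of the original setup). It assumes the analyzer emits one token per Japanese character and that the terms facet counts each term once per document. The helper names `is_cjk_letter`, `char_tokens`, and `facet_counts` are hypothetical, and the Unicode ranges are a simplification that skips the fullwidth ＨＶ token.

```python
from collections import Counter

# The two tweet texts from the curl commands above.
DOCS = [
    "トヨタ自動車は、ハイブリッド車（ＨＶ）「プリウス」のワゴン型試作車"
    "「プリウス・スペースコンセプト」を国内で初めて公開した。",
    "自動車メーカーで生産世界首位。国内販売で「プリウス」が好調も、"
    "世界的には大規模リコール問題での信頼回復が急務",
]


def is_cjk_letter(ch):
    """Assumption: hiragana, katakana and common kanji letters only."""
    code = ord(ch)
    in_range = 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF
    return in_range and ch.isalpha()


def char_tokens(text):
    """One token per Japanese character (what the facet output suggests)."""
    return [ch for ch in text if is_cjk_letter(ch)]


def facet_counts(docs, tokenize):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


counts = facet_counts(DOCS, char_tokens)
shared = sorted(term for term, count in counts.items() if count == 2)
print(shared)
```

Under those assumptions, the characters shared by both tweets are the same fourteen terms that the facet reports with a count of 2.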

In fact, the word プリウス (Prius, as in Toyota Prius) occurs in the
documents. I expect only the Sen software to pick out the Prius text as a
whole word, but the CJK analyzer should at least tokenize the
two-character words correctly.

I checked the Elasticsearch code and noted that the cjk analyzer type
is instantiated through the CjkAnalysisProvider in the
AnalysisModule.processAnalyzer method. In the IndicesAnalysisService,
the cjk setting loads the ChineseAnalyzer rather than the CJKAnalyzer.
I thought this was a problem in the code and made the correction, but I
still have the same problem.

All in all, Elasticsearch still does not tokenize East Asian
words correctly.

Presently my efforts are focused on the suspicion that the token stream
from the analyzer is not used, and hence that the CJK analyzer is not
applied even though the bindings are set.

Could I get some more insight into this to solve my problem?

Regards,

Wolf

kimchy · March 10, 2011, 6:19am

I did not see where you created mappings that mark the CJK analyzer as either the default analyzer or the analyzer for the text field.

Wolf_2 · March 10, 2011, 9:21am

Thank you Shay for your quick response and personal consideration.

I noted your comment on the mapping and followed up with a mapping
declaration after the index declaration:

curl -XPUT 'http://localhost:9200/twitter_jp/_mapping' -d '{
"tweet": {
"properties": {
"user":{"type":"string"},
"postDate":{"type":"string"},
"text": { "type": "string", "index": "analyzed", "analyzer" : "cjk" }
}
}
}'

I also tried the text field mapping with "search_analyzer" : "cjk" and
with
"index": "not_analyzed", "search_analyzer": "cjk"

Switching the indexing to not_analyzed caused no tokenization to occur,
and the facet response contained two terms, one for each document. The
cjk property is being read, since a misspelling produces a Guice
error.

These results are from version 0.15.2.

I also compiled a new version of Elasticsearch that binds the CJKAnalyzer
to the cjk parameter in the IndicesAnalysisService class. This
version produced the same results (I used Guice comments to verify that the
binding code executed).

I am still concerned that I cannot find where the CJKAnalyzer.tokenStream is
referenced in the code. Possibly the default "standard" tokenizer is being
used.
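One way to tell which kind of tokenization actually reached the index is to look at the length of the Japanese terms in the facet response. The following is a rough Python sketch of that check, not an Elasticsearch API. The function names `is_cjk_letter` and `token_shape` are hypothetical, and the character ranges are a simplifying assumption.

```python
def is_cjk_letter(ch):
    """Assumption: hiragana, katakana and common kanji letters only."""
    code = ord(ch)
    in_range = 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF
    return in_range and ch.isalpha()


def token_shape(terms):
    """Guess the tokenization style from the purely Japanese facet terms."""
    lengths = {
        len(term)
        for term in terms
        if term and all(is_cjk_letter(ch) for ch in term)
    }
    if not lengths:
        return "unknown"
    if lengths == {1}:
        return "single characters"
    if lengths <= {1, 2}:
        return "bigrams"
    return "mixed"


# Terms copied from the facet response reported earlier in this thread.
print(token_shape(["車", "自", "国", "ｈｖ"]))
```

A result of "single characters" would suggest that a per-character tokenizer is in effect for the field, whichever analyzer name the mapping declares.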

Please let me know what else might be wrong with my scripts, or what other
remedies I can pursue.

Regards,

Wolf

Wolf_2 · March 10, 2011, 7:53pm

Thank you Shay for your quick response and personal consideration.

I added a more specific analysis mapping for the text field, but the problems
persist.

Regards,

Wolf

cwho · March 11, 2011, 6:03am

The prebuilt "cjk" analyzer does indeed use ChineseAnalyzer instead of
CJKAnalyzer.

Instead of using the prebuilt "cjk", try defining a new analyzer
"my_cjk" in your configuration that points to CjkAnalyzerProvider:

index:
  analysis:
    analyzer:
      my_cjk:
        type: org.elasticsearch.index.analysis.CjkAnalyzerProvider

Then, following your example, it appears to be working for me: the field is
now tokenized into the two-character bigrams that the CJKAnalyzer produces.
I'm using 0.15.1.

curl -XPUT 'http://localhost:9200/twitter_jp' -d '{
"index" : {
"numberOfShards" : 1,
"numberOfReplicas" : 1
}
}'

curl -XPUT 'http://localhost:9200/twitter_jp/_mapping' -d '{
"tweet": {
"properties": {
"user":{"type":"string"},
"postDate":{"type":"string"},
"text": { "type": "string", "index": "analyzed",
"analyzer" : "my_cjk" }
}
}
}'

curl -XPUT 'http://localhost:9200/twitter_jp/tweet/1' -d '{
"user": "wolfkden",
"postDate": "Tue Mar 8 22:56:57 PST 2011",
"text": "トヨタ自動車は、ハイブリッド車（ＨＶ）「プリウス」のワゴン型試作車「プリウス・スペースコンセプト」を国内で初めて公開した。"
}'

curl -XGET 'http://localhost:9200/twitter_jp/tweet/_search' -d '{
"query" : {
"match_all" : { }
},
"facets" : {
"tag" : {
"terms" : {
"field" : "text",
"size" : 30
}
}
}
}'

{"took":182,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},
"hits":{"total":1,"max_score":1.0,"hits":
[{"_index":"twitter_jp","_type":"tweet","_id":"1","_score":1.0,
"_source" : {
"user": "wolfkden",
"postDate": "Tue Mar 8 22:56:57 PST 2011",
"text": "トヨタ自動車は、ハイブリッド車（ＨＶ）「プリウス」のワゴン型試作車「プリウス・スペースコンセプト」を国内で初めて公開した。"
}}]},"facets":{"tag":{"_type":"terms","missing":0,"terms":[
{"term":"開し","count":1},{"term":"車は","count":1},{"term":"試作","count":1},
{"term":"自動","count":1},{"term":"型試","count":1},{"term":"国内","count":1},
{"term":"動車","count":1},{"term":"初め","count":1},{"term":"内で","count":1},
{"term":"公開","count":1},{"term":"作車","count":1},{"term":"ース","count":1},
{"term":"ン型","count":1},{"term":"ンセ","count":1},{"term":"ワゴ","count":1},
{"term":"リッ","count":1},{"term":"リウ","count":1},{"term":"ヨタ","count":1},
{"term":"ペー","count":1},{"term":"プリ","count":1},{"term":"プト","count":1},
{"term":"ブリ","count":1},{"term":"ハイ","count":1},{"term":"ド車","count":1},
{"term":"トヨ","count":1},{"term":"ッド","count":1},{"term":"タ自","count":1},
{"term":"セプ","count":1},{"term":"スペ","count":1},{"term":"スコ","count":1}
]}}}
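The facet terms in this response look like overlapping two-character windows over each run of Japanese text. As a rough cross-check, here is a small Python sketch of that bigram idea. It is a simplification written for this thread, not the Lucene CJKAnalyzer implementation: the names `is_cjk_letter`, `cjk_runs`, and `bigrams` are hypothetical, and treating punctuation and fullwidth Latin letters as run boundaries is an assumption.

```python
def is_cjk_letter(ch):
    """Assumption: hiragana, katakana and common kanji letters only."""
    code = ord(ch)
    in_range = 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF
    return in_range and ch.isalpha()


def cjk_runs(text):
    """Yield maximal runs of consecutive Japanese letters."""
    run = []
    for ch in text:
        if is_cjk_letter(ch):
            run.append(ch)
        elif run:
            yield "".join(run)
            run = []
    if run:
        yield "".join(run)


def bigrams(text):
    """Overlapping two-character tokens per run; a lone character stays as is."""
    tokens = []
    for run in cjk_runs(text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


TEXT = (
    "トヨタ自動車は、ハイブリッド車（ＨＶ）「プリウス」のワゴン型試作車"
    "「プリウス・スペースコンセプト」を国内で初めて公開した。"
)
# The 30 facet terms from the response above.
FACET_TERMS = (
    "開し 車は 試作 自動 型試 国内 動車 初め 内で 公開 作車 ース ン型 ンセ ワゴ "
    "リッ リウ ヨタ ペー プリ プト ブリ ハイ ド車 トヨ ッド タ自 セプ スペ スコ"
).split()

tokens = bigrams(TEXT)
missing = [term for term in FACET_TERMS if term not in tokens]
print(missing)
```

If `missing` comes back empty, every facet term in the response is one of the overlapping bigrams of the tweet, which matches the behaviour described for the CJKAnalyzer.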

Hope this helps.

-C

Wolf · March 11, 2011, 9:40am

Thank you cwho, you are most excellent.

I tried the mod you suggested and I got the same result in version 0.15.2.

I will also follow your lead on adding new custom analyzers from previous posts.

Regards,

Wolf
