Continuing the discussion from ElasticSearch standard Analyzer - problematic case:
I have read this article, but no one has replied to it. I am facing the same issue; please provide a solution.
[quote="Igor_Romanov, post:1, topic:16782"]
Hi,
I was analyzing some odd analyzer behavior, trying to understand why it happens and how to fix it.
Here are the tokens I get from the standard analyzer for the text "myemail@email.com:test1234":
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'myemail@email.com:test1234'
{
  "tokens" : [ {
    "token" : "myemail",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "email.com:test1234",
    "start_offset" : 8,
    "end_offset" : 26,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
So the question is: why do I get "email.com:test1234" back as a single token?
Why is it not split into tokens on the characters . : _ and ?
The standard tokenizer does not split on these four characters.
Can anyone help me out with this issue?
Is it a bug or not?
[/quote]
Thanks,
sanjay
Hi Sanjay,
What do you want to happen in this case? What are the search requirements? Can you give some examples of queries you would expect to match this email address in the text, and other which would you expect not to match it?
Tom
I just want to tokenize it with the standard analyzer.
When I index text like "test.elastic with test:lucene", it should be tokenized on the dot and colon characters as well, which it is not doing.
So the expected tokens are:
test
elastic
lucene
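One way to get that behavior without giving up the standard tokenizer (which handles non-Latin scripts) is a `pattern_replace` character filter that turns `.` and `:` into spaces before tokenization. This is only a sketch, not something from the thread; the index name `test_index` and the names `split_punct` and `standard_split` are made up for illustration:

```shell
# Sketch: custom analyzer = pattern_replace char filter + standard tokenizer.
# The char filter rewrites '.' and ':' to a space, so the standard tokenizer
# then splits at those positions. Names below are illustrative only.
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "split_punct": {
          "type": "pattern_replace",
          "pattern": "[.:]",
          "replacement": " "
        }
      },
      "analyzer": {
        "standard_split": {
          "type": "custom",
          "char_filter": ["split_punct"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

# Then inspect the tokens it produces:
curl 'localhost:9200/test_index/_analyze?analyzer=standard_split&pretty' \
  -d 'test.elastic with test:lucene'
```

Since each replaced character maps to exactly one space, the filtered text stays the same length as the original, so token offsets remain aligned with the source text.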
The Simple Analyzer does what you want, I think?
curl "localhost:9200/_analyze?analyzer=simple&text=myemail@email.com:test1234&pretty"
{
  "tokens" : [
    {
      "token" : "myemail",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "email",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "com",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 3
    }
  ]
}
Yes, but the simple analyzer only works for the English language; it does not work for other languages like Gujarati, Hindi, Urdu, French, etc.
So I cannot drop the standard analyzer; I just want it to also tokenize on those characters.
Thanks,
Not yet; I am still looking for the reason.
Do you know why this is happening with these four characters?
I guess because the designers of the standard analyzer decided to preserve e.g. domain names as single terms?
Text analysis in the real world is often a compromise and should be driven by your search requirements.
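For what it's worth, that design choice is made explicit elsewhere: the built-in `uax_url_email` tokenizer goes further in the same direction and deliberately keeps whole email addresses and URLs as single tokens. A quick sketch of checking that via the `_analyze` API:

```shell
# Sketch: the uax_url_email tokenizer is designed to emit an email
# address such as "myemail@email.com" as a single token rather than
# splitting it at the '@'.
curl -XGET 'localhost:9200/_analyze?tokenizer=uax_url_email&pretty' \
  -d 'myemail@email.com:test1234'
```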
No, it's not working: if I index the text "something.is.missing", it is treated as one whole word.
It does not break the word apart.
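As a quick sanity check, the built-in `letter` tokenizer does split on every non-letter character, so it breaks "something.is.missing" apart. A sketch via the `_analyze` API (note the trade-off: the letter tokenizer also discards digits, so "test1234" would lose its numeric part):

```shell
# Sketch: the letter tokenizer splits on any non-letter character,
# so "something.is.missing" should come back as the tokens
# something / is / missing.
curl -XGET 'localhost:9200/_analyze?tokenizer=letter&pretty' \
  -d 'something.is.missing'
```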
system
Closed January 10, 2018, 11:40am
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.