ElasticSearch standard Analyzer - problematic case


(Igor Romanov) #1

Hi

I was looking into some odd analyzer behaviour, and I am trying to understand
why it happens and how to fix it.

Here are the tokens I get from the standard analyzer for the text:
"myemail@email.com:test1234"

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'myemail@email.com:test1234'
{
  "tokens" : [ {
    "token" : "myemail",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "",
    "position" : 1
  }, {
    "token" : "email.com:test1234",
    "start_offset" : 8,
    "end_offset" : 26,
    "type" : "",
    "position" : 2
  } ]
}

So my question is: why am I getting "email.com:test1234" as a single token?

Why is it not divided into tokens at . and : ?

And what analyzer/tokenizer/filter can I use to help with this?

Thanks,
Igor

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/826eb584-3408-404a-b87c-2c44e455bb65%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Robert Muir-2) #2

The standard analyzer doesn't really know anything about emails/URLs;
it's just implementing the Unicode tokenization rules.
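
To illustrate why those rules keep "email.com:test1234" together, here is a rough Python sketch of a small subset of the Unicode word-break rules (UAX #29). It is not Lucene's implementation. It assumes the word-break classes of the Unicode data of that era, in which ':' is MidLetter and '.' is MidNumLet, and the names `simple_word_tokens` and `_joins` are made up for this example. The idea: a MidLetter or MidNumLet character sitting between two letters does not cause a break, while '@' belongs to no such class and always does.

```python
# Simplified subset of UAX #29 word-break classes (assumed, not exhaustive).
MID_LETTER = {":"}           # joins letter + letter
MID_NUM_LET = {".", "'"}     # joins letter + letter or digit + digit
MID_NUM = {",", ";"}         # joins digit + digit


def _joins(prev, mid, nxt):
    """Return True if `mid` does not break the word between `prev` and `nxt`."""
    if prev.isalpha() and nxt.isalpha():
        return mid in MID_LETTER or mid in MID_NUM_LET
    if prev.isdigit() and nxt.isdigit():
        return mid in MID_NUM or mid in MID_NUM_LET
    return False


def simple_word_tokens(text):
    """Split text into tokens using the simplified rules above."""
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        if ch.isalnum():
            current += ch
        elif current and i + 1 < len(text) and _joins(current[-1], ch, text[i + 1]):
            # Punctuation flanked by compatible characters stays inside the token.
            current += ch
        else:
            # Any other character ends the current token and is discarded.
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


print(simple_word_tokens("myemail@email.com:test1234"))
# ['myemail', 'email.com:test1234']
print(simple_word_tokens("test:1234"))
# ['test', '1234']
```

In this sketch the ':' only joins because it sits between two letters ("com" and "test"); with a digit right after it, as in "test:1234", the token is split.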

There is an extension of it that does know about these things (and
tries to keep them as one token)...

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-uaxurlemail-tokenizer.html

Maybe try this one and see if it works better for you?
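
If the goal is instead to split at '.' and ':' as the original question asks, another option may be the pattern tokenizer, whose default pattern is `\W+` (split on runs of non-word characters). The Python sketch below only emulates that regular expression to show the expected shape of the result. It does not call Elasticsearch, the function name `split_on_non_word` is made up for this example, and Java's `\W` can differ from Python's for non-ASCII text.

```python
import re


def split_on_non_word(text):
    """Emulate splitting on runs of non-word characters (regex \\W+)."""
    # re.split can return empty strings at the edges; drop them.
    return [token for token in re.split(r"\W+", text) if token]


print(split_on_non_word("myemail@email.com:test1234"))
# ['myemail', 'email', 'com', 'test1234']
```

Note that this also breaks the email address apart, so whether it is preferable to the URL/email tokenizer depends on how the field will be searched.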

On Thu, Apr 3, 2014 at 4:52 AM, Igor Romanov igorrom@gmail.com wrote:


