Elasticsearch standard analyzer - problematic case

Hi,

I was looking into some odd analyzer behaviour and am trying to understand why it
happens and how to fix it.

Here are the tokens I get from the standard analyzer for the text
"myemail@email.com:test1234":

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'myemail@email.com:test1234'
{
  "tokens" : [ {
    "token" : "myemail",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "",
    "position" : 1
  }, {
    "token" : "email.com:test1234",
    "start_offset" : 8,
    "end_offset" : 26,
    "type" : "",
    "position" : 2
  } ]
}

So my question is: why am I getting "email.com:test1234" as a single token?

Why is it not divided into tokens at the . and : characters?

And which analyzer/tokenizer/filter can I use to get that?

Thanks,
Igor

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/826eb584-3408-404a-b87c-2c44e455bb65%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

The standard analyzer doesn't really know anything about emails/URLs;
it just implements the Unicode tokenization rules.
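
To illustrate why "email.com:test1234" comes out as one token: the Unicode word-boundary rules (UAX #29) do not break at certain punctuation characters when they sit between two letters or between two digits, and they do not break between letters and digits. Below is a heavily simplified, ASCII-only sketch of that idea in Python. It is not Lucene's tokenizer; the character sets and the names `tokenize` and `_joins` are assumptions made for illustration. It treats ":" as a mid-letter character, which as far as I know matches the Unicode versions in use when this thread was written (newer versions may differ).

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

# Simplified stand-ins for the UAX #29 Word_Break classes (assumption: ASCII only).
MID_LETTER = {":"}    # may join two letters
MID_NUM_LET = {"."}   # may join two letters or two digits
MID_NUM = {","}       # may join two digits


def _joins(prev, mid, nxt):
    """Return True if `mid` should not cause a break between `prev` and `nxt`."""
    if prev in LETTERS and nxt in LETTERS:
        return mid in MID_LETTER or mid in MID_NUM_LET
    if prev in DIGITS and nxt in DIGITS:
        return mid in MID_NUM or mid in MID_NUM_LET
    return False


def tokenize(text):
    """Split text into tokens using the simplified rules above."""
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        if ch in LETTERS or ch in DIGITS:
            # Letters and digits never break against each other.
            current += ch
        elif current and i + 1 < len(text) and _joins(current[-1], ch, text[i + 1]):
            # Punctuation flanked by two letters (or two digits) stays in the token.
            current += ch
        else:
            # Anything else ends the current token and is dropped.
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    print(tokenize("myemail@email.com:test1234"))
    # ['myemail', 'email.com:test1234']
```

In this sketch "@" is in none of the joining sets, so it ends the first token, while "." and ":" are each flanked by letters and therefore stay inside the second one.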

There is an extension of it that does know about these things (and
tries to keep them as one token)...

Maybe try this one and see if it works better for you?
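
Two options that might be worth testing are sketched below as index settings built in Python. One uses the built-in `uax_url_email` tokenizer, which behaves like the standard tokenizer but tries to keep URLs and email addresses as single tokens (possibly the extension meant above, though that is a guess). The other uses the built-in `pattern` tokenizer, whose default pattern `\W+` splits on runs of non-word characters. The analyzer names `email_aware` and `split_on_punctuation` are hypothetical, and `emulate_pattern_tokenizer` is only a local approximation using Python's `re` module, not a call into Elasticsearch.

```python
import json
import re

# Hypothetical analyzer names; "uax_url_email", "pattern" and "lowercase"
# are built-in Elasticsearch tokenizer/filter names.
ANALYSIS_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "email_aware": {
                    "type": "custom",
                    "tokenizer": "uax_url_email",
                    "filter": ["lowercase"],
                },
                "split_on_punctuation": {
                    "type": "custom",
                    "tokenizer": "pattern",
                    "filter": ["lowercase"],
                },
            }
        }
    }
}


def emulate_pattern_tokenizer(text, pattern=r"\W+"):
    """Approximate splitting on runs of non-word characters.

    Python's \\W is Unicode-aware by default, so results can differ from
    the Java regex for non-ASCII input; for plain ASCII they agree.
    """
    return [tok for tok in re.split(pattern, text) if tok]


if __name__ == "__main__":
    # Body that could be sent when creating an index (assumption: sent as JSON).
    print(json.dumps(ANALYSIS_SETTINGS, indent=2))
    print(emulate_pattern_tokenizer("myemail@email.com:test1234"))
    # ['myemail', 'email', 'com', 'test1234']
```

Which one fits depends on whether the address should stay whole or be split at every punctuation character; running the real analyzers through the _analyze API on sample text is the only reliable check.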

(In reply to Igor Romanov's message of Thu, Apr 3, 2014 at 4:52 AM.)
