ElasticSearch standard Analyzer - problematic case

Igor_Romanov · April 3, 2014, 8:52am

Hi

I was looking into some odd analyzer behaviour and am trying to understand why
it happens and how to fix it.

Here are the tokens I get from the standard analyzer for the text:
"myemail@email.com:test1234"

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'myemail@email.com:test1234'
{
  "tokens" : [ {
    "token" : "myemail",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "",
    "position" : 1
  }, {
    "token" : "email.com:test1234",
    "start_offset" : 8,
    "end_offset" : 26,
    "type" : "",
    "position" : 2
  } ]
}

So the question is: why am I getting "email.com:test1234" as one token?

Why is it not divided into tokens at . and : ?

And which analyzer/tokenizer/filter can I use to help with this?

Thanks,
Igor

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/826eb584-3408-404a-b87c-2c44e455bb65%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Robert_Muir_2 · April 3, 2014, 4:48pm

The standard analyzer doesn't really know anything about emails/URLs,
it's just implementing the Unicode tokenization rules.
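
To make that concrete, here is a rough Python sketch of the relevant part of the Unicode word-break rules (UAX #29). It is not Lucene's implementation and covers only a tiny subset of the rules. It assumes that "." belongs to the MidNumLet class and that ":" belongs to the MidLetter class, which as far as I know was the case in the Unicode version used at the time (newer Unicode versions may treat ":" differently). The function name `simple_uax29_tokens` and the helper `_joins` are made up for this illustration.

```python
# Simplified, illustrative subset of the UAX #29 word-break rules.
# Assumption: ':' is MidLetter and '.' is MidNumLet (as in older Unicode data).
MID_LETTER = {":"}
MID_NUM_LET = {".", "'"}
MID_NUM = {",", ";"}


def _joins(prev, mid, nxt):
    """Return True if `mid` does NOT cause a break between prev and nxt."""
    # Letter (MidLetter | MidNumLet) Letter: no break (rules WB6/WB7).
    if prev.isalpha() and nxt.isalpha():
        return mid in MID_LETTER or mid in MID_NUM_LET
    # Digit (MidNum | MidNumLet) Digit: no break (rules WB11/WB12).
    if prev.isdigit() and nxt.isdigit():
        return mid in MID_NUM or mid in MID_NUM_LET
    return False


def simple_uax29_tokens(text):
    """Split text into word tokens using the simplified rules above."""
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        if ch.isalnum():
            # Letters and digits always extend the current token.
            current += ch
        elif current and i + 1 < len(text) and _joins(current[-1], ch, text[i + 1]):
            # A "mid" character surrounded by compatible characters is kept.
            current += ch
        else:
            # Anything else ends the current token and is dropped.
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


print(simple_uax29_tokens("myemail@email.com:test1234"))
# ['myemail', 'email.com:test1234']
```

Under these assumptions "@" always causes a break, while "." and ":" do not when they sit between two letters, which reproduces the two tokens from the original post.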

There is an extension of the standard analyzer that does know about these
things (and tries to keep them as one token)...

Maybe try this one and see if it works better for you?
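
The link in this reply did not survive; the extension meant is most likely the `uax_url_email` tokenizer, but that is an assumption. If the goal is instead to split at "." and ":" as asked in the original post, one option is a pattern-based tokenizer that breaks on every non-word character (`\W+`, which I believe is the default of the Elasticsearch `pattern` analyzer). The Python sketch below only imitates that idea with a regular expression so the expected tokens are easy to see; it is not Elasticsearch code, and the function name `split_on_non_word` is made up for this illustration.

```python
import re


def split_on_non_word(text):
    """Split on runs of non-word characters and lowercase each token."""
    # re.split can yield empty strings at the edges, so filter them out.
    return [token.lower() for token in re.split(r"\W+", text) if token]


print(split_on_non_word("myemail@email.com:test1234"))
# ['myemail', 'email', 'com', 'test1234']
```

The trade-off is that this kind of splitting also breaks email addresses and host names into pieces, which is the opposite of what a URL/email-aware tokenizer tries to do.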

