ElasticSearch standard Analyzer - exceptional case

Continuing the discussion from ElasticSearch standard Analyzer - problematic case:

[quote="Igor_Romanov, post:1, topic:16782"]

Hi

I have read this article but no one has replied on it.
I am facing same issue.please provide solution.

I was analyzing some analyzer weird behavior, and try to understand why it

happens and how to fix it

here what token I get for standard analyzer for text:

"myemail@email.com:test1234"

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d 

'myemail@email.com:test1234'

{
  "tokens" : [ {
    "token" : "myemail",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "email.com:test1234",
    "start_offset" : 8,
    "end_offset" : 26,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

So the question is: why am I getting "email.com:test1234" as one token?

Why is it not divided into tokens by . : _ and @?
[/quote]

The standard tokenizer is not splitting on these four characters.

Can anyone help me out on this issue? Is it a bug or not?

Thanks,
sanjay

Hi Sanjay,

What do you want to happen in this case? What are the search requirements? Can you give some examples of queries you would expect to match this email address in the text, and others which you would expect not to match it?

Tom

I just want to tokenize it with the standard analyzer.
When I add text like "test.elastic wirth test:lucene", it should be tokenized on the dot and colon characters as well, but it is not.

So the tokens should be:

test
elastic
lucene

The Simple Analyzer does what you want, I think?

curl "localhost:9200/_analyze?analyzer=simple&text=myemail@email.com:test1234&pretty"
{
  "tokens" : [
    {
      "token" : "myemail",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "email",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "com",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 3
    }
  ]
}

Yes, but the simple analyzer only works for English; it does not work for other languages like Gujarati, Hindi, Urdu, French, etc.

So I cannot remove the standard analyzer; I just want it to tokenize on those characters.

thanks,

Have you tried using a regular expression?

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-analyzer.html
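
For example, a pattern analyzer can be configured to split on whitespace plus the four characters in question. The following is only a sketch (the index and analyzer names are invented for illustration), but type, pattern, and lowercase are the documented parameters:

curl -XPUT 'localhost:9200/my_index' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "split_on_punct": {
          "type": "pattern",
          "pattern": "[\\s.:_@]+",
          "lowercase": true
        }
      }
    }
  }
}'

Note that the pattern analyzer does nothing except split on regex matches, so unlike the standard analyzer it carries no Unicode word-segmentation rules; whether that matters depends on the Gujarati/Hindi/Urdu content mentioned above.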

Not yet; I am still trying to find the reason for this.
Do you know why this is happening with these 4 characters?

I guess because the designers of the standard analyzer decided to preserve e.g. domain names as single terms?

Text analysis in the real world is often a compromise and should be driven by your search requirements.

No, it's not working: if I insert the text "something.is.missing", it takes it as a whole word.
It does not break the word up.
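
For reference, this is easy to reproduce with the _analyze API using the same pre-5.x query-string syntax as above. The output below is a sketch of what to expect: the standard tokenizer follows the Unicode word-segmentation rules (UAX #29), which treat a dot between letters as part of the same word:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'something.is.missing'
{
  "tokens" : [ {
    "token" : "something.is.missing",
    "start_offset" : 0,
    "end_offset" : 20,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}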

Hi,
Please check the link below; we need the same solution in Elasticsearch:
https://stackoverflow.com/questions/15235126/lucene-4-1-how-split-words-that-contains-dots-when-indexing
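
One way to get the equivalent of that Lucene-side fix in Elasticsearch is sketched below (the index, char filter, and analyzer names are invented for illustration): a pattern_replace character filter turns the problem characters into spaces before the standard tokenizer runs, so the standard analyzer's multilingual word segmentation is kept:

curl -XPUT 'localhost:9200/my_index' -d '
{
  "settings": {
    "analysis": {
      "char_filter": {
        "punct_to_space": {
          "type": "pattern_replace",
          "pattern": "[.:_@]",
          "replacement": " "
        }
      },
      "analyzer": {
        "standard_with_splits": {
          "type": "custom",
          "char_filter": [ "punct_to_space" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'

Because each replaced character becomes a single space, the text length is unchanged and token offsets (and therefore highlighting) stay aligned with the original input.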

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.