Continuing the discussion from ElasticSearch standard Analyzer - problematic case:
I have read this article, but no one has replied to it. I am facing the same issue; please provide a solution.
[quote="Igor_Romanov, post:1, topic:16782"]
Hi,
I was analyzing some odd analyzer behavior, trying to understand why it happens and how to fix it.
Here are the tokens I get from the standard analyzer for the text "myemail@email.com:test1234":
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'myemail@email.com:test1234'
{
  "tokens" : [ {
    "token" : "myemail",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "email.com:test1234",
    "start_offset" : 8,
    "end_offset" : 26,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
So the question is: why do I get "email.com:test1234" back as a single token?
Why is it not split into tokens on the characters . : _ and ?
The standard tokenizer does not split on these four characters.
Can anyone help me out with this issue?
Is it a bug or not?
[/quote]
Thanks,
sanjay
Hi Sanjay,
What do you want to happen in this case? What are the search requirements? Can you give some examples of queries you would expect to match this email address in the text, and other which would you expect not to match it?
Tom
I just want to tokenize it with the standard analyzer.
When I index text like "test.elastic with test:lucene", it should be tokenized on the dot and colon characters as well, which it is not doing.
So the expected tokens are:
test
elastic
lucene
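One way to get that behavior without giving up the standard tokenizer (which handles non-Latin scripts) is a `pattern_replace` character filter that turns `.` and `:` into spaces before tokenization. This is only a sketch, not something from the thread; the index name `test_index` and the names `split_punct` and `standard_split` are made up for illustration:

```shell
# Sketch: custom analyzer = pattern_replace char filter + standard tokenizer.
# The char filter rewrites '.' and ':' to a space, so the standard tokenizer
# then splits at those positions. Names below are illustrative only.
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "split_punct": {
          "type": "pattern_replace",
          "pattern": "[.:]",
          "replacement": " "
        }
      },
      "analyzer": {
        "standard_split": {
          "type": "custom",
          "char_filter": ["split_punct"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

# Then inspect the tokens it produces:
curl 'localhost:9200/test_index/_analyze?analyzer=standard_split&pretty' \
  -d 'test.elastic with test:lucene'
```

Since each replaced character maps to exactly one space, the filtered text stays the same length as the original, so token offsets remain aligned with the source text.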
The Simple Analyzer does what you want, I think?
curl "localhost:9200/_analyze?analyzer=simple&text=myemail@email.com:test1234&pretty"
{
  "tokens" : [
    {
      "token" : "myemail",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "email",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "com",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 3
    }
  ]
}
Yes, but the simple analyzer only works for the English language; it does not work for other languages like Gujarati, Hindi, Urdu, French, etc.
So I cannot drop the standard analyzer; I just want it to also tokenize on those characters.
Thanks,
Not yet; I am still looking for the reason.
Do you know why this is happening with these four characters?
I guess because the designers of the standard analyzer decided to preserve e.g. domain names as single terms?
Text analysis in the real world is often a compromise and should be driven by your search requirements.
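For what it's worth, that design choice is made explicit elsewhere: the built-in `uax_url_email` tokenizer goes further in the same direction and deliberately keeps whole email addresses and URLs as single tokens. A quick sketch of checking that via the `_analyze` API:

```shell
# Sketch: the uax_url_email tokenizer is designed to emit an email
# address such as "myemail@email.com" as a single token rather than
# splitting it at the '@'.
curl -XGET 'localhost:9200/_analyze?tokenizer=uax_url_email&pretty' \
  -d 'myemail@email.com:test1234'
```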
No, it's not working: if I index the text "something.is.missing", it is treated as one whole word.
It does not break the word apart.
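As a quick sanity check, the built-in `letter` tokenizer does split on every non-letter character, so it breaks "something.is.missing" apart. A sketch via the `_analyze` API (note the trade-off: the letter tokenizer also discards digits, so "test1234" would lose its numeric part):

```shell
# Sketch: the letter tokenizer splits on any non-letter character,
# so "something.is.missing" should come back as the tokens
# something / is / missing.
curl -XGET 'localhost:9200/_analyze?tokenizer=letter&pretty' \
  -d 'something.is.missing'
```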
system
Closed January 10, 2018, 11:40am
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.