A number followed by a dot is considered a word break?

I am looking at very strange (to me at least) behavior and am unable to find clear documentation for it.

GET _analyze
{
  "analyzer": "standard",
  "text": "server.mycopany.com"
}

Produces

{
  "tokens": [
    {
      "token": "server.mycopany.com",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

So far so good. I do not expect the standard tokenizer to break on dots that are not followed by whitespace.

But when a number creeps in before a dot:

GET _analyze
{
  "analyzer": "standard",
  "text": "server1.mycopany.com"
}

This happens:

{
  "tokens": [
    {
      "token": "server1",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "mycopany.com",
      "start_offset": 8,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

  1. Why is this happening?
  2. How can I avoid it? I don't want a number followed by a dot to break the word.

Yes, that's the expected behavior.
Note that if you analyze "server1. mycopany.", it produces two tokens: server1 and mycopany.

If you want to keep everything as is, you need to use a keyword tokenizer.
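For reference, a minimal sketch of what that could look like, written in Python only to build the JSON bodies. The analyzer name hostname_exact is made up for illustration, and the lowercase filter is an optional extra that was not part of the suggestion above; keyword and lowercase are built-in Elasticsearch names.

```python
import json

# Index settings defining a custom analyzer on top of the keyword tokenizer,
# which emits the whole input as a single token.
# "hostname_exact" is a hypothetical name chosen for this example.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "hostname_exact": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    # Optional assumption: lowercase for case-insensitive lookups.
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

# Body for the _analyze API, to try the tokenizer without creating an index.
analyze_request = {"tokenizer": "keyword", "text": "server1.mycopany.com"}

if __name__ == "__main__":
    print(json.dumps(index_settings, indent=2))
    print(json.dumps(analyze_request, indent=2))
```

The second body can be pasted after GET _analyze in the console. Keep in mind that with a keyword tokenizer only the full string matches, so a search for just "server1" would not find the document.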


You can also try creating a custom analyzer that handles this the way you want. For example, first use a pattern replace character filter to replace any full stop (and maybe commas and other special characters) that is followed by whitespace with a single space, then apply a whitespace tokenizer. Character filters run before the tokenizer, so the dots inside a hostname are left alone. That should solve the issue you are seeing now, but it may need some tweaking to handle other scenarios and edge cases.
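A rough sketch of that idea, under a few assumptions: the regular expression is my own guess at "punctuation followed by whitespace" (it also strips trailing punctuation at the end of the text), and the Python function is only a local approximation for previewing the effect, not the Elasticsearch implementation. The request body uses the built-in pattern_replace character filter and whitespace tokenizer.

```python
import json
import re
from typing import List

# Body for the _analyze API: a pattern_replace character filter followed by
# the whitespace tokenizer. The pattern itself is an assumption.
analyze_request = {
    "char_filter": [
        {
            "type": "pattern_replace",
            "pattern": "[.,]+(\\s|$)",
            "replacement": " ",
        }
    ],
    "tokenizer": "whitespace",
    "text": "Ping server1.mycopany.com, then server2.mycopany.com.",
}

# Same pattern for a local preview with Python's re module.
PUNCT_BEFORE_SPACE = re.compile(r"[.,]+(\s|$)")


def approximate_tokens(text: str) -> List[str]:
    """Replace dots/commas that precede whitespace (or end the text) with a
    space, then split on whitespace."""
    cleaned = PUNCT_BEFORE_SPACE.sub(" ", text)
    return cleaned.split()


if __name__ == "__main__":
    print(json.dumps(analyze_request, indent=2))
    print(approximate_tokens(analyze_request["text"]))
```

Note that the whitespace tokenizer does not lowercase or strip other punctuation, so a lowercase token filter and a wider character class may be needed depending on the data.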

I see. Thanks!

According to the spec, dot-delimited letter sequences (abra.kadabra) and number sequences (74.75) are not broken up.

But in abra1.kadabra the dot sits between a digit and a letter, so it is neither a letter sequence nor a number sequence, and the tokenizer breaks at the dot.
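To make the rule concrete, here is a toy model in Python. It is a deliberate simplification of the Unicode word-break rules (UAX #29) that the standard tokenizer follows: it only looks at dots and whitespace, and the function name is made up. A dot is kept only when it has letters on both sides or digits on both sides.

```python
from typing import List


def split_on_dots(text: str) -> List[str]:
    """Toy model: keep a dot inside a token only when it joins
    letter.letter or digit.digit; otherwise break the token there."""
    tokens = []
    for chunk in text.split():  # whitespace always separates tokens
        current = ""
        for i, ch in enumerate(chunk):
            if ch != ".":
                current += ch
                continue
            prev_ch = chunk[i - 1] if i > 0 else ""
            next_ch = chunk[i + 1] if i + 1 < len(chunk) else ""
            joins_letters = prev_ch.isalpha() and next_ch.isalpha()
            joins_digits = prev_ch.isdigit() and next_ch.isdigit()
            if joins_letters or joins_digits:
                current += ch  # dot stays inside the token
            else:
                if current:
                    tokens.append(current)  # dot acts as a break
                current = ""
        if current:
            tokens.append(current)
    return tokens


if __name__ == "__main__":
    for sample in ("server.mycopany.com", "server1.mycopany.com", "74.75"):
        print(sample, "->", split_on_dots(sample))
```

With this model, "server1.mycopany.com" splits into server1 and mycopany.com because the first dot has a digit on its left and a letter on its right, which matches the output shown in the question.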
