A number followed by a dot is considered a word break?

Hrusha · October 7, 2024, 9:04am

I am seeing some very strange (to me, at least) behavior and cannot find clear documentation for it.

GET _analyze
{
  "analyzer": "standard",
  "text": "server.mycopany.com"
}

Produces

{
  "tokens": [
    {
      "token": "server.mycopany.com",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

So far so good. I do not expect the standard tokenizer to break on dots that are not followed by whitespace.

But when a number creeps in before a dot:

GET _analyze
{
  "analyzer": "standard",
  "text": "server1.mycopany.com"
}

This happens:

{
  "tokens": [
    {
      "token": "server1",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "mycopany.com",
      "start_offset": 8,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Why is this happening?
How can I overcome this? I don't want a number followed by a dot to break the word.

dadoonet · October 7, 2024, 10:24am

Yes, that's the expected behavior.
Note that if you analyze server1. mycopany., it will produce 2 tokens: server1 and mycopany.

If you want to keep everything as is, you need to use a keyword tokenizer.
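
For illustration, the keyword tokenizer can be tried directly through the _analyze API. This is only a minimal sketch: the keyword tokenizer emits the entire input as a single token, so it suits fields that hold one value (such as a hostname) rather than free text.

```
GET _analyze
{
  "tokenizer": "keyword",
  "text": "server1.mycopany.com"
}
```

No token filters are configured here, so the token keeps its original case; adding a lowercase filter would be a separate choice.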

Christian_Dahlqvist · October 7, 2024, 10:35am

You can also try creating a custom analyser that handles this the way you want, e.g. first use a pattern replace character filter to replace any full stops (maybe also commas and other special characters?) followed by whitespace with just a whitespace, and then apply a whitespace tokenizer. That should solve the issue you are seeing now but may require some tweaking to handle other scenarios/edge cases.
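
A rough sketch of that idea, tested ad hoc through the _analyze API rather than defined in index settings. The regular expression, the sample text, and the added lowercase filter are assumptions made for illustration and are not part of the suggestion above; the pattern strips a full stop or comma only when it is followed by whitespace or the end of the input.

```
GET _analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "[.,](\\s|$)",
      "replacement": " "
    }
  ],
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": "Ping server1.mycopany.com. Then retry, please."
}
```

With this setup, dots inside server1.mycopany.com are left alone because they are not followed by whitespace, while trailing punctuation is removed before the whitespace tokenizer runs. Other punctuation (colons, brackets, quotes) would still stick to tokens and would need further patterns.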

Hrusha · October 7, 2024, 3:47pm

I see. Thanks!

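According to the spec (Unicode Text Segmentation, UAX #29, which the standard tokenizer follows), dot-delimited letter sequences (abra.kadabra) or number sequences (74.75) are not broken down.

But if I have abra1.kadabra, the dot sits between a digit and a letter, so it is covered by neither the letter-sequence rule nor the number-sequence rule, and the word is broken at the dot.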
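
The three cases can be compared side by side with a single request. This is a sketch that reuses the example strings from this post.

```
GET _analyze
{
  "tokenizer": "standard",
  "text": "abra.kadabra 74.75 abra1.kadabra"
}
```

Going by the behavior described in this thread, abra.kadabra and 74.75 should each come back as one token, while abra1.kadabra should be split into abra1 and kadabra.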

system · November 4, 2024, 3:48pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
