Token Filter: catenate_numbers - spaces included?

doaks · May 16, 2019, 9:04am

According to the documentation here: Word delimiter token filter | Elasticsearch Guide [8.11] | Elastic

catenate_numbers
If true causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.

I would like to know if there is a way to treat \s+ (any run of whitespace), in addition to '-', as separators to collapse when catenating number strings.

Phone numbers are often written like 07 9833-4266 and it would be good if that could be collapsed to a single string 0798334266.

Is there a way?

luiz.santos · May 24, 2019, 2:00pm

Hi @doaks,

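It seems catenate_all would work for you, combined with the keyword tokenizer so that the whole input, spaces included, reaches the filter as a single token. Please find the example below: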

GET _analyze
{
  "text": [
    "07 9833-4266",
    "+1 (407) 284-1234"
  ],
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "split_on_case_change": false,
      "split_on_numerics": false,
      "generate_word_parts": false,
      "generate_number_parts": false
    }
  ]
}

The result is:

{
  "tokens" : [
    {
      "token" : "0798334266",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "14072841234",
      "start_offset" : 14,
      "end_offset" : 30,
      "type" : "word",
      "position" : 1
    }
  ]
}
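
If you want this applied to a field rather than only tested through _analyze, the same filter settings can be wrapped in a custom analyzer. This is only a minimal sketch, assuming Elasticsearch 7.x or later (typeless mappings); the index name phone-index, the filter name phone_digits, the analyzer name phone_analyzer and the field name phone are made up for illustration:

```console
# Hypothetical names: phone-index, phone_digits, phone_analyzer, phone.
# The filter settings are the same ones used in the _analyze request above.
PUT phone-index
{
  "settings": {
    "analysis": {
      "filter": {
        "phone_digits": {
          "type": "word_delimiter",
          "catenate_all": true,
          "split_on_case_change": false,
          "split_on_numerics": false,
          "generate_word_parts": false,
          "generate_number_parts": false
        }
      },
      "analyzer": {
        "phone_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["phone_digits"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "phone": {
        "type": "text",
        "analyzer": "phone_analyzer"
      }
    }
  }
}
```

Since a text field uses its index analyzer at search time unless a search_analyzer is set, a match query on that field for 07 9833-4266 or for 0798334266 should be reduced to the same token.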

I hope it helps!

