Token Filter: catenate_numbers - spaces included?

According to the Word delimiter token filter page in the Elasticsearch Guide [8.11]:

catenate_numbers
"If true causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false."

I would like to know if there is a way to include \s+ (any number of spaces) with '-' as characters to collapse when catenating number strings.

Phone numbers are often written like 07 9833-4266 and it would be good if that could be collapsed to a single string 0798334266.

Is there a way?

Hi @doaks,

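It seems catenate_all would work for you. The keyword tokenizer keeps each input as a single token, so the word_delimiter filter sees the spaces and treats them as delimiters, just like '-'. Please find an example below: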

GET _analyze
{
  "text": [
    "07 9833-4266",
    "+1 (407) 284-1234"
  ],
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "split_on_case_change": false,
      "split_on_numerics": false,
      "generate_word_parts": false,
      "generate_number_parts": false
    }
  ]
}

The result is:

{
  "tokens" : [
    {
      "token" : "0798334266",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "14072841234",
      "start_offset" : 14,
      "end_offset" : 30,
      "type" : "word",
      "position" : 1
    }
  ]
}
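
To apply the same behavior at index time, one option is to wrap the filter in a custom analyzer and assign it to a field. The following is only a minimal sketch: the index name (phone-numbers-example), filter name (phone_catenate), analyzer name (phone_analyzer) and field name (phone) are hypothetical placeholders, not anything from your setup, so adjust them to your mapping.

```
PUT /phone-numbers-example
{
  "settings": {
    "analysis": {
      "filter": {
        "phone_catenate": {
          "type": "word_delimiter",
          "catenate_all": true,
          "split_on_case_change": false,
          "split_on_numerics": false,
          "generate_word_parts": false,
          "generate_number_parts": false
        }
      },
      "analyzer": {
        "phone_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "phone_catenate" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "phone": {
        "type": "text",
        "analyzer": "phone_analyzer"
      }
    }
  }
}
```

Because the same analyzer is applied at search time by default, a match query on that field for "07 9833-4266" should be reduced to the same single token as the indexed value.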

I hope it helps!
