Version 7.10
We recently found a bug in our analyzer involving commas in numbers.
Imagine a string like "This car has like 10,000 HP, it can go very fast".
We have a custom analyzer called full_text; the full definition is attached below.
We would like match_phrase queries for both "10,000 HP" and "10000 HP" to match this document, but the second one fails.
I understand why it's happening: the _analyze endpoint clearly shows that 10000 is at position 4 while hp is at position 6, with the 000 token in between at position 5, so the two query terms are never at consecutive positions the way match_phrase requires:
{
"tokens": [
...
{
"token": "10,000",
"start_offset": 18,
"end_offset": 24,
"type": "word",
"position": 4
},
{
"token": "10",
"start_offset": 18,
"end_offset": 20,
"type": "word",
"position": 4
},
{
"token": "10000",
"start_offset": 18,
"end_offset": 24,
"type": "word",
"position": 4
},
{
"token": "000",
"start_offset": 21,
"end_offset": 24,
"type": "word",
"position": 5
},
{
"token": "hp,",
"start_offset": 25,
"end_offset": 28,
"type": "word",
"position": 6
},
...
]
}
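For reference, the output above came from an _analyze call along these lines (my_index is just a placeholder for whichever index the analyzer is defined on):
GET my_index/_analyze
{
  "analyzer": "full_text",
  "text": "This car has like 10,000 HP, it can go very fast"
}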
I'm stuck on how to improve the analyzer to handle this edge case. Thanks for the help, and let me know if there's more information I can provide.
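To make the failure concrete, this is the shape of query that currently misses the document; description here is a stand-in for whatever field actually uses the full_text analyzer:
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "description": "10000 HP"
    }
  }
}
The "10,000 HP" variant should still match, since its analyzed tokens line up with positions 4 through 6 in the document.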
Full analyzer definition
{
"analysis": {
"filter": {
"word_delimiter_full_text": {
"split_on_numerics": "false",
"preserve_original": "true",
"catenate_words": "true",
"catenate_all": "true",
"split_on_case_change": "false",
"type": "word_delimiter",
"type_table": [
"# => ALPHA",
"@ => ALPHA",
"& => ALPHA",
"_ => ALPHANUM",
"$ => ALPHANUM"
],
"catenate_numbers": "true"
}
},
"char_filter": {
"mapping_char_filter": {
"type": "mapping",
"mappings": [
"=>'",
"=>'",
"‘=>'",
"’=>'",
"‛=>'",
"ʼn=>'",
"′=>'",
"՚=>'",
"՛=>'",
"´=>'",
"᾿=>'",
"ʹ=>'",
"ˊ=>'",
"ʼ=>'",
"=>",
"“=>\"",
"”=>\""
]
}
},
"analyzer": {
"full_text": {
"filter": [
"word_delimiter_full_text",
"lowercase"
],
"char_filter": [
"html_strip",
"mapping_char_filter"
],
"tokenizer": "full_text_tokenizer"
}
},
"tokenizer": {
"full_text_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
" ",
" ",
" ",
" "
]
}
}
}
}
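One more data point in case it helps: running the failing query text itself through the analyzer shows why the phrase can't match (again using my_index as a placeholder):
GET my_index/_analyze
{
  "analyzer": "full_text",
  "text": "10000 HP"
}
This should return 10000 at position 0 and hp at position 1, i.e. adjacent terms, whereas in the indexed document they sit at positions 4 and 6 with 000 in between, so match_phrase never sees them side by side.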