Version 7.10
We recently found a bug in our analyzer involving commas in numbers.
Imagine a string like "This car has like 10,000 HP, it can go very fast".
We have a custom analyzer called full_text; the full definition is attached below.
We would like match_phrase queries for both "10,000 HP" and "10000 HP" to match this document, but the second one fails.
I understand why it's happening: the _analyze endpoint clearly shows that 10000 is at position 4 while hp is at position 6, with the 000 token in between at position 5, so the two query terms are never at consecutive positions the way match_phrase requires:
{
"tokens": [
...
{
"token": "10,000",
"start_offset": 18,
"end_offset": 24,
"type": "word",
"position": 4
},
{
"token": "10",
"start_offset": 18,
"end_offset": 20,
"type": "word",
"position": 4
},
{
"token": "10000",
"start_offset": 18,
"end_offset": 24,
"type": "word",
"position": 4
},
{
"token": "000",
"start_offset": 21,
"end_offset": 24,
"type": "word",
"position": 5
},
{
"token": "hp,",
"start_offset": 25,
"end_offset": 28,
"type": "word",
"position": 6
},
...
]
}
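For reference, the output above came from an _analyze call along these lines (my_index is just a placeholder for whichever index the analyzer is defined on):
GET my_index/_analyze
{
  "analyzer": "full_text",
  "text": "This car has like 10,000 HP, it can go very fast"
}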
I'm stuck on how to improve the analyzer to handle this edge case. Thanks for the help, and let me know if there's more information I can provide.
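To make the failure concrete, this is the shape of query that currently misses the document; description here is a stand-in for whatever field actually uses the full_text analyzer:
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "description": "10000 HP"
    }
  }
}
The "10,000 HP" variant should still match, since its analyzed tokens line up with positions 4 through 6 in the document.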
Full analyzer definition
{
"analysis": {
"filter": {
"word_delimiter_full_text": {
"split_on_numerics": "false",
"preserve_original": "true",
"catenate_words": "true",
"catenate_all": "true",
"split_on_case_change": "false",
"type": "word_delimiter",
"type_table": [
"# => ALPHA",
"@ => ALPHA",
"& => ALPHA",
"_ => ALPHANUM",
"$ => ALPHANUM"
],
"catenate_numbers": "true"
}
},
"char_filter": {
"mapping_char_filter": {
"type": "mapping",
"mappings": [
"=>'",
"=>'",
"‘=>'",
"’=>'",
"‛=>'",
"ʼn=>'",
"′=>'",
"՚=>'",
"՛=>'",
"´=>'",
"᾿=>'",
"ʹ=>'",
"ˊ=>'",
"ʼ=>'",
"=>",
"“=>\"",
"”=>\""
]
}
},
"analyzer": {
"full_text": {
"filter": [
"word_delimiter_full_text",
"lowercase"
],
"char_filter": [
"html_strip",
"mapping_char_filter"
],
"tokenizer": "full_text_tokenizer"
}
},
"tokenizer": {
"full_text_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
" ",
" ",
" ",
" "
]
}
}
}
}
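One more data point in case it helps: running the failing query text itself through the analyzer shows why the phrase can't match (again using my_index as a placeholder):
GET my_index/_analyze
{
  "analyzer": "full_text",
  "text": "10000 HP"
}
This should return 10000 at position 0 and hp at position 1, i.e. adjacent terms, whereas in the indexed document they sit at positions 4 and 6 with 000 in between, so match_phrase never sees them side by side.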