Normalizing numbers in the analysis chain
A common question in full text search is how to deal with numbers. Ideally you could extract them and query them as real numbers within a range, but that requires a lot of analysis, and often numbers are just part of a full text search like iphone 17 or bed 1.4 m.
The problem is that users may think about numbers differently than you do when they type a search.
Are 1.4 m and 1,4 m the same? It turns out that the US and Europe use different characters to group digits and to mark the decimal fraction. On top of that, dots and commas are used interchangeably when users type into a search engine, especially when the numbers are small.
Are 007 and 7 the same? Depends on your use case.
Are 1.4 m and 1.40 m the same? Depends... you get my point.
So what can we do to normalize numbers a little bit?
For the sake of this example, let's look only at the numbers by dropping everything else from the analysis chain using the keep_types token filter:
POST _analyze
{
  "text": "makita führungsschiene 1.4 m, 1,4 m 1,40 1.40",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    }
  ]
}
This only returns the tokens that look like a number, no matter whether they contain a dot or a comma, and excludes everything else, such as regular words like makita.
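For instance, with filter_path=**.token appended to the request, the response should look roughly like this:
{
  "tokens": [
    {
      "token": "1.4"
    },
    {
      "token": "1,4"
    },
    {
      "token": "1,40"
    },
    {
      "token": "1.40"
    }
  ]
}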
Let's start by unifying the numbers that use a comma with those that use a dot.
POST _analyze
{
  "text": "makita führungsschiene 1.4 m, 1,4 m 1,40 1.40",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)\\,(\\d+)",
      "replacement": "$1.$2"
    }
  ]
}
This returns only 1.4 or 1.40 - nice! So no matter what is indexed or what the user searches for, we can now always assume a number contains a dot, thanks to the pattern_replace token filter.
If you don't care about positions, you could add a unique token filter at the end - and of course omit norms to reduce index size.
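As a quick sketch, appending the built-in unique filter to the chain from above collapses the duplicated 1.4 and 1.40 tokens into one each:
POST _analyze
{
  "text": "makita führungsschiene 1.4 m, 1,4 m 1,40 1.40",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)\\,(\\d+)",
      "replacement": "$1.$2"
    },
    "unique"
  ]
}
Norms, on the other hand, are not part of the analysis chain - you disable them per field in the mapping via "norms": false.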
You could also remove the dot and keep just the digits, but that would mean a search for 1.7 could return an iphone 17 - again, it depends whether that is wanted behaviour.
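For illustration, such a dot-removing filter could look like the sketch below - note how 1.7 and 17 end up as the very same token, which is exactly the ambiguity mentioned above:
POST _analyze
{
  "text": "iphone 17 1.7",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)\\.(\\d+)",
      "replacement": "$1$2"
    }
  ]
}
Both number tokens come back as 17.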
Next up, let's get rid of leading zeros:
POST _analyze
{
  "text": "test 007 7 700 000 0",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": "^0+(\\d+)",
      "replacement": "$1"
    }
  ]
}
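With filter_path=**.token, the response should come back roughly like this:
{
  "tokens": [
    {
      "token": "7"
    },
    {
      "token": "7"
    },
    {
      "token": "700"
    },
    {
      "token": "0"
    },
    {
      "token": "0"
    }
  ]
}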
So now 007 or 000 gets reduced to a single digit. While this may be useful, keep in mind that you may also increase ambiguity: a user searching for 007 as a part number will get back everything that contains a 7.
Now the true fun begins: removing trailing zeros without going crazy. As usual, if all you have is a regex, you are going to come up with a fancy regex - but maybe some preprocessing would have been a good idea in the first place.
POST _analyze
{
  "text": "0.100 0.1000 0.101 100 100.0 100.00 100.001",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": "^(\\d+)\\.([0-9])(0+)$",
      "replacement": "$1.$2"
    }
  ]
}
This returns (at least if you add filter_path=**.token to your request):
{
  "tokens": [
    {
      "token": "0.1"
    },
    {
      "token": "0.1"
    },
    {
      "token": "0.101"
    },
    {
      "token": "100"
    },
    {
      "token": "100.0"
    },
    {
      "token": "100.0"
    },
    {
      "token": "100.001"
    }
  ]
}
You can already see some further things to work on here. Is there really a difference between 100 and 100.0? Maybe you could remove that trailing .0 completely in such cases - and I am sure you will come up with a wonderful regex for that.
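If you want to go down that route, here is one possible sketch (my own addition, not part of the original chain) that strips a fractional part consisting only of zeros:
POST _analyze
{
  "text": "100.0 100.00 100.001",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": "^(\\d+)\\.0+$",
      "replacement": "$1"
    }
  ]
}
This turns 100.0 and 100.00 into 100 while leaving 100.001 alone - whether collapsing 100.0 into 100 is what your users expect is, again, up to your use case.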
Let's put everything together:
POST _analyze
{
  "text": "makita führungsschiene 1.4 m, 1,4 m 1,40 1.40 1.0 1.00 0.100 0.1000 0.101 0.1010 100 100.0 100.00 100.001 0.100 007 700",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)\\,(\\d+)",
      "replacement": "$1.$2"
    },
    {
      "type": "pattern_replace",
      "pattern": "^0+(\\d+)",
      "replacement": "$1"
    },
    {
      "type": "pattern_replace",
      "pattern": "^(\\d+)\\.([0-9])(0+)$",
      "replacement": "$1.$2"
    }
  ]
}
In a real analysis chain you would probably drop the keep_types filter and maybe try to group regular expressions together for speed if applicable, but this is probably a good start.
If you run the combined example and look closely at the output, you'll notice another slight bug: 0.1010 does not get reduced to 0.101. So you may need another fix for your regular expression to make this work - keep in mind that it's also OK to add another token filter if it helps readability.
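One way to fix it in a single filter - a sketch using a reluctant quantifier, so make sure to verify it against your own test data - is to strip all trailing zeros while keeping at least one digit after the dot:
POST _analyze
{
  "text": "0.100 0.1000 0.101 0.1010 100 100.0 100.00 100.001",
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [
        "<NUM>"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": "^(\\d+\\.\\d+?)0+$",
      "replacement": "$1"
    }
  ]
}
This reduces 0.1010 to 0.101 and 0.100 to 0.1, while 100, 100.0 and 100.001 stay untouched (100.00 becomes 100.0), so it could replace the last pattern_replace in the combined chain above.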
One more implementation hint: if you want to make sure that your pattern_replace filters only run against numbers, you can use the condition token filter for that.
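Here is a sketch of how that could look in real index settings. The index, filter and analyzer names are made up, the trailing-zero pattern is the adjusted one from above, and the predicate assumes that the analysis predicate script context exposes the token's type - double check that against the condition token filter docs for your Elasticsearch version.
PUT numbers-blog-test
{
  "settings": {
    "analysis": {
      "filter": {
        "decimal_comma_to_dot": {
          "type": "pattern_replace",
          "pattern": "(\\d+),(\\d+)",
          "replacement": "$1.$2"
        },
        "strip_leading_zeros": {
          "type": "pattern_replace",
          "pattern": "^0+(\\d+)",
          "replacement": "$1"
        },
        "strip_trailing_zeros": {
          "type": "pattern_replace",
          "pattern": "^(\\d+\\.\\d+?)0+$",
          "replacement": "$1"
        },
        "normalize_numbers_only": {
          "type": "condition",
          "filter": [
            "decimal_comma_to_dot",
            "strip_leading_zeros",
            "strip_trailing_zeros"
          ],
          "script": {
            "source": "token.type == \"<NUM>\""
          }
        }
      },
      "analyzer": {
        "number_normalizing": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "normalize_numbers_only"
          ]
        }
      }
    }
  }
}
This way the pattern_replace filters never touch regular words, and you can drop keep_types from the production analyzer as mentioned above.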
