Hi @RS232 and welcome to the community!
Vehicle data can be frustrating to work with, and so can working out what users intend when they search.
One way to handle this could be a custom simple_pattern tokenizer with a regex that breaks each term into its alphabetic and numeric parts.
Something like:
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "([a-zA-Z]*)|([0-9]*)"
        }
      }
    }
  }
}
Then, when you analyze them:
POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "mazda 3"
}

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "3mazda"
}

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "mazda3"
}
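All three produce the same two tokens, "mazda" and "3". Only the order and offsets differ: "3mazda" emits "3" first, and "mazda3" has no gap between the two tokens. The output for "mazda 3" is: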
{
  "tokens": [
    {
      "token": "mazda",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "3",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}
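Likewise, for a 2023 Ford, "F150", "F-150", and "F 150" all split into the same tokens, "F" and "150". The output below is for "F-150"; with "F150" only the offsets of "150" shift (to 11 and 14). Note that case is preserved, because this analyzer has no lowercase filter: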
POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2023 Ford F150"
}

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2023 Ford F-150"
}
{
  "tokens": [
    {
      "token": "2023",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "Ford",
      "start_offset": 5,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "F",
      "start_offset": 10,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "150",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 3
    }
  ]
}
You will definitely need to test something like this against a larger corpus of data and real incoming search requests to verify that it behaves the way you expect.
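If it helps, here is a rough way to try the split on a batch of strings before sending anything to a cluster. This is only a sketch: Python's `re` module is not Lucene's regex engine, so it approximates the simple_pattern tokenizer rather than reproducing it, and the function name `approx_tokens` is made up for illustration. It uses `+` instead of `*` because `re.findall` would otherwise return empty matches.

```python
import re

# Approximation of the tokenizer pattern above: runs of letters or runs of digits.
# Assumption: this mirrors, but is not identical to, Lucene's simple_pattern matching.
TOKEN_RE = re.compile(r"[a-zA-Z]+|[0-9]+")


def approx_tokens(text):
    """Return the alphabetic and numeric runs found in text, in order."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    samples = ["mazda 3", "3mazda", "mazda3", "2023 Ford F150", "2023 Ford F-150"]
    for sample in samples:
        print(f"{sample!r} -> {approx_tokens(sample)}")
```

Anything that looks off here is worth confirming with the real _analyze API, since that is what actually decides how your documents and queries are tokenized.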
On the other hand, depending on your data, it may make more sense to focus on preprocessing, both at ingest and in an intermediate search API, so that you can perform your own NLP tasks. By that, I mean you can inspect the text, check whether it contains any YEAR/MAKE/MODEL combinations, and pull those out as entities to use as filters. The logic would have to know that when it sees MAZDA, it can treat it as a MAKE and then try to identify a MODEL within the rest of the context.
This gets tricky because if I search for "mazda 3 in nerf bar", you have to work out whether I mean a "mazda 3" or a "3 in nerf bar" (not to mention whether it's a nerf bar, step bar, side bar, running board, etc.).
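To make the preprocessing idea a bit more concrete, here is a minimal sketch of pulling YEAR/MAKE/MODEL out of a query string. Everything in it is an assumption for illustration: the `CATALOG` contents, the `extract_vehicle` name, and the simple 19xx/20xx year rule are all hypothetical, and a real catalog would come from your own data. It also does nothing clever about the ambiguity above; it simply takes the first model it recognizes after finding a make.

```python
import re

# Hypothetical catalog: make -> set of models, each stored as a tuple of tokens.
CATALOG = {
    "mazda": {("3",), ("6",), ("cx", "5")},
    "ford": {("f", "150"), ("mustang",)},
}

TOKEN_RE = re.compile(r"[a-zA-Z]+|[0-9]+")  # same alpha/numeric split as above
YEAR_RE = re.compile(r"(19|20)[0-9]{2}")    # assumption: four-digit 19xx/20xx years


def extract_vehicle(query):
    """Return (entities, remaining_text) for a free-text query."""
    tokens = [t.lower() for t in TOKEN_RE.findall(query)]
    entities = {"year": None, "make": None, "model": None}
    used = set()  # indexes of tokens already consumed as entities

    # First pass: take the first year-like token and the first known make.
    for i, tok in enumerate(tokens):
        if entities["year"] is None and YEAR_RE.fullmatch(tok):
            entities["year"] = int(tok)
            used.add(i)
        elif entities["make"] is None and tok in CATALOG:
            entities["make"] = tok
            used.add(i)

    # Second pass: look for a model of that make, trying longer models first.
    if entities["make"] is not None:
        models = sorted(CATALOG[entities["make"]], key=len, reverse=True)
        for i in range(len(tokens)):
            for model in models:
                span = set(range(i, i + len(model)))
                if tuple(tokens[i:i + len(model)]) == model and not (used & span):
                    entities["model"] = "".join(model)
                    used |= span
                    break
            if entities["model"] is not None:
                break

    rest = " ".join(t for i, t in enumerate(tokens) if i not in used)
    return entities, rest


if __name__ == "__main__":
    print(extract_vehicle("2023 Ford F-150 running board"))
    print(extract_vehicle("mazda 3 in nerf bar"))
```

The extracted entities could then become filters, with the remaining text used for the full-text part of the search. For "mazda 3 in nerf bar" this sketch assigns "3" to the model and leaves "in nerf bar" as the remainder, which may or may not be what the user meant.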