Reverse word order search (with shingles)

Hello dear Elasticsearch users,

Wanted to ask if anyone has a suggestion on how to handle this situation.

We have source data with phrases like "mazda 3"

And two search scenarios:

  1. Customer should be able to find it via "mazda3"
  2. Customer should be able to find it via "3mazda"

So the first scenario can be solved with the shingle filter, with the token_separator configured to an empty string.
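
Something like this is what I have in mind for that part (a minimal sketch; the index, filter, and analyzer names are just placeholders, and token_separator is set to an empty string so adjacent tokens get concatenated):

PUT my-shingle-index
{
  "settings": {
    "analysis": {
      "filter": {
        "concat_shingles": {
          "type": "shingle",
          "token_separator": ""
        }
      },
      "analyzer": {
        "concat_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "concat_shingles"]
        }
      }
    }
  }
}

With that, "mazda 3" produces the tokens "mazda", "mazda3", and "3", so "mazda3" matches, but "3mazda" still doesn't.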

But I'm not sure what the pragmatic approach would be for the second scenario.

I've looked at the reverse filter, but it reverses the characters within each token, so reversing the shingle "mazda3" would result in "3adzam" (am I right?)

Another option is to reverse the data during indexing and add a separate field for it (but that could complicate queries, and the index size grows)
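
(If I went that route, it would probably look like a second text field, with the word order reversed in the application before indexing; the names below are just placeholders:)

PUT my-index
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "title_reversed": { "type": "text" }
    }
  }
}

PUT my-index/_doc/1
{
  "title": "mazda 3",
  "title_reversed": "3 mazda"
}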

Any thoughts, ideas are highly appreciated.

Hi @RS232 and welcome to the community!

Vehicle information can be frustrating and so can users' intent while searching.

One way to handle this could be to use a custom simple_pattern tokenizer with a regex that breaks each term apart into its alphabetic and numeric parts.

Something like:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "([a-zA-Z]*)|([0-9]*)"
        }
      }
    }
  }
}

Then, when you analyze them:

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "mazda 3"
}

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "3mazda"
}

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "mazda3"
}

All of those will token out to:

{
  "tokens": [
    {
      "token": "mazda",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "3",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}

Likewise, a 2023 Ford "F150", "F-150", and "F 150" all token out the same as well:

POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2023 Ford F150"
}
POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2023 Ford F-150"
}
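POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2023 Ford F 150"
}

All three yield the same tokens (the offsets shift slightly for "F150"):
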
{
  "tokens": [
    {
      "token": "2023",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "Ford",
      "start_offset": 5,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "F",
      "start_offset": 10,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "150",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 3
    }
  ]
}

You will definitely need to test something like this against a larger corpus of data and incoming search requests to verify.

On the other hand, depending on your data, it may make more sense to put the focus on preprocessing, both at ingest and in an intermediate search API, in order to perform your own NLP tasks. By that, I mean you can inspect the text, determine whether it contains any YEAR/MAKE/MODEL combinations, and pull those out as entities to use in a filter. In that logic, you would have to know that when you see MAZDA you can take it as a MAKE, and then see if you can identify a MODEL within the rest of the context.
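
As a rough sketch of what the search side could look like once the entities are extracted (the index and field names here are made up):

POST vehicles/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "make": "mazda" } },
        { "term": { "model": "3" } }
      ]
    }
  }
}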

This then gets tricky, because if I search for "mazda 3 in nerf bar", you have to understand the difference between a "mazda 3" and/or a "3 in nerf bar" (not to mention whether it's a nerf bar, step bar, side bar, running board, etc.) :sweat_smile:

Thanks a lot for the welcome :hugs:

And thanks for the suggestion :+1: it also got me to find the word_delimiter filter.

Overall I think I can use regex/word_delimiter to extract tokens (I have a feeling this should be done at query time; at least that's what the docs suggest)

This will probably work for mazda3/3mazda since we have a nice way to split it. :+1:
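
A quick sanity check of that (using word_delimiter_graph, which the docs recommend over the older word_delimiter):

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter_graph" ],
  "text": "3mazda"
}

That splits "3mazda" into "3" and "mazda", and "mazda3" into "mazda" and "3".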

For something like

carDealer/dealercar
automechanic/mechanicauto

This will probably not work that well, but I can add a shingle filter at some point (to concatenate words; it works in one direction, but I guess I'll have to settle for that for now)

Thanks again for the ideas, Eddie!
Great stuff :+1:

Yea, if your terms come in as camelCase or PascalCase, then the word delimiter will work well there too. However, all-lowercase input does present an issue; you almost need true word segmentation, like this one in Python.

Another option is to layer the filters: a shingle filter, followed by a pattern_replace that strips the whitespace the shingle filter leaves in the combined token:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "shingle"
    },
    {
      "type": "pattern_replace",
      "pattern": " ",
      "replacement": ""
    }
  ],
  "text": "auto mechanic"
}

Yields:

{
  "tokens": [
    {
      "token": "auto",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "automechanic",
      "start_offset": 0,
      "end_offset": 13,
      "type": "shingle",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "mechanic",
      "start_offset": 5,
      "end_offset": 13,
      "type": "word",
      "position": 1
    }
  ]
}

Yup, thanks for the reply Eddie!

Good ideas indeed. Will probably have to shingle stuff.

Now one thing that bugs me (and I can't find the answer): is there any way in Elasticsearch to generate direct and reverse shingles?

ex: the tokens "mazda" and "3" should generate
direct shingles: "mazda", "mazda3", "3"
reverse-order shingles: "3", "3mazda", "mazda"

Not sure though if it is possible?
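
One idea I want to try (an untested sketch): chain reverse → shingle → reverse. The two reverse filters cancel out character-wise, but the shingle is built from the reversed tokens, so the concatenation order flips:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "reverse",
    {
      "type": "shingle",
      "token_separator": ""
    },
    "reverse"
  ],
  "text": "mazda 3"
}

If I have that right, it should yield "mazda", "3mazda", and "3": exactly the reverse-order shingles above.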
