How to handle unit suffixes (e.g. 16GB vs 16 GB)?

I am using ElasticSearch for a webshop with a lot of products that have values with unit suffixes in their names, such as the following:

  • Samsung 960 EVO 500 GB
  • Samsung 960 PRO 500GB

Is there a clever way to get ElasticSearch to match both of them when the user searches for "500GB" or "500 GB"? I have done a lot of googling and reading through most of the ElasticSearch documentation, but to no avail. :confused:
I was wondering whether adding the typical RAM/HDD sizes as synonyms (i.e. "16GB" <=> "16 GB", etc.) could be a solution, roughly like the sketch below, but maybe there is an even smarter solution that can figure this out automatically by taking the term positions into account?
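
For reference, this is roughly the synonym setup I have in mind (just a sketch; the filter and analyzer names are placeholders, and the size list would have to be maintained by hand):

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "size_synonyms": {
          "type": "synonym",
          "synonyms": [
            "16gb, 16 gb",
            "500gb, 500 gb"
          ]
        }
      },
      "analyzer": {
        "size_synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "size_synonyms"]
        }
      }
    }
  }
}

The obvious drawback is that I would need to enumerate every size/unit combination up front, which is why I am hoping there is something more automatic.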

Hi,

Synonyms sound like a good start to me, but it sounds like they wouldn't generalize well.

I have a hunch that rather than expanding "500GB" to "500 GB", it would help to join two adjacent tokens when the first is a number and the next is a common unit; otherwise you get a lot of search matches on things like "GB" alone. Unfortunately, the default tokenizers don't do this, and there seems to be no out-of-the-box token filter that looks at adjacent tokens in a token stream to allow such transformations.

One option to investigate might be to use a Shingle Token Filter, maybe with 2-grams on a separate multi-field, but I have a feeling this will require some tweaking until the results feel right. I think it's worth trying though; a rough sketch of what I mean is below.
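
To make that concrete, here is the kind of shingle setup I have in mind (the names are made up, and the empty token_separator is just one way of gluing adjacent tokens back together; I haven't tuned any of this):

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true,
          "token_separator": ""
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_shingles"]
        }
      }
    }
  }
}

With that chain, "Samsung 960 EVO 500 GB" should also produce a "500gb" token (next to the unigrams and the other 2-grams such as "960evo"), so a lowercased search for "500GB" can match it. The downside is that every adjacent pair gets indexed, which inflates the index and is part of the tweaking I mentioned.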

I experimented a bit with this, and I recommend that you check out the word_delimiter token filter:

https://www.elastic.co/guide/en/elasticsearch/reference/6.0/analysis-word-delimiter-tokenfilter.html

Here's an example:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["my_word_delim"]
        }
      },
      "filter": {
        "my_word_delim": {
          "type": "word_delimiter"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Samsung 960 PRO 500GB"
}

Which, when I ran it, gives:

{
  "tokens" : [
    {
      "token" : "Samsung",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "960",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "PRO",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "500",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "GB",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "word",
      "position" : 4
    }
  ]
}

Notice that "500GB" was split into "500" and "GB". You should then be able to use a match_phrase query for "500 GB" (combined with a lowercase filter, etc.). A sketch of how that could look is below.
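
For completeness, here is roughly how that could be wired up, building on the index above (the "product" type, the "name" field and the added lowercase filter are my assumptions, not something from your setup):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["my_word_delim", "lowercase"]
        }
      },
      "filter": {
        "my_word_delim": {
          "type": "word_delimiter"
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "500 GB"
    }
  }
}

Since the same analyzer runs at query time, both "500 GB" and "500GB" end up as the adjacent tokens "500" and "gb", so the phrase query should match "Samsung 960 EVO 500 GB" as well as "Samsung 960 PRO 500GB".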
