Shingle filter to allow mismatching spaces

I am trying to solve a problem where users sometimes include an extra space in their search terms, or omit a space that is present in the indexed text. To handle this, I attempted to use the shingle filter with an empty token separator, so that each pair of adjacent words is also indexed as a single token with the space removed. For example, if a field in the document is "some phrase", the tokens will include "some", "somephrase", and "phrase", allowing a user to search for "somephrase" without the space and still match that document. However, I think I'm misunderstanding exactly how this filter works, as I'm not seeing the behavior I expect when I use a simple_query_string to match on this field.
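For illustration, the empty-separator shingle behavior I'm describing can be sketched in Python (`shingle_tokens` is a hypothetical helper for this post, not part of any Elasticsearch client):

```python
def shingle_tokens(tokens, separator=""):
    """Emit each unigram, interleaved with the bigram shingle it starts,
    roughly mimicking a shingle filter with max_shingle_size 2."""
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 < len(tokens):
            # Join adjacent tokens with the separator ("" removes the space).
            out.append(token + separator + tokens[i + 1])
    return out

print(shingle_tokens(["some", "phrase"]))  # ['some', 'somephrase', 'phrase']
```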

I have the partial index mapping (other fields, analyzers, and filters stripped out for clarity)

{
  "settings": {
    ...,
    "analysis": {
      "analyzer": {
        ...,
        "company_name_analyzer_shingled": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "company_suffix_synonym_filter",
            "company_ownership_filter",
            "company_alias_filter",
            "shingle_filter"
          ],
          "char_filter": [
            "apostrophes"
          ]
        },
        ...
      },
      "filter": {
        ...,
        "company_alias_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_aliases.json"
          }
        },
        "company_ownership_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_ownership.json"
          }
        },
        "company_suffix_synonym_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_suffixes.json"
          }
        },
        ...,
        "shingle_filter": {
          "type": "shingle",
          "token_separator": ""
        }
      },
      "char_filter": {
        ...,
        "apostrophes": {
          "type": "mapping",
          "mappings": [
            "\\u2018=>",
            "\\u2019=>",
            "\\u201B=>",
            "\\u0027=>"
          ]
        },
        ...
      }
    }
  },
  "mappings": {
    "product_root": {
      "properties": {
        ...,
        "products": {
          "type": "nested",
          "properties": {
            ...,
            "manufacturer": {
              "type": "object",
              "properties": {
                ...,
                "name": {
                  "type": "text",
                  "analyzer": "company_name_analyzer",
                  "fields": {
                    ...,
                    "shingled": {
                      "type": "text",
                      "analyzer": "company_name_analyzer_shingled"
                    }
                  }
                },
                ...
              }
            },
            ...
          }
        },
        ...
      }
    }
  }
}

One of the documents I indexed looks like

/srv # curl http://elasticsearch-master:9200/product_roots/product_root/1330748454?pretty
{
  ...
  "_source" : {
    ...,
    "products" : [
      {
        ...,
        "manufacturer" : {
          ...,
          "name" : "Celltreat",
          ...
        },
        ...
      }
    ],
    ...
  }
}

Running the relevant analyzer on that text gives me

/srv # curl --data '{"analyzer":"company_name_analyzer_shingled","text":"Celltreat"}' http://elasticsearch-master:9200/product_roots/_analyze?pretty
{
  "tokens" : [
    {
      "token" : "celltreat",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

However, running this search with an extra space does not return that document.

{
  "query": {
    "bool": {
      "must": {
        "nested": {
          "path": "products",
          "query": {
            "bool": {
              "must": {
                "simple_query_string": {
                  "fields": [
                    "products.manufacturer.name.shingled"
                  ],
                  "default_operator": "OR",
                  "flags": "OR|AND|NOT|PHRASE|PRECEDENCE|ESCAPE|WHITESPACE|FUZZY",
                  "lenient": true,
                  "query": "Cell treat"
                }
              }
            }
          }
        }
      }
    }
  }
}

Analyzing the query using the same analyzer gives these tokens.

/srv # curl --data '{"analyzer":"company_name_analyzer_shingled","text":"Cell treat"}' http://elasticsearch-master:9200/product_roots/_analyze?pretty
{
  "tokens" : [
    {
      "token" : "cell",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "celltreat",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "treat",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Because they both contain the token "celltreat", I expected that query to match the field, but when I run it Elasticsearch returns no matching documents. I'm sure this is just something I'm misunderstanding about the shingle filter, but I'm hoping someone can point me to exactly what that is so I can fix it.

Thank you.

Hi Andrew. See Query connected and seperated words (ie: everybody - every body)

Thanks, Mark. It's possible I'm just not experienced enough with this forum to see it, but I don't see where that post describes what I'm running into. It describes the solution I have already attempted, but it doesn't appear to address the problems I've encountered with it. I also looked at the linked GitHub issue, but I'm using simple_query_string with a default OR operator rather than match_phrase with AND, so I think my issue is still slightly different.

If you have any further insight I would greatly appreciate it.

I took a look at the simple_query_string docs.

The WHITESPACE flag in your request is causing the issue:

WHITESPACE
Enables whitespace as split characters.
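With that flag enabled, simple_query_string splits the query text on whitespace before analysis, so "Cell" and "treat" go through the analyzer separately and the "celltreat" shingle is never produced. Dropping WHITESPACE from the flags list should let the whole string reach the shingled analyzer; a sketch of the relevant fragment of your query with that one change:

```json
{
  "simple_query_string": {
    "fields": [
      "products.manufacturer.name.shingled"
    ],
    "default_operator": "OR",
    "flags": "OR|AND|NOT|PHRASE|PRECEDENCE|ESCAPE|FUZZY",
    "lenient": true,
    "query": "Cell treat"
  }
}
```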

Wow, now that you point it out it seems so obvious that I should have recognized that flag as being suspicious and investigated more deeply. Taking that off fixed my issue. Thank you!
