Shingle filter to allow mismatching spaces

I am trying to solve a problem where users sometimes include an extra space in their search terms, or omit a space that is present in the indexed text. To handle this, I attempted to use the shingle filter with an empty token separator, so that each pair of adjacent words is also indexed as a single token with the space removed. For example, if a field in the document is "some phrase", the tokens will include "some", "somephrase", and "phrase", allowing a user to search for "somephrase" without the space and still match that document. However, I think I'm misunderstanding exactly how this filter works, as I'm not seeing the behavior I expect when I use a simple_query_string to match on this field.
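For illustration, the empty-separator shingle behavior I'm describing can be sketched in Python (`shingle_tokens` is a hypothetical helper for this post, not part of any Elasticsearch client):

```python
def shingle_tokens(tokens, separator=""):
    """Emit each unigram, interleaved with the bigram shingle it starts,
    roughly mimicking a shingle filter with max_shingle_size 2."""
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 < len(tokens):
            # Join adjacent tokens with the separator ("" removes the space).
            out.append(token + separator + tokens[i + 1])
    return out

print(shingle_tokens(["some", "phrase"]))  # ['some', 'somephrase', 'phrase']
```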

I have the partial index mapping (other fields, analyzers, and filters stripped out for clarity)

{
  "settings": {
    ...,
    "analysis": {
      "analyzer": {
        ...,
        "company_name_analyzer_shingled": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "company_suffix_synonym_filter",
            "company_ownership_filter",
            "company_alias_filter",
            "shingle_filter"
          ],
          "char_filter": [
            "apostrophes"
          ]
        },
        ...
      },
      "filter": {
        ...,
        "company_alias_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_aliases.json"
          }
        },
        "company_ownership_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_ownership.json"
          }
        },
        "company_suffix_synonym_filter": {
          "type": "synonym",
          "synonyms": {
            "$ref": "../synonyms/company_suffixes.json"
          }
        },
        ...,
        "shingle_filter": {
          "type": "shingle",
          "token_separator": ""
        }
      },
      "char_filter": {
        ...,
        "apostrophes": {
          "type": "mapping",
          "mappings": [
            "\\u2018=>",
            "\\u2019=>",
            "\\u201B=>",
            "\\u0027=>"
          ]
        },
        ...
      }
    }
  },
  "mappings": {
    "product_root": {
      "properties": {
        ...,
        "products": {
          "type": "nested",
          "properties": {
            ...,
            "manufacturer": {
              "type": "object",
              "properties": {
                ...,
                "name": {
                  "type": "text",
                  "analyzer": "company_name_analyzer",
                  "fields": {
                    ...,
                    "shingled": {
                      "type": "text",
                      "analyzer": "company_name_analyzer_shingled"
                    }
                  }
                },
                ...
              }
            },
            ...
          }
        },
        ...
      }
    }
  }
}

One of the documents I indexed looks like

/srv # curl http://elasticsearch-master:9200/product_roots/product_root/1330748454?pretty
{
  ...
  "_source" : {
    ...,
    "products" : [
      {
        ...,
        "manufacturer" : {
          ...,
          "name" : "Celltreat",
          ...
        },
        ...
      }
    ],
    ...
  }
}

Running the relevant analyzer on that text gives me

/srv # curl --data '{"analyzer":"company_name_analyzer_shingled","text":"Celltreat"}' http://elasticsearch-master:9200/product_roots/_analyze?pretty
{
  "tokens" : [
    {
      "token" : "celltreat",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

However, running this search with an extra space does not return that document.

{
  "query": {
    "bool": {
      "must": {
        "nested": {
          "path": "products",
          "query": {
            "bool": {
              "must": {
                "simple_query_string": {
                  "fields": [
                    "products.manufacturer.name.shingled"
                  ],
                  "default_operator": "OR",
                  "flags": "OR|AND|NOT|PHRASE|PRECEDENCE|ESCAPE|WHITESPACE|FUZZY",
                  "lenient": true,
                  "query": "Cell treat"
                }
              }
            }
          }
        }
      }
    }
  }
}

Analyzing the query using the same analyzer gives these tokens.

/srv # curl --data '{"analyzer":"company_name_analyzer_shingled","text":"Cell treat"}' http://elasticsearch-master:9200/product_roots/_analyze?pretty
{
  "tokens" : [
    {
      "token" : "cell",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "celltreat",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "treat",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Because they both contain the token "celltreat", I expected that query to match the field, but when I run it Elasticsearch returns no matching documents. I'm sure this is just something I'm misunderstanding about the shingle filter, but I'm hoping someone can point me to exactly what that is so I can fix it.

Thank you.

Hi Andrew. See Query connected and seperated words (ie: everybody - every body)

Thanks, Mark. It's possible I'm just not experienced enough with this forum to see it, but I don't see where that post describes what I'm running into. It describes the solution I have already attempted, but it doesn't appear to address the problems I've encountered with it. I also looked at the linked GitHub issue, but I'm using simple_query_string with a default OR operator rather than match_phrase with AND, so I think my issue is still slightly different.

If you have any further insight I would greatly appreciate it.

I took a look at the simple_query_string docs.

The WHITESPACE flag in your request is causing the issue:

WHITESPACE
Enables whitespace as split characters.
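With that flag enabled, simple_query_string splits the query text on whitespace before analysis, so "Cell" and "treat" go through the analyzer separately and the "celltreat" shingle is never produced. Dropping WHITESPACE from the flags list should let the whole string reach the shingled analyzer; a sketch of the relevant fragment of your query with that one change:

```json
{
  "simple_query_string": {
    "fields": [
      "products.manufacturer.name.shingled"
    ],
    "default_operator": "OR",
    "flags": "OR|AND|NOT|PHRASE|PRECEDENCE|ESCAPE|FUZZY",
    "lenient": true,
    "query": "Cell treat"
  }
}
```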

Wow, now that you point it out it seems so obvious that I should have recognized that flag as being suspicious and investigated more deeply. Taking that off fixed my issue. Thank you!
