Finding parts of a user query that were not matched

I am working on a project where the user is provided with suggest-as-you-type functionality. The suggestions come from a separate suggest index, which is built from query history and predefined product categories. After the user selects a suggestion, they are redirected to a search page where the corresponding filters (product categories) are selected and queried against another index. For the categories this is an exact term query, but other parts might use text search. I am looking for a way to find the tokens in the user query that could not be matched against a suggestion, so that users can search for certain keywords within a category.

As an example (this is not my real use case), let's suppose I want a user to search for products. These products have properties that we could auto-suggest, like product category and manufacturer. For titles and descriptions, however, I want to use a more full-text-based approach, and due to a lack of available user search queries I can't just extract possible terms from there.

Consider the following auto-suggest index (all queries are executed on ES 6.1):

PUT /search-suggest
{
    "settings": {
        "number_of_shards": 1, 
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

It has the following mapping, which stores the category and manufacturer as properties of each autocomplete suggestion. These can be used to set filters on the user's search:

PUT /search-suggest/_mapping/search-suggest
{
    "search-suggest": {
        "properties": {
            "name": {
                "type":     "text",
                "analyzer": "autocomplete"
            },
            "category": {
                "type": "keyword"
            },
            "manufacturer": {
                "type": "keyword"
            }
        }
    }
}

And consider the following example suggestion documents:

POST /search-suggest/search-suggest/_bulk
{ "index": { "_id": 1            }}
{ "name": "Gaming", "category":"gaming"     }
{ "index": { "_id": 2            }}
{ "name": "Sony in gaming", "category":"gaming", "manufacturer": "sony"    }

We can query it like this:

GET /search-suggest/search-suggest/_search
{
    "query": {
        "match": {
            "name": {
              "query": "Sny playstation",
              "fuzziness": "auto"
            }
        }
    }
}

Which would return the following document:

 {
      "name": "Sony in gaming",
      "category": "gaming",
      "manufacturer": "sony"
 }

Now I am looking for a way to determine that "Sny" has been matched in my query and is caught by the filter manufacturer:sony. I also want to find out that "playstation" has not been matched, so that I can use it as a free-text search (this makes more sense in my actual use case). The result would be something like: search for "playstation" in products with manufacturer sony.

I figured that I could try to use highlighting to find the matched keywords, then use the difference between the query and the matched keywords to find the unmatched tokens:

GET /search-suggest/search-suggest/_search
{
    "query": {
        "match": {
            "name": {
              "query": "Sny playstation",
              "fuzziness": "auto"
            }
        }
    },
    "highlight": {
        "fields":{
          "name": {
            "type": "plain"
          }
        }
    }
}

This results in the following highlight; however, I cannot easily map the fuzzily matched token ("Sony") back to the query term ("Sny"):

"highlight": {
          "name": [
            "<em>Sony</em> in gaming"
          ]
        }
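
To make the "difference" step concrete, here is a rough client-side sketch in Python. It is only an illustration under several assumptions: the helper names (edit_distance, auto_fuzziness, unmatched_tokens) are made up for this post, the query is split on whitespace rather than run through the index analyzer, and plain Levenshtein distance stands in for the fuzzy matching Elasticsearch performs (which by default also counts a transposition as a single edit). The thresholds follow the documented behaviour of "fuzziness": "auto" (0 edits for 1-2 characters, 1 for 3-5, 2 for longer terms).

```python
import re


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def auto_fuzziness(term):
    """Allowed edits for a term, mirroring fuzziness "auto"."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def unmatched_tokens(query, fragments):
    """Return the query tokens with no fuzzy counterpart among the <em> terms."""
    highlighted = {
        term.lower()
        for fragment in fragments
        for term in re.findall(r"<em>(.*?)</em>", fragment)
    }
    leftover = []
    for token in query.lower().split():
        allowed = auto_fuzziness(token)
        if not any(edit_distance(token, h) <= allowed for h in highlighted):
            leftover.append(token)
    return leftover
```

Calling unmatched_tokens("Sny playstation", ["<em>Sony</em> in gaming"]) leaves only "playstation", which could then be passed on as the free-text part of the search. Note that this does not account for the edge n-grams in the index, so prefix-only matches would need extra handling.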

Does anybody have an idea about how I could solve this, or know of any other methods that might work? For my use case I have also been thinking about using the significant_text aggregation to extract important keywords for each category.

Thanks in advance.

That's probably:

  1. tough for you to parse out
  2. slow to execute
  3. non-exhaustive - you wouldn't want to subject ALL matching docs to this analysis

I'd be tempted to run the query in a fuzzy fashion (ORed terms, fuzzy matching or n-grams, an optional phrase query) and use the sampler agg to consider only the best-matching hits from your search history. Under the sampler aggregation I'd use the significant_terms agg to look at structured fields (clicked product codes/manufacturers/departments), which will help identify useful query refinement suggestions of varying granularity.
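
As a rough sketch of what such a request could look like, the Python helper below builds the search body as a plain dict. The field names (query_text, clicked_manufacturer, clicked_department), the aggregation names and the helper itself are assumptions for illustration; they are not taken from your mapping, and the sample size would need tuning.

```python
def build_refinement_request(user_query, sample_size=200):
    """Build a search body: loose match + sampler + significant_terms."""
    return {
        "size": 0,  # only the aggregations are of interest here
        "query": {
            "match": {
                "query_text": {            # hypothetical search-history field
                    "query": user_query,
                    "operator": "or",      # ORed terms
                    "fuzziness": "AUTO",
                }
            }
        },
        "aggs": {
            "best_hits": {
                # restrict the analysis to the top-scoring docs per shard
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "manufacturers": {
                        "significant_terms": {"field": "clicked_manufacturer"}
                    },
                    "departments": {
                        "significant_terms": {"field": "clicked_department"}
                    },
                },
            }
        },
    }
```

The resulting dict can be serialised with json.dumps and sent as the body of a _search request against the search-history index; the significant_terms buckets under best_hits are the candidate refinements.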

Rather than treating the suggested terms as a flat list, it can make sense to organise them further into hierarchies or groups. If you have an ambiguous query like "mixer", you can expect a mix of departments, manufacturers, etc. relating to both DJ mixers and food mixers. It can be helpful if your app tries to identify these potential ambiguities and perhaps offers "did you mean mixer as in DJ?" or similar.
Ambiguities like this can be seen by looking at the connections between the selected significant terms. You can use the adjacency_matrix agg or the Graph API to plot these terms and see whether you have a single coherent island of related concepts or distinct "islands" of terms that represent different interpretations of the query.
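
If you would rather detect the islands in code than by eye, something like the following Python sketch could work. It assumes each adjacency_matrix filter is named after the term it tests, so bucket keys look like "dj" or "audio&dj" ("&" being the default separator), and it groups connected terms with a small union-find. The function name find_islands and the sample terms are made up for illustration.

```python
def find_islands(buckets, separator="&"):
    """Group terms into connected components based on co-occurrence buckets."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for bucket in buckets:
        if bucket.get("doc_count", 0) == 0:
            continue  # no documents, so no evidence of a connection
        terms = bucket["key"].split(separator)
        find(terms[0])  # register single terms too
        for other in terms[1:]:
            union(terms[0], other)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return sorted((sorted(group) for group in groups.values()),
                  key=lambda group: group[0])
```

For "mixer", buckets connecting "dj" with "audio" and "kitchen" with "food", but nothing across the two pairs, would come back as two separate islands, which is the cue to ask the user which sense they meant.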
