Strategy for matching unstructured text to phrases in index

I'm trying to extract data from product descriptions, and I have the catalog data indexed and categorized. For example, I have a color name index (containing all possible colors for the product), and I want to find the best color match for a given query string:

This is a 2017 pathfinder in beautiful arctic blue trim.

In other words, rather than searching for a phrase in indexed documents, phrases themselves are in the index, and my query input is the full text. What would be the best approach to solve this? Sounds pretty easy at first glance, but so far I couldn't find a query that works for me.

  • A standard match query doesn't do a good job because it's too broad; it matches a bunch of unrelated terms that affect the score.
  • I tried tokenizing the input and matching multiple terms with a span_near query to keep the term order, but it requires every term to match and therefore returns no results. For example:
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "name": "this" } },
        { "span_term": { "name": "2017" } },
        { "span_term": { "name": "pathfinder" } },
        { "span_term": { "name": "beautiful" } },
        { "span_term": { "name": "arctic" } },
        { "span_term": { "name": "blue" } },
        { "span_term": { "name": "trim" } }
      ],
      "slop": 0,
      "in_order": true
    }
  }
  • more_like_this looks promising, but it has limited configuration options and the Lucene scoring formula works against me (for example, in this specific case there is actually a color named "beautiful blue", which scores higher than "arctic blue").
  "query": {
    "more_like_this" : {
      "fields" : ["color"],
      "like" : "This is a 2017 pathfinder in beautiful arctic blue trim.",
      "min_term_freq" : 1
    }
  }

I feel like I'm missing something, because this should be a simple substring search. What I need is the exact functionality of match_phrase, except with the roles of query and index reversed.
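
To make the desired behavior concrete, here is a minimal plain-Python sketch of the "reversed match_phrase" idea, run in memory rather than in Elasticsearch. The tokenizer is only a rough stand-in for an analyzer, and the colors "blue" and "midnight black" are hypothetical entries added for illustration; only "beautiful blue" and "arctic blue" come from the question.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def find_phrases(text, phrases):
    """Return the indexed phrases whose tokens occur contiguously, in order, in text."""
    tokens = tokenize(text)
    hits = []
    for phrase in phrases:
        wanted = tokenize(phrase)
        n = len(wanted)
        # Slide a window of n tokens over the text and compare it to the phrase.
        if n and any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)):
            hits.append(phrase)
    # Prefer longer phrases, so "arctic blue" ranks above plain "blue".
    return sorted(hits, key=lambda p: len(tokenize(p)), reverse=True)

# Hypothetical color index; "blue" and "midnight black" are made-up entries.
colors = ["beautiful blue", "arctic blue", "blue", "midnight black"]
text = "This is a 2017 pathfinder in beautiful arctic blue trim."
matches = find_phrases(text, colors)
print(matches)
```

Note that "beautiful blue" is not returned here, because its two words are not adjacent in the input. That is the difference from the more_like_this result described above.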

Hi @rustunooldu

Not my area of expertise, but I think you may be describing a percolator query.

Hi @stephenb
I actually saw percolate query when I was looking for a solution, but I must have misinterpreted how it works, and I didn't pay much attention to it. Turns out it's exactly what I needed, thank you for the suggestion.
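
For anyone landing here later, this is roughly how the percolator approach could look. It is a sketch under assumptions, not the exact setup used: the field names "query", "description" and "name" are hypothetical, and the functions only build the request bodies (index mapping, one stored document per color, and the search body) without sending them anywhere.

```python
import json

# Hypothetical field names ("query", "description", "name"), chosen for
# illustration; they do not come from the original thread.

def percolator_mapping():
    """Index mapping: a percolator field plus the text field the stored queries target."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},
                "description": {"type": "text"},
                "name": {"type": "keyword"},
            }
        }
    }

def color_document(color_name):
    """One document per color, with the color stored as a match_phrase query."""
    return {
        "name": color_name,
        "query": {"match_phrase": {"description": color_name}},
    }

def percolate_request(text):
    """Search body that runs every stored query against the given free text."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"description": text},
            }
        }
    }

body = percolate_request("This is a 2017 pathfinder in beautiful arctic blue trim.")
print(json.dumps(body, indent=2))
```

The idea is that each color becomes a stored match_phrase query, and the full product description is passed in as the document to percolate, so the colors whose phrase occurs in the text come back as hits.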
