Strategy for matching unstructured text to phrases in index

rustunooldu · July 4, 2023, 5:34pm

I'm trying to extract data from product descriptions, and I have the catalog data indexed and categorized. For example, I have a color name index (containing all possible colors for the product), and I want to find the best color match for a given query string:

This is a 2017 pathfinder in beautiful arctic blue trim.

In other words, rather than searching for a phrase in indexed documents, phrases themselves are in the index, and my query input is the full text. What would be the best approach to solve this? Sounds pretty easy at first glance, but so far I couldn't find a query that works for me.

A standard match query doesn't do a good job because it's too broad: it matches a bunch of unrelated terms, which skews the score.
I also tried tokenizing the input and matching multiple terms with a span_near query to keep the term order, but span_near requires every clause to match, so it returns no results. For example:

{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "name": "this" } },
        { "span_term": { "name": "2017" } },
        { "span_term": { "name": "pathfinder" } },
        { "span_term": { "name": "beautiful" } },
        { "span_term": { "name": "arctic" } },
        { "span_term": { "name": "blue" } },
        { "span_term": { "name": "trim" } }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}

more_like_this looks promising, but it has limited configuration options and the Lucene scoring formula works against me. In this specific case, for example, there is actually a color named "beautiful blue", which scores higher than "arctic blue".

{
  "query": {
    "more_like_this" : {
      "fields" : ["color"],
      "like" : "This is a 2017 pathfinder in beautiful arctic blue trim.",
      "min_term_freq" : 1
    }
  }
}

I feel like I'm missing something, because this should be a simple substring search. What I need is the exact functionality of match_phrase, except with the roles of query and index reversed.
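
As a rough illustration of the behavior described above, here is a minimal sketch in plain Python, not Elasticsearch. It only pins down the intended semantics: a catalog phrase counts as a hit when its tokens appear in the text contiguously and in order. The function names and the simple lowercase/alphanumeric tokenizer are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrases_in_text(text, phrases):
    """Return the phrases whose tokens occur contiguously, in order, in text."""
    tokens = tokenize(text)
    hits = []
    for phrase in phrases:
        wanted = tokenize(phrase)
        n = len(wanted)
        # Slide a window of the phrase's length across the text tokens.
        if n and any(tokens[i:i + n] == wanted
                     for i in range(len(tokens) - n + 1)):
            hits.append(phrase)
    return hits


if __name__ == "__main__":
    text = "This is a 2017 pathfinder in beautiful arctic blue trim."
    print(phrases_in_text(text, ["arctic blue", "beautiful blue", "red"]))
```

With this definition, "arctic blue" is a hit for the sample sentence while "beautiful blue" is not, because "arctic" sits between its two words.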

stephenb · July 4, 2023, 6:26pm

Hi @rustunooldu

Not my area of expertise, but I think you may be describing a percolate query.

rustunooldu · July 4, 2023, 7:39pm

Hi @stephenb
I actually saw the percolate query when I was looking for a solution, but I must have misinterpreted how it works and didn't pay much attention to it. Turns out it's exactly what I needed, thank you for the suggestion.
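
For readers landing here later, a minimal sketch of how a percolator setup for this case might look. The thread does not show the final solution, so everything below is an assumption: the index name `color-percolator`, the field names `query`, `description`, and `color`, and the choice to store one match_phrase query per catalog color. The Python code only builds the request bodies as dicts. It does not talk to a cluster.

```python
import json

# Hypothetical index name; not taken from the thread.
INDEX = "color-percolator"

# Mapping: the `percolator` field stores a query per document, and
# `description` is mapped because the stored queries refer to it.
mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "description": {"type": "text"},
            "color": {"type": "keyword"},
        }
    }
}


def color_doc(color_name):
    """Document for one catalog color, holding a match_phrase query for it."""
    return {
        "color": color_name,
        "query": {"match_phrase": {"description": color_name}},
    }


def percolate_request(text):
    """Search body asking which stored queries match the given text."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"description": text},
            }
        }
    }


if __name__ == "__main__":
    text = "This is a 2017 pathfinder in beautiful arctic blue trim."
    print(json.dumps(mapping, indent=2))
    print(json.dumps(color_doc("arctic blue"), indent=2))
    print(json.dumps(percolate_request(text), indent=2))
```

Under these assumptions, the mapping would be sent when creating the index, each `color_doc(...)` body would be indexed as its own document, and the `percolate_request(...)` body would be sent to the index's `_search` endpoint. Since match_phrase defaults to a slop of 0, a stored "beautiful blue" query should not match "beautiful arctic blue", while "arctic blue" should.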

