I'm trying to extract data from product descriptions, and I have the catalog data indexed and categorized. For example, I have a color name index (containing all possible colors for the product), and I want to find the best color match for a given query string:
This is a 2017 pathfinder in beautiful arctic blue trim.
In other words, rather than searching for a phrase in indexed documents, the phrases themselves are in the index, and my query input is the full text. What would be the best approach to solve this? It sounds pretty easy at first glance, but so far I haven't found a query that works for me.
- A standard match query doesn't do a good job because it's too broad: it matches a bunch of unrelated terms, which skews the score.
- Tried tokenizing the input and matching multiple terms with a span_near query to preserve term order, but it requires every clause to match and therefore returns no results. For example:

    "query": {
      "span_near": {
        "clauses": [
          { "span_term": { "name": "this" } },
          { "span_term": { "name": "2017" } },
          { "span_term": { "name": "pathfinder" } },
          { "span_term": { "name": "beautiful" } },
          { "span_term": { "name": "arctic" } },
          { "span_term": { "name": "blue" } },
          { "span_term": { "name": "trim" } }
        ],
        "slop": 0,
        "in_order": true
      }
    }
- more_like_this looks promising, but it has limited configuration options and Lucene's scoring formula works against me (for example, in this specific case there is actually a color named "beautiful blue", which scores higher than "arctic blue").

    "query": {
      "more_like_this": {
        "fields": ["color"],
        "like": "This is a 2017 pathfinder in beautiful arctic blue trim.",
        "min_term_freq": 1
      }
    }
I feel like I'm missing something, because this should be a simple substring search. What I need is exactly the functionality of match_phrase, except with the roles of query and index reversed.
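
To make the desired behavior concrete, here is a minimal Python sketch of what I'm after, written outside Elasticsearch. The color list and the `tokenize` and `find_phrases` helpers are made up for illustration, and the tokenizer only roughly approximates what a standard analyzer would do. An indexed phrase counts as a hit only if all of its tokens appear contiguously and in order in the input text, and longer phrases are preferred:

```python
import re


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters
    # (a rough stand-in for an analyzer, not the real thing).
    return re.findall(r"[a-z0-9]+", text.lower())


def find_phrases(text, phrases):
    """Return the phrases whose tokens occur contiguously and in order
    inside `text`, with the longest phrases first."""
    tokens = tokenize(text)
    hits = []
    for phrase in phrases:
        phrase_tokens = tokenize(phrase)
        n = len(phrase_tokens)
        if n == 0:
            continue
        # Slide a window of n tokens over the text and compare.
        if any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1)):
            hits.append(phrase)
    # Prefer phrases with more tokens, so "arctic blue" beats plain "blue".
    return sorted(hits, key=lambda p: len(tokenize(p)), reverse=True)


# Hypothetical contents of the color index.
colors = ["arctic blue", "beautiful blue", "blue", "midnight black"]
text = "This is a 2017 pathfinder in beautiful arctic blue trim."

matches = find_phrases(text, colors)
print(matches)
```

With this input, "beautiful blue" is not a hit because "arctic" sits between its two tokens, which is the behavior I can't get out of more_like_this.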