Boosting by number of matched terms


(Alexey Danilov) #1

Dear community, I have two questions regarding elasticsearch which keep me from sleeping at night. Hopefully, there will be a kind person to restore my sleep and well-being :smile:

These questions are as follows:
1). How can one boost a query by the number of matched terms?
2). Is it possible to return the parts of the input text that matched?

Requirements: the query should account for possibly misspelled words (hence fuzzy), and the expected words might not be close to each other (sloppiness).

This is the data I am operating upon:

create index

POST /suggestions

create mappings

> PUT /suggestions/_mapping/sample
> {
>   "properties": {
>    "text": {
>       "type": "string"
>    },
>    "suggestion": {
>       "type": "completion"
>    }
>   }
> }

put some sample data

> PUT suggestions/sample/1
> {
>   "text" : "West Minister Abbey",
>   "suggestion" : "West Minister Abbey"
> }
> PUT suggestions/sample/2
> {
>   "text" : "Road West Corlitz",
>   "suggestion" : "Road West Corlitz"
> }
> PUT suggestions/sample/3
> {
>   "text" : "Westly Park",
>   "suggestion" : "Westly Park"
> }
> PUT suggestions/sample/4
> {
>   "text" : "Wes Square",
>   "suggestion" : "Wes Square"
> }

query time!

> GET /suggestions/sample/_search
> {
>   "query": { 
>     "match": {
>       "text": {
>         "query": "somewhere close to park in abby west",
>         "fuzziness": "AUTO"
>       }
>     }
>   },
>    "highlight": {
>     "fields" : {
>         "text" : {
>         }
>     }
>   }
> }

#### Results (not really what I would want to see)

     {
        "_index": "suggestions",
        "_type": "sample",
        "_id": "3",
        "_score": 0.08928572,
        "_source": {
           "text": "Westly Park",
           "suggestion": "Westly Park"
        },
        "highlight": {
           "text": [
              "Westly <em>Park</em>"
           ]
        }
     },
     {
        "_index": "suggestions",
        "_type": "sample",
        "_id": "1",
        "_score": 0.061370566,
        "_source": {
           "text": "West Minister Abbey",
           "suggestion": "West Minister Abbey"
        },
        "highlight": {
           "text": [
             "<em>West</em> Minister <em>Abbey</em>"
           ]
        }
     },
     {
        "_index": "suggestions",
        "_type": "sample",
        "_id": "2",
        "_score": 0.059432168,
        "_source": {
           "text": "Road West Corlitz",
           "suggestion": "Road West Corlitz"
        },
        "highlight": {
           "text": [
              "Road <em>West</em> Corlitz"
           ]
        }
     },
     {
        "_index": "suggestions",
        "_type": "sample",
        "_id": "4",
        "_score": 0.049526803,
        "_source": {
           "text": "Wes Square",
           "suggestion": "Wes Square"
        },
        "highlight": {
           "text": [
              "<em>Wes</em> Square"
           ]
        }
     }

The desired output is "West Minister Abbey", but it is scored lower than "Westly Park", and I can take only the first suggestion: the application should not have any sorting logic, that is completely up to elasticsearch.

So what are the possible solutions?

Issue #1: the way I see it, one way is to boost hits that have more matching terms than others, even if those terms have less weight in the inverted index. I've seen a couple of posts here and on other resources asking specifically this question; alas, no responses were given. Hopefully, I'll have better luck than those fellas? :slight_smile:
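
A minimal sketch of that idea, not a confirmed solution: give each input term its own `should` clause and wrap it in `constant_score`, so that every matching term contributes the same fixed amount regardless of its weight in the index. This assumes the application splits the input into terms itself, and an Elasticsearch version where `constant_score` takes the wrapped query under `filter` (2.x and later; on 1.x it would go under `query` instead). Only three of the input terms are shown.

```
GET /suggestions/sample/_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
            "filter": { "match": { "text": { "query": "park", "fuzziness": "AUTO" } } },
            "boost": 1 } },
        { "constant_score": {
            "filter": { "match": { "text": { "query": "abby", "fuzziness": "AUTO" } } },
            "boost": 1 } },
        { "constant_score": {
            "filter": { "match": { "text": { "query": "west", "fuzziness": "AUTO" } } },
            "boost": 1 } }
      ]
    }
  }
}
```

With equal boosts, a hit matching two clauses (such as "West Minister Abbey" for "abby" and "west") should score above a hit matching only one. Absolute scores depend on the version's query normalization and coordination factor, but the ordering by number of matched clauses should hold.
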
Issue #2: still a mystery. It is quite easy to see which terms of the document matched via highlighting, but which terms of the input caused this match? Given an input phrase like "somewhere close to park in abby west" and the resulting match "West Minister Abbey", I want to know precisely that "abby west" was the match, so that I can transform my query to "somewhere close to park in West Minister Abbey". As you might see, the required functionality is a bit like the suggesters API, except I've tried both the term and phrase suggesters and found them not really acceptable for complex queries like this, mainly because no sloppiness is allowed (and providing numerous input variations is not an option in my case: they are too many and too different).
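
One possible way to find out which input terms caused a hit, sketched here rather than verified against this index: Elasticsearch supports named queries, where a clause carries a `_name` and each hit lists the names of the clauses it matched in a `matched_queries` array. The sketch assumes the application builds one `match` clause per input term; the names (`park`, `abby`, `west`) are simply the input terms, chosen for illustration.

```
GET /suggestions/sample/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "text": { "query": "park", "fuzziness": "AUTO", "_name": "park" } } },
        { "match": { "text": { "query": "abby", "fuzziness": "AUTO", "_name": "abby" } } },
        { "match": { "text": { "query": "west", "fuzziness": "AUTO", "_name": "west" } } }
      ]
    }
  }
}
```

For the "West Minister Abbey" hit, `matched_queries` would be expected to contain `abby` and `west`. It reports which clauses matched, not where those terms sit in the input, so working out the span to replace is still left to the caller.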

Any hint or, better yet, display (and sharing!) of elasticsearch magic is warmly welcome.

