How to count matched phrases with slop

I am currently using a scripted similarity to count phrase occurrences in matched documents.

"similarity": {
    "my_scripted_formula": {
        "type": "scripted",
        "script": {"source": "double tf = doc.freq; return query.boost * tf;"}
    }
}

To get the number of phrase matches, I divide the score by the number of words in the phrase.

For example, with these documents:


docs = [{'call_id': 30,
         'transcript': 'abcd def ght errr ght def'},
        {'call_id': 31,
         'transcript': 'def gaaa ddf ght'}]

Search body:

body={'_source': ['call_id', 'transcript'],
 'query': {'bool': {
        'minimum_should_match': 1,
        'should': [{'match_phrase': {'transcript': 'def ght'}},
                   {'match_phrase': {'transcript': 'def gaaa'}},
                   {'match_phrase': {'transcript': 'gaaa ddf'}}]}}}

Search result:

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},
 'hits': {'hits': [{'_id': '31',
                    '_index': 'test-index',
                    '_score': 4.0,
                    '_source': {'call_id': 31,
                                'transcript': 'def gaaa ddf ght'},
                    '_type': '_doc'},
                   {'_id': '30',
                    '_index': 'test-index',
                    '_score': 2.0,
                    '_source': {'call_id': 30,
                                'transcript': 'abcd def ght errr ght def'},
                    '_type': '_doc'}],
          'max_score': 4.0,
          'total': {'relation': 'eq', 'value': 2}},
 'timed_out': False,
 'took': 3}

Dividing each score by the number of words in the phrase (2.0 / 2 and 4.0 / 2) gives the number of matches: 1 and 2.
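The division above can be sketched in Python as follows. This is only an illustration: `phrase_match_counts` is a hypothetical helper name, and it assumes every phrase in the query has the same number of words and the default boost of 1.

```python
def phrase_match_counts(hits, words_per_phrase):
    """Turn scripted-similarity scores into phrase match counts.

    Assumes each phrase match adds `words_per_phrase` to the score,
    as observed in the search result above.
    """
    counts = {}
    for hit in hits:
        # round() guards against small floating-point noise in _score
        counts[hit['_id']] = round(hit['_score'] / words_per_phrase)
    return counts


# Only the fields needed here, copied from the search result above
hits = [{'_id': '31', '_score': 4.0},
        {'_id': '30', '_score': 2.0}]

counts = phrase_match_counts(hits, words_per_phrase=2)
print(counts)
```

With phrases of different lengths in one query this simple division would no longer work, because the score would mix contributions of different sizes.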

But when I use a span query with slop, I get scores that I cannot convert to counts this way.
Search body:

body = {'_source': ['call_id', 'transcript'],
        'query': {'bool': {'should': [
            {'span_near': {
                'clauses': [{'span_term': {'transcript': 'def'}},
                            {'span_term': {'transcript': 'ght'}}],
                'in_order': False,
                'slop': 2}}]}}}

Search result:

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},
 'hits': {'hits': [{'_id': '30',
                    '_index': 'test-index',
                    '_score': 1.7333333,
                    '_source': {'call_id': 30, 'transcript': 'abcd def ght errr ght def'},
                    '_type': '_doc'},
                   {'_id': '31',
                    '_index': 'test-index',
                    '_score': 0.4,
                    '_source': {'call_id': 31, 'transcript': 'def gaaa ddf ght'},
                    '_type': '_doc'}],
          'max_score': 1.7333333,
          'total': {'relation': 'eq', 'value': 2}},
 'timed_out': False,
 'took': 4}

I found out that doc.freq in my formula is not an integer for this query and that it depends on the order of the words in the text, but I don't know how to disable this behavior.
So the question is: how do I count matched phrases with slop? What scripted formula should I use?
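
For what it is worth, the two span scores look consistent with a sloppy frequency in which each match contributes 1 / (1 + width) instead of 1. The sketch below only checks that reading against the numbers above. The match widths are my own hand-derived assumption (end position minus start position of each matching window), not values returned by Elasticsearch, and the factor 2 is the number of span terms, as in the phrase case.

```python
from fractions import Fraction


def sloppy_freq(widths):
    """Sum 1 / (1 + width) over the assumed match widths."""
    return sum(Fraction(1, 1 + w) for w in widths)


# Assumed widths of the unordered 'def' / 'ght' matches in each transcript.
# These are guesses made by hand, not output of the search API.
assumed_widths = {
    30: [2, 4, 2],  # 'abcd def ght errr ght def'
    31: [4],        # 'def gaaa ddf ght'
}

NUM_TERMS = 2  # 'def' and 'ght'


def predicted_scores(widths_by_doc, num_terms=NUM_TERMS):
    """Score predicted by 'query.boost * doc.freq' summed over the terms."""
    return {doc_id: float(num_terms * sloppy_freq(widths))
            for doc_id, widths in widths_by_doc.items()}


scores = predicted_scores(assumed_widths)
print(scores)  # roughly 1.7333 for doc 30 and 0.4 for doc 31
```

If this reading is right, the score cannot be turned back into an integer count by a simple division, because the individual match widths are not available in the script.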
