I am currently using a scripted similarity to count phrase occurrences in matched documents.
"similarity": {
  "my_scripted_formula": {
    "type": "scripted",
    "script": {"source": "double tf = doc.freq; return query.boost * tf;"}
  }
}
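For context, here is a minimal sketch of how this similarity might be attached to an index from Python. Only the similarity definition comes from the fragment above; the mapping (a `text` field named `transcript` that uses `my_scripted_formula`) and the name `index_body` are assumptions made for illustration.

```python
import json

# Similarity definition copied from the settings fragment above.
similarity = {
    "my_scripted_formula": {
        "type": "scripted",
        "script": {"source": "double tf = doc.freq; return query.boost * tf;"},
    }
}

# Assumed index body: the mapping is hypothetical, chosen to match the
# 'call_id' and 'transcript' fields used in the documents and queries.
index_body = {
    "settings": {"index": {"similarity": similarity}},
    "mappings": {
        "properties": {
            "call_id": {"type": "integer"},
            "transcript": {"type": "text", "similarity": "my_scripted_formula"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

A dict like this would be sent as the body when creating the index; the exact client call depends on the Elasticsearch client version in use.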
To get the number of phrase matches, I divide the score by the number of words in the phrase.
For example
Documents:
docs = [{
'call_id': 30,
'transcript': 'abcd def ght errr ght def',
},{
'call_id': 31,
'transcript': 'def gaaa ddf ght',
},
]
Search body:
body={'_source': ['call_id', 'transcript'],
'query': {'bool': {
'minimum_should_match': 1,
'should': [{'match_phrase': {'transcript': 'def ght'}},
{'match_phrase': {'transcript': 'def gaaa'}},
{'match_phrase': {'transcript': 'gaaa ddf'}}]}}}
Search result:
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},
'hits': {'hits': [{'_id': '31',
'_index': 'test-index',
'_score': 4.0,
'_source': {'call_id': 31,
'transcript': 'def gaaa ddf ght'},
'_type': '_doc'},
{'_id': '30',
'_index': 'test-index',
'_score': 2.0,
'_source': {'call_id': 30,
'transcript': 'abcd def ght errr ght '
'def'},
'_type': '_doc'}],
'max_score': 4.0,
'total': {'relation': 'eq', 'value': 2}},
'timed_out': False,
'took': 3}
I divide each score by the number of words in the phrase (2.0 / 2 and 4.0 / 2) and get the number of matches: 1 and 2.
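The division step can be sketched as a small helper. The function name `phrase_matches` is hypothetical, and the sketch assumes every phrase in the query has the same number of words (2 here); the scores are the ones from the result above.

```python
def phrase_matches(score: float, words_per_phrase: int) -> int:
    """Convert a scripted-similarity score into a phrase match count."""
    return round(score / words_per_phrase)


# Scores taken from the search result above.
hits = [{'_id': '31', '_score': 4.0}, {'_id': '30', '_score': 2.0}]

counts = {hit['_id']: phrase_matches(hit['_score'], 2) for hit in hits}
print(counts)  # {'31': 2, '30': 1}
```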
But when I use a span query with slop, I get unpredictable results:
Search body:
body = {'_source': ['call_id', 'transcript'],
'query': {'bool': {'should': [
{'span_near': {
'clauses': [{'span_term': {'transcript': 'def'}},
{'span_term': {'transcript': 'ght'}}],
'in_order': False,
'slop': 2}}
]}}}
Result:
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},
'hits': {'hits': [{'_id': '30',
'_index': 'test-index',
'_score': 1.7333333,
'_source': {'call_id': 30, 'transcript': 'abcd def ght errr ght def'},
'_type': '_doc'},
{'_id': '31',
'_index': 'test-index',
'_score': 0.4,
'_source': {'call_id': 31, 'transcript': 'def gaaa ddf ght'},
'_type': '_doc'}],
'max_score': 1.7333333,
'total': {'relation': 'eq', 'value': 2}},
'timed_out': False,
'took': 4}
I found out that doc.freq in my formula is not an integer in this case and that it depends on the order of the words in the text, but I don't know how to disable this behavior.
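To make the non-integer frequency visible, the sketch below recovers the implied doc.freq from the two span scores. It assumes the same per-word factor of 2 as in the phrase case; the name `implied_freq` is hypothetical. The resulting fractions look consistent with each span match contributing 1 / (1 + width) instead of 1, which is the usual sloppy-frequency weighting, but that reading is an assumption based on these two scores only and is not confirmed against the Lucene source.

```python
from fractions import Fraction

WORDS = 2  # number of span_term clauses in the span_near query


def implied_freq(score: float) -> Fraction:
    """Recover an approximate doc.freq as a small fraction from the score."""
    return Fraction(score / WORDS).limit_denominator(100)


freq_30 = implied_freq(1.7333333)  # document 30
freq_31 = implied_freq(0.4)        # document 31
print(freq_30, freq_31)  # 13/15 1/5

# Assumed decomposition: 13/15 = 1/3 + 1/3 + 1/5, i.e. three weighted matches.
print(freq_30 == Fraction(1, 3) + Fraction(1, 3) + Fraction(1, 5))  # True
```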
So the question is: how can I count matched phrases when slop is allowed? What scripted formula should I use?