Sample ES Document
{
// other properties
"transcript" : [
{
"id" : 0,
"user_type" : "A",
"phrase" : "hi good afternoon"
},
{
"id" : 1,
"user_type" : "B",
"phrase" : "hey"
},
{
"id" : 2,
"user_type" : "A",
"phrase" : "hi"
},
{
"id" : 3,
"user_type" : "B",
"phrase" : "my name is john"
}
]
}
transcript is a nested field whose mapping looks like:
{
"type":"nested",
"properties": {
"id":{
"type":"integer"
},
"phrase": {
"type":"text",
"analyzer":"standard"
},
"user_type": {
"type":"keyword"
}
}
}
I need to search for two phrases inside transcript that are at most a given distance d apart, where the distance is the difference between the positions of the nested objects that contain them.
For example:
- If the phrases are "hi" and "name" and d is 1, the above document matches because "hi" is present in the third nested object and "name" is present in the fourth nested object. (Note: "hi" in the first nested object and "name" in the fourth nested object is NOT a valid match, as they are more than d = 1 apart.)
- If the phrases are "good" and "name" and d is 1, the above document does not match because "good" and "name" are 3 apart.
- If both phrases are present in the same sentence, the distance is considered to be 0.
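To make the matching rule concrete, here is a minimal Python sketch of the check I have in mind. It is only an illustration under a few assumptions: every entry stores its text in phrase, the id field is used as the position, the distance is symmetric (absolute difference), and the tokens helper is a rough stand-in for the standard analyzer rather than its real behaviour. The names tokens, contains_phrase, min_distance and matches are made up for this example.

```python
import re

def tokens(text):
    # Rough stand-in for the standard analyzer (assumption):
    # lowercase, then split on non-word characters.
    return re.findall(r"\w+", text.lower())

def contains_phrase(sentence, phrase):
    # True if the phrase's tokens appear contiguously in the sentence.
    words, target = tokens(sentence), tokens(phrase)
    if not target:
        return False
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

def min_distance(transcript, phrase_a, phrase_b):
    # Smallest id difference between an entry containing phrase_a and one
    # containing phrase_b, or None if either phrase never occurs.
    pos_a = [e["id"] for e in transcript if contains_phrase(e.get("phrase", ""), phrase_a)]
    pos_b = [e["id"] for e in transcript if contains_phrase(e.get("phrase", ""), phrase_b)]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def matches(transcript, phrase_a, phrase_b, d):
    dist = min_distance(transcript, phrase_a, phrase_b)
    return dist is not None and dist <= d

sample = [
    {"id": 0, "user_type": "A", "phrase": "hi good afternoon"},
    {"id": 1, "user_type": "B", "phrase": "hey"},
    {"id": 2, "user_type": "A", "phrase": "hi"},
    {"id": 3, "user_type": "B", "phrase": "my name is john"},
]

print(matches(sample, "hi", "name", 1))    # first example above
print(matches(sample, "good", "name", 1))  # second example above
```

If the order of the two phrases matters, the abs(a - b) would have to become a one-sided check instead.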
Possible solutions:
- I can fetch all documents where both phrases are present and, on the application side, discard documents where the phrases are more than the given threshold (d) apart. The problem is that I cannot get the count of matching documents beforehand to show in the UI something like "found in 100 documents out of 1900": without application-side processing we can't be sure whether a document is really a match, and it is not feasible to process every document in the index.
- The second possible solution is a query like this:
{
"query": {
"bool": {
// suppose d = 2
// if first phrase occurs at 0th offset, second phrase can occur at
// ... 0th, 1st or 2nd offset
// if first phrase occurs at 1st offset, second phrase can occur at
// ... 1st, 2nd or 3rd offset
// any one of above permutation should exist
"should": [
{
// search for 1st permutation
},
{
// search for 2nd permutation
},
...
]
}
}
}
This is clearly not scalable: if d is large and the transcript is long, the query is going to be very big.
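To show how quickly the enumeration grows, here is a rough Python sketch that builds such a query as a plain dict. It is just one plausible shape for each permutation, not something I have settled on: every permutation is a bool/must of two nested queries that pin a phrase to a position with a term on transcript.id plus a match_phrase on transcript.phrase. It follows the comments in the query (the second phrase at the same or a later offset). The helpers phrase_at and build_query and the max_len parameter (an assumed upper bound on the transcript length) are hypothetical.

```python
def phrase_at(position, phrase):
    # Nested clause: `phrase` occurs in the transcript entry whose id is `position`.
    return {
        "nested": {
            "path": "transcript",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"transcript.id": position}},
                        {"match_phrase": {"transcript.phrase": phrase}},
                    ]
                }
            },
        }
    }

def build_query(first, second, d, max_len):
    # One should clause per (i, j) pair with i <= j <= i + d and j < max_len.
    should = []
    for i in range(max_len):
        for j in range(i, min(i + d, max_len - 1) + 1):
            should.append(
                {"bool": {"must": [phrase_at(i, first), phrase_at(j, second)]}}
            )
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}

def clause_count(d, max_len):
    # Number of permutations the query has to spell out.
    return len(build_query("a", "b", d, max_len)["query"]["bool"]["should"])

print(clause_count(2, 4))
print(clause_count(10, 100))
```

For d = 2 and a transcript of 4 entries this already produces 9 clauses, and for d = 10 with 100 entries it produces 1045, each containing two nested queries.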
Kindly suggest any approach.