Elasticsearch nested phrase search within a certain distance

Sample ES Document

{
    // other properties

    "transcript" : [
      {
        "id" : 0,
        "user_type" : "A",
        "phrase" : "hi good afternoon"
      },
      {
        "id" : 1,
        "user_type" : "B",
        "phrase" : "hey"
      },
      {
        "id" : 2,
        "user_type" : "A",
        "phrase" : "hi"
      },
      {
        "id" : 3,
        "user_type" : "B",
        "phrase" : "my name is john"
      }
    ]
  }

transcript is a nested field whose mapping looks like this:

{
   "type":"nested",
   "properties": {
      "id":{
         "type":"integer"
      },
      "phrase": {
         "type":"text",
         "analyzer":"standard"
      },
      "user_type": {
         "type":"keyword"
      }
   }
}

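I need to search for two phrases inside transcript that are at most a given distance d apart, where the distance between two phrases is the difference between the positions of their nested objects in the array.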

For example:

  1. If the phrases are hi and name and d is 1, the above document matches, because hi is present in the third nested object and name is present in the fourth. (Note: hi in the first nested object and name in the fourth is NOT a valid match, as they are more than d = 1 apart.)

  2. If the phrases are good and name and d is 1, the above document does not match, because good and name are a distance of 3 apart.

  3. If both phrases are present in the same sentence, the distance is considered to be 0.
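
To pin down the rule in these three examples, here is a minimal Python sketch of the check. It rests on several assumptions that are not part of the mapping: the helper names are made up for illustration, lowercasing plus whitespace splitting is only a rough stand-in for the standard analyzer, every entry is assumed to keep its text in phrase, and the distance is taken as the absolute difference between the id values of the two nested objects.

```python
def tokens(text):
    # Rough stand-in for the standard analyzer: lowercase, split on whitespace.
    return text.lower().split()


def contains_phrase(sentence, phrase):
    # True if the phrase tokens appear contiguously in the sentence tokens.
    s, p = tokens(sentence), tokens(phrase)
    if not p:
        return False
    return any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1))


def matches_within_distance(transcript, first, second, d):
    # Collect the ids of the nested objects that contain each phrase.
    first_ids = [e["id"] for e in transcript
                 if contains_phrase(e.get("phrase", ""), first)]
    second_ids = [e["id"] for e in transcript
                  if contains_phrase(e.get("phrase", ""), second)]
    # Assumption: distance is symmetric, i.e. |id_a - id_b|.
    return any(abs(a - b) <= d for a in first_ids for b in second_ids)


# Data modelled on the sample document.
transcript = [
    {"id": 0, "user_type": "A", "phrase": "hi good afternoon"},
    {"id": 1, "user_type": "B", "phrase": "hey"},
    {"id": 2, "user_type": "A", "phrase": "hi"},
    {"id": 3, "user_type": "B", "phrase": "my name is john"},
]

print(matches_within_distance(transcript, "hi", "name", 1))         # True
print(matches_within_distance(transcript, "good", "name", 1))       # False
print(matches_within_distance(transcript, "good", "afternoon", 0))  # True
```

The three calls correspond to examples 1, 2 and 3 respectively.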

Possible solutions:

  1. I can fetch all documents in which both phrases are present and, on the application side, discard those where the phrases are more than the given threshold d apart. The problem is that I cannot get the count of matching documents beforehand to show in the UI (for example, "found in 100 documents out of 1900"): without application-side processing we can't be sure whether a document is really a match, and it's not feasible to process every document in the index.

  2. The second possible solution is to enumerate the allowed offset combinations in the query:

{
	"query": {
		"bool": {

			// suppose d = 2

			// if the first phrase occurs at offset 0, the second phrase can occur at
			// ... offset 0, 1 or 2

			// if the first phrase occurs at offset 1, the second phrase can occur at
			// ... offset 1, 2 or 3

			// at least one of the above combinations should match

			"should": [
				{
					// search for 1st permutation
				},
				{
					// search for 2nd permutation
				},
				...
			]
		}
	}
}

This is clearly not scalable: if d is large and the transcript is long, the query becomes very large.
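
To make the size concrete, here is a rough Python sketch that generates such a query. It is only one way the enumeration could be written, not a recommended solution. The function names and the max_len parameter (an assumed upper bound on the number of transcript entries) are hypothetical. Each offset is pinned with a term query on transcript.id inside a nested query, and the phrase is matched with match_phrase on transcript.phrase. Like the sketch above, it only enumerates the second phrase at or after the first; if order does not matter, the mirrored pairs would have to be added as well.

```python
def nested_clause(offset, phrase):
    # One nested object must have this id and contain the phrase.
    return {
        "nested": {
            "path": "transcript",
            "query": {
                "bool": {
                    "filter": [{"term": {"transcript.id": offset}}],
                    "must": [{"match_phrase": {"transcript.phrase": phrase}}],
                }
            },
        }
    }


def build_query(first, second, d, max_len):
    # max_len is an assumed upper bound on the transcript length.
    should = []
    for i in range(max_len):
        # Second phrase may sit at offsets i .. i + d (capped at the last offset).
        for j in range(i, min(i + d, max_len - 1) + 1):
            should.append(
                {"bool": {"must": [nested_clause(i, first),
                                   nested_clause(j, second)]}}
            )
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


query = build_query("hi", "name", d=2, max_len=4)
print(len(query["query"]["bool"]["should"]))  # 9
```

Even for a four-entry transcript and d = 2 this yields 9 should clauses, and the count grows roughly as max_len * (d + 1), which is the scalability problem described above.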

Any suggestions for a better approach would be appreciated.
