ElasticSearch document frequency on bucket aggregation


#1

Hello,
I am having difficulty with a very specific script in Elasticsearch. I want to apply pointwise mutual information (PMI) to analyse relations between concepts in text.

My mapping (simplified):

publication:{
    annotations:{ <- Nested Object, include_in_parent
        sentences:{ <- Nested Object, include_in_parent
            id:{<- String list of found ID's, doc_values = true
            }
        }
    }
}

To compute pointwise mutual information I need the document frequency of id_a, the document frequency of id_b, and the number of documents in which they co-occur. The document frequency of id_a is given by the query's hit count, and the co-occurrence count by each bucket's doc_count. My problem lies in finding the document frequency of the bucket's id (id_b).
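For reference, this is a minimal sketch of the computation I have in mind on the client side. The function name `pmi` and the parameter `n_docs` (the total number of documents in the index) are only illustrative, and the choice of log base 2 is an assumption.

```python
import math


def pmi(df_a, df_b, co_count, n_docs):
    """Document-level PMI: log2( P(a, b) / (P(a) * P(b)) ).

    df_a, df_b -- number of documents containing id_a / id_b
    co_count   -- number of documents containing both ids
    n_docs     -- total number of documents (assumed known)
    """
    if df_a <= 0 or df_b <= 0 or n_docs <= 0:
        raise ValueError("document frequencies and n_docs must be positive")
    if co_count == 0:
        # The two ids never co-occur, so PMI tends to minus infinity.
        return float("-inf")
    p_a = df_a / n_docs
    p_b = df_b / n_docs
    p_ab = co_count / n_docs
    return math.log2(p_ab / (p_a * p_b))
```

With this, df_a would come from the query's total hits, co_count from a bucket's doc_count, and df_b is the value I cannot obtain reliably.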

I've tried multiple ways to solve the following problem: given an ID, find in how many documents each other ID co-occurs with it, while also obtaining each aggregated ID's document frequency in the field.

body={
    "fields": ["annotations.sentences.id"],
    "query": {
        "match": {"annotations.sentences.id": id}
    },
    "size": 0,
    "aggs": {
        "cuis": {
            "terms": {
                "size": 0,
                "field": "annotations.sentences.id",
                # key each bucket as "<id>|<df>"
                "script": "_value + '|' + _index[\"annotations.sentences.id\"][_value].df()"
            }
        }
    }
}

I used this approach only because I could not find another way to access a terms aggregation bucket's key. On the client side I split the key to recover the data; however, the document frequency returned by the script differs from the actual value by a large margin. If I try to access ttf() instead, the result is -1.
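The client-side splitting is roughly the following sketch. It assumes the buckets come from response["aggregations"]["cuis"]["buckets"] and that every key has the form "id|df" produced by the script above; the helper name `split_buckets` and the sample values are made up for illustration.

```python
def split_buckets(buckets):
    """Return (id, df, co_count) tuples from buckets keyed as 'id|df'."""
    rows = []
    for bucket in buckets:
        # rpartition splits on the last '|', so an id that itself
        # contains '|' is kept intact.
        term_id, _, df = bucket["key"].rpartition("|")
        rows.append((term_id, int(df), bucket["doc_count"]))
    return rows


# Made-up buckets in the shape a terms aggregation returns.
sample = [
    {"key": "C0001|42", "doc_count": 7},
    {"key": "C0002|13", "doc_count": 3},
]
rows = split_buckets(sample)
```

Here the second element of each tuple is the df() value reported by the script, which is the number that does not match the real document frequency.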
Since the DF was too large, I assumed the nested structure was the cause, so I changed the mapping to include a field called "cui_lst" at the root of the document, containing a list of all IDs the document has.
That resulted in the following query:

body={
    "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "term": {
                    "cui_lst": id
                }
            }
        }
    },
    "size": 0,
    "aggregations": {
        "cuis": {
            "terms": {
                "size": 0,
                "field": "cui_lst",
                "script": "_value + '|' + _index[\"cui_lst\"][_value].df()"
            }
        }
    }
}
However, the document frequencies still differ from the real values.
Am I approaching this the wrong way? I've tried scripts, terms aggregations, significant terms aggregations, and Groovy scripts, and I still can't get reliable counts of the documents in which each aggregated ID appears.
Is there an efficient way to resolve this that I have overlooked?

