ElasticSearch document frequency on bucket aggregation

sekas · February 16, 2016, 5:49pm

Hello,
I am having difficulty writing a very specific script in Elasticsearch. I want to apply point-wise mutual information (PMI) to analyse relations between concepts in text.

My mapping (simplified):

publication:{
    annotations:{ <- Nested Object, include_in_parent
        sentences:{ <- Nested Object, include_in_parent
            id:{<- String list of found ID's, doc_values = true
            }
        }
    }
}

To compute point-wise mutual information I need the document frequency of id_a, the document frequency of id_b, and the number of documents where they co-occur. The document frequency of id_a is given by the query's hit count, and the co-occurrence count by each bucket's doc_count. My problem lies in finding the document frequency of the bucket's ID.
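For reference, this is the quantity I am after, written as a minimal Python sketch. The function name `pmi` and the `total_docs` argument (the number of documents in the index) are placeholders of mine and not part of my mapping or queries; the base-2 logarithm is just one common choice.

```python
import math

def pmi(df_a, df_b, co_occurrences, total_docs):
    """Point-wise mutual information from document counts.

    df_a, df_b      -- number of documents containing id_a / id_b
    co_occurrences  -- number of documents containing both IDs
    total_docs      -- total number of documents (assumed known)
    """
    if min(df_a, df_b, co_occurrences, total_docs) <= 0:
        raise ValueError("all counts must be positive")
    # p(a,b) / (p(a) * p(b)) simplifies to co * N / (df_a * df_b)
    return math.log2((co_occurrences * total_docs) / (df_a * df_b))

# Two IDs that co-occur exactly as often as chance predicts score 0.
print(pmi(50, 50, 25, 100))  # 0.0
```

The only input I cannot get reliably is `df_b`, the document frequency of each bucket's ID.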

I've tried multiple ways to solve the following problem: given an ID, find how many times each other ID occurs in the same document, while also obtaining each aggregated ID's document frequency in the field.

body={
    "fields": ["annotations.sentences.id"],
    "query": {
        "match": {"annotations.sentences.id": id}
    },
    "size": 0,
    "aggs": {
        "cuis": {
            "terms": {
                "size": 0,
                "field": "annotations.sentences.id",
                "script": "_value + '|' + _index[\"annotations.sentences.id\"][_value].df()"
            }
        }
    }
}

I only used this approach because I could not find a way to access a terms aggregation bucket's key. On the client side I split each key back into the ID and the document frequency; however, the reported document frequency and the actual value differ by a large margin. If I try to access ttf(), the result is -1.
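The client-side split looks roughly like the sketch below. It assumes the standard terms aggregation response shape (`aggregations` -> `cuis` -> `buckets`, each with `key` and `doc_count`); the helper name `split_buckets` and the sample values are made up for illustration.

```python
def split_buckets(response, agg_name="cuis"):
    """Split 'id|df' bucket keys into (id, df, co_occurrence_count) tuples."""
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        # rpartition splits on the last '|', so an ID containing '|' stays intact
        term, _, df = bucket["key"].rpartition("|")
        rows.append((term, int(df), bucket["doc_count"]))
    return rows

# Hypothetical response fragment, not real data from my index
sample = {
    "aggregations": {
        "cuis": {
            "buckets": [
                {"key": "C0001|42", "doc_count": 7},
                {"key": "C0002|5", "doc_count": 3},
            ]
        }
    }
}
print(split_buckets(sample))  # [('C0001', 42, 7), ('C0002', 5, 3)]
```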
Since the DF was too large, I assumed the nested structure was the cause, so I changed the mapping to include a field called "id_lst" in the root of the document, containing a list of all IDs the document has.
That resulted in the following query:

body={
    "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "term": {
                    "cui_lst": id
                }
            }
        }
    },
    "size": 0,
    "aggregations": {
        "cuis": {
            "terms": {
                "size": 0,
                "field": "cui_lst",
                "script": "_value + '|' + _index[\"cui_lst\"][_value].df()"
            }
        }
    }
}

However, the document frequencies still differ from the real values.
Am I approaching this the wrong way? I've tried scripts, terms aggregations, significant terms aggregations and Groovy scripts, and I still can't get a reliable count of the documents in which the aggregated ID appears.
Is there an efficient way to resolve this issue that I have overlooked?
