Aggregate over multiple keys in the document, without knowing the key details


(User User) #1

I am new to Elastic Search and was exploring aggregation query. The documents I have are in the format -

{"name":"A",
     "class":"10th",
     "subjects":{
         "S1":92,
         "S2":92,
         "S3":92,
     }
}

another document can be -

{"name":"B",
     "class":"10th",
     "subjects":{
         "S3":92,
         "S2":92,
         "S5":92,
     }
}

We have about 40k such documents in our ES with the Subjects varying from student to student. The query to the system can be to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here, however, this generates a single bucket per document and in our understanding requires an explicit mention of every subject.

We want to system to generate subject wise aggregate for the data by executing a single aggregation query, the problem I face is that in our data the subjects could vary from student to student and we don't have a global list of subject keys.

We wrote the following script but this only works if we know all possible subjects.

GET student_data_v1_1/_search

{ "query" :
    {"match" : 
         { "class" : "' + query + '" }}, 
         "aggs" : { "my_buckets" : { "terms" : 
         { "field" : "subjects", "size":10000 },
         "aggregations": {"the_avg": 
                      {"avg": { "field": "subjects.value" }}} }},
          "size" : 0 }'

but this query only works for the document structure, but does not work multiple subjects are defined where we may not know the key-pair -

{"name":"A",
     "class":"10th",
     "subjects":{
         "value":93
     }
}

An alternate form the document is present is that the subject is a list of dictionaries -

    {"name":"A",
     "class":"10th",
     "subjects":[
         {"S1":92},
         {"S2":92},
         {"S3":92},
     ]
}

Having an aggregation query to solve either of the 2 document formats would be helpful.


(Igor Motov) #2

You will probably have to use nested document type and aggregations for that:

DELETE test

PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "class": {
          "type": "keyword"
        },
        "subjects": {
          "type": "nested",
          "properties": {
            "subject": {
              "type": "keyword"
            },
            "score": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "name": "A",
  "class": "10th",
  "subjects": [
    {"subject": "S1", "score": 80},
    {"subject": "S2", "score": 85},
    {"subject": "S3", "score": 90}
  ]
}

PUT test/_doc/2
{
  "name": "B",
  "class": "10th",
  "subjects": [
    {"subject": "S3", "score": 90},
    {"subject": "S2", "score": 95},
    {"subject": "S5", "score": 100}
  ]
}

PUT test/_doc/3
{
  "name": "C",
  "class": "9th",
  "subjects": [
    {"subject": "S0", "score": 92},
    {"subject": "S10", "score": 92},
    {"subject": "S11", "score": 92}
  ]
}

GET test/_search
{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "size": 0,
  "aggs": {
    "subjects": {
      "nested": {
        "path": "subjects"
      },
      "aggs": {
        "subject": {
          "terms": {
            "field": "subjects.subject"
          },
          "aggs": {
            "score_stats": {
              "stats": {
                "field": "subjects.score"
              }
            }
          }
        }
      }
    }
  }
}

(User User) #3

thanks @Igor_Motov this was exactly what I needed.

I am now trying to obtain a weighted average to the individual subject. I have updated the documents to hold a weight with each subject.

PUT test/_doc/3
{
"name": "C",
"class": "9th",
"subjects": [
{"subject": "S0", "score": 92, "weight":20},
{"subject": "S10", "score": 92, "weight":30},
{"subject": "S11", "score": 92, "weight":50}
]
}
GET test/_search
{
"query": {
"match": {
"class": "10th"
}
},
"size": 0,
"aggs": {
"subjects": {
"nested": {
"path": "subjects"
},
"aggs": {
"subject": {
"terms": {
"field": "subjects.subject"
},
"aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
}
}
}
}
}

however I get the following error -

{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}


(Igor Motov) #4

Which version of elasticsearch are you using?


(User User) #5

"version": {
"number": "6.2.2",
"build_hash": "10b1edd",
"build_date": "2018-02-16T19:01:30.685723Z",
"build_snapshot": false,
"lucene_version": "7.2.1",
"minimum_wire_compatibility_version": "5.6.0",
"minimum_index_compatibility_version": "5.0.0"
},


(Igor Motov) #6

"number": "6.2.2"

weighted_avg was added in 6.4.0. So, your version doesn't support it.


(User User) #7

great, thanks for helping identify and resolve the issue