Aggregate over multiple keys in the document, without knowing the key details

I am new to Elastic Search and was exploring aggregation query. The documents I have are in the format -

{"name":"A",
     "class":"10th",
     "subjects":{
         "S1":92,
         "S2":92,
         "S3":92,
     }
}

another document can be -

{"name":"B",
     "class":"10th",
     "subjects":{
         "S3":92,
         "S2":92,
         "S5":92,
     }
}

We have about 40k such documents in our ES with the Subjects varying from student to student. The query to the system can be to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here, however, this generates a single bucket per document and in our understanding requires an explicit mention of every subject.

We want to system to generate subject wise aggregate for the data by executing a single aggregation query, the problem I face is that in our data the subjects could vary from student to student and we don't have a global list of subject keys.

We wrote the following script but this only works if we know all possible subjects.

GET student_data_v1_1/_search

{ "query" :
    {"match" : 
         { "class" : "' + query + '" }}, 
         "aggs" : { "my_buckets" : { "terms" : 
         { "field" : "subjects", "size":10000 },
         "aggregations": {"the_avg": 
                      {"avg": { "field": "subjects.value" }}} }},
          "size" : 0 }'

but this query only works for the document structure, but does not work multiple subjects are defined where we may not know the key-pair -

{"name":"A",
     "class":"10th",
     "subjects":{
         "value":93
     }
}

An alternate form the document is present is that the subject is a list of dictionaries -

    {"name":"A",
     "class":"10th",
     "subjects":[
         {"S1":92},
         {"S2":92},
         {"S3":92},
     ]
}

Having an aggregation query to solve either of the 2 document formats would be helpful.

You will probably have to use nested document type and aggregations for that:

DELETE test

PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "class": {
          "type": "keyword"
        },
        "subjects": {
          "type": "nested",
          "properties": {
            "subject": {
              "type": "keyword"
            },
            "score": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "name": "A",
  "class": "10th",
  "subjects": [
    {"subject": "S1", "score": 80},
    {"subject": "S2", "score": 85},
    {"subject": "S3", "score": 90}
  ]
}

PUT test/_doc/2
{
  "name": "B",
  "class": "10th",
  "subjects": [
    {"subject": "S3", "score": 90},
    {"subject": "S2", "score": 95},
    {"subject": "S5", "score": 100}
  ]
}

PUT test/_doc/3
{
  "name": "C",
  "class": "9th",
  "subjects": [
    {"subject": "S0", "score": 92},
    {"subject": "S10", "score": 92},
    {"subject": "S11", "score": 92}
  ]
}

GET test/_search
{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "size": 0,
  "aggs": {
    "subjects": {
      "nested": {
        "path": "subjects"
      },
      "aggs": {
        "subject": {
          "terms": {
            "field": "subjects.subject"
          },
          "aggs": {
            "score_stats": {
              "stats": {
                "field": "subjects.score"
              }
            }
          }
        }
      }
    }
  }
}

thanks @Igor_Motov this was exactly what I needed.

I am now trying to obtain a weighted average to the individual subject. I have updated the documents to hold a weight with each subject.

PUT test/_doc/3
{
"name": "C",
"class": "9th",
"subjects": [
{"subject": "S0", "score": 92, "weight":20},
{"subject": "S10", "score": 92, "weight":30},
{"subject": "S11", "score": 92, "weight":50}
]
}
GET test/_search
{
"query": {
"match": {
"class": "10th"
}
},
"size": 0,
"aggs": {
"subjects": {
"nested": {
"path": "subjects"
},
"aggs": {
"subject": {
"terms": {
"field": "subjects.subject"
},
"aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
}
}
}
}
}

however I get the following error -

{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}

Which version of elasticsearch are you using?

"version": {
"number": "6.2.2",
"build_hash": "10b1edd",
"build_date": "2018-02-16T19:01:30.685723Z",
"build_snapshot": false,
"lucene_version": "7.2.1",
"minimum_wire_compatibility_version": "5.6.0",
"minimum_index_compatibility_version": "5.0.0"
},

"number": "6.2.2"

weighted_avg was added in 6.4.0. So, your version doesn't support it.

great, thanks for helping identify and resolve the issue

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.