Need help with an aggregation

Jan_van_Vlimmeren · August 26, 2015, 12:04pm

I already spent way too much time on this. I have this aggregation that I can't get to work like it should.

I have a ton of documents with a structure like this (I've ommited some of the parts that aren't relevant for this agg):

 {
     "url": "THE_URL",
         "url_params": {},
         "title": "Het Nieuwsblad",
     "referrer": null,
     "time": "2015-08-25T08:35:15.729Z",
     "referrerHost": null,
     "timeOnSite": 16,
     "blocks": [{
     "viewTime": 11,
         "click": 0,
         "view": 1,
         "block": "gentenaar__gentenaar-header"
 }, {
     "viewTime": 11,
         "click": 0,
         "view": 1,
         "block": "gentenaar__gentenaar-headline"
 }, ..., {
     "viewTime": 11,
         "click": 0,
         "view": 1,
         "block": "news__fast-news-sidebar-4-left"
 }, {
     "viewTime": 11,
         "click": 0,
         "view": 1,
         "block": "news__fast-news-sidebar-4-right"
 }, {
     "viewTime": 1,
         "click": 0,
         "view": 1,
         "block": "news__fast-news-sidebar-5"
 }, {
     "viewTime": 0,
         "click": 0,
         "view": 0,
         "block": "news__fast-news-sidebar-6-left"
 }, ...],
     "activeAds": [{
     "viewed": 1,
         "clicked": 0,
         "type": "button",
         "width": 317,
         "height": 75,
         "timeViewed": 10
 }, {
     "viewed": 1,
         "clicked": 0,
         "type": "xlleaderboard",
         "width": 990,
         "height": 122,
         "timeViewed": 10
     }, ...]...
 }

It's the blocks array that I'm having problems with. I want to know per block the sum of clicks, views and viewTime.

I'm trying to use this query:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "dt": {
                  "from": "2015-08-25T00:00:00+00:00",
                  "to": "2015-08-26T23:59:59+00:00"
                 }
               }
            },
            {
              "term": {
                 "url": "http://www.nieuwsblad.be/"
              }
             }
           ]
        }
      }
     }
   },
  "aggs": {
    "per_block": {
      "terms": {
        "field": "blocks.block"
      },
      "aggs": {
        "clicks": {
          "sum": {
            "field": "blocks.click"
           }
         }
      }
     }
   }
}

For the activeAds, I'm using nearly identical code but there it's working perfectly:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "dt": {
                  "from": "2015-08-25T00:00:00+00:00",
                  "to": "2015-08-26T23:59:59+00:00"
                }
              }
            },
            {
              "terms": {
                "page.page_type": [
                  "home"
                ]
              }
            }
          ],
          "must_not": {
            "term": {
              "activeAds.timeViewed": 0
            }
          }
        }
      }
    }
  },
  "aggs": {
    "per_ad": {
      "terms": {
        "field": "activeAds.type"
      },
      "aggs": {
        "clicks": {
          "sum": {
            "field": "activeAds.clicked"
           }
        },
         "views": {
          "sum": {
            "field": "activeAds.viewed"
          }
        },
        "total_time_viewed": {
          "sum": {
            "field": "activeAds.timeViewed"
          }
        }
      }
    }
  }
}

Anyone have any idea on what I'm doing wrong?

colings86 · August 26, 2015, 12:17pm

What is the error or problem you are finding when running the 'blocks' aggregation? What do you expect to get instead?

Jan_van_Vlimmeren · August 26, 2015, 12:23pm

I'm not getting an error but the aggregation is returning the total sum of all clicks in all blocks instead of specifically for the block in the 'parent' aggregation. The block "crosspromo__crosspromo-header" should have a different number of clicks than "regio__regio-sidebar":

{
"took": 23427,
"timed_out": false,
"_shards": {
    "total": 42,
    "successful": 42,
    "failed": 0
},
"hits": {
    "total": 1557373,
    "max_score": 0.0,
    "hits": []
},
"aggregations": {
    "per_block": {
        "doc_count_error_upper_bound": 1555695,
        "sum_other_doc_count": 223013931,
        "buckets": [{
            "key": "crosspromo__crosspromo-header",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "regio__regio-sidebar",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-4",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-5",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-6",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-7",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-8",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-9",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-promo",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-sidebar-1",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }]
    }
  }
}

colings86 · August 26, 2015, 12:36pm

Aggregations bucket documents at each level so the per_block terms aggregation in your request is putting documents (the whole document) into buckets based on the value of the block.block and then the clicks aggregation will evaluate the sum of all the blocks.click values across all the documents that fall into each bucket.

If you need to perform aggregations on the inner objects you may want to look into Nested Objects Type and use the Nested Aggregation to concentrate parts of your aggregation request on the inner objects rather than the document as a whole.

Another option would be to index the blocks as individual documents, possibly copying the url, referrer, etc into each of those documents if you need the information on the per block level.

Jan_van_Vlimmeren · August 26, 2015, 12:44pm

Aggregations bucket documents at each level so the per_block terms aggregation in your request is putting documents (the whole document) into buckets based on the value of the block.block and then the clicks aggregation will evaluate the sum of all the blocks.click values across all the documents that fall into each bucket.

That's just it, it's not evaluating the sum of all documents that fall into each bucket. It's doing the sum of all documents period. I'm trying to recreate what I'm doing for the "activeAds" where I am getting correct results. The documents have the same structure so by just changing the names of the fields, I should be getting similar results but I don't.

colings86 · August 26, 2015, 1:04pm

The example you posted at the top is a single document. I am going to take this document as an example. The document has multiple blocks so the document will "fall" into multiple buckets (it will be associated with each bucket that it has a block.block value for. So in the example document it will be associated with 6 buckets. Within each bucket the sum aggregation will take the document and sum up the values of blocks.clicks in the document. That means that for the example document in each bucket the document will contribute 6 values to the sum aggregation since there are 6 values for block.click in the document. If the blocks.blocks values overlap a lot between your documents (i.e. most of your documents contain the same blocks.blocks values, then the majority of your documents will fall into every bucket and the sum aggregation will sum up to the same value.

This is why you should separate the block objects, either by indexing them as separate documents or by indexing them as Nested Objects so you can use the Nested Aggregation to ensure they are assigned separately to buckets rather than all together in one combine document.

Topic		Replies	Views
Aggregating array of objects Elasticsearch	3	2147	January 11, 2019
Sub-aggregations not working as expected Elasticsearch	3	386	July 6, 2017
Aggregate sum wrong with huge datasets Elasticsearch	1	282	March 25, 2021
Help with aggregations Elasticsearch	4	1863	December 25, 2017
Aggregation problem Elasticsearch	2	333	July 6, 2017

Need help with an aggregation

Related topics