Need help with an aggregation


(Jan van Vlimmeren) #1

I already spent way too much time on this. I have this aggregation that I can't get to work like it should.

I have a ton of documents with a structure like this (I've ommited some of the parts that aren't relevant for this agg):

 {
     "url": "THE_URL",
         "url_params": {},
         "title": "Het Nieuwsblad",
     "referrer": null,
     "time": "2015-08-25T08:35:15.729Z",
     "referrerHost": null,
     "timeOnSite": 16,
     "blocks": [{
     "viewTime": 11,
         "click": 0,
         "view": 1,
         "block": "gentenaar__gentenaar-header"
 }, {
     "viewTime": 11,
         "click": 0,
         "view": 1,
         "block": "gentenaar__gentenaar-headline"
 }, ..., {
     "viewTime": 11,
         "click": 0,
         "view": 1,
         "block": "news__fast-news-sidebar-4-left"
 }, {
     "viewTime": 11,
         "click": 0,
         "view": 1,
         "block": "news__fast-news-sidebar-4-right"
 }, {
     "viewTime": 1,
         "click": 0,
         "view": 1,
         "block": "news__fast-news-sidebar-5"
 }, {
     "viewTime": 0,
         "click": 0,
         "view": 0,
         "block": "news__fast-news-sidebar-6-left"
 }, ...],
     "activeAds": [{
     "viewed": 1,
         "clicked": 0,
         "type": "button",
         "width": 317,
         "height": 75,
         "timeViewed": 10
 }, {
     "viewed": 1,
         "clicked": 0,
         "type": "xlleaderboard",
         "width": 990,
         "height": 122,
         "timeViewed": 10
     }, ...]...
 }

It's the blocks array that I'm having problems with. I want to know per block the sum of clicks, views and viewTime.

I'm trying to use this query:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "dt": {
                  "from": "2015-08-25T00:00:00+00:00",
                  "to": "2015-08-26T23:59:59+00:00"
                 }
               }
            },
            {
              "term": {
                 "url": "http://www.nieuwsblad.be/"
              }
             }
           ]
        }
      }
     }
   },
  "aggs": {
    "per_block": {
      "terms": {
        "field": "blocks.block"
      },
      "aggs": {
        "clicks": {
          "sum": {
            "field": "blocks.click"
           }
         }
      }
     }
   }
}

For the activeAds, I'm using nearly identical code but there it's working perfectly:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "dt": {
                  "from": "2015-08-25T00:00:00+00:00",
                  "to": "2015-08-26T23:59:59+00:00"
                }
              }
            },
            {
              "terms": {
                "page.page_type": [
                  "home"
                ]
              }
            }
          ],
          "must_not": {
            "term": {
              "activeAds.timeViewed": 0
            }
          }
        }
      }
    }
  },
  "aggs": {
    "per_ad": {
      "terms": {
        "field": "activeAds.type"
      },
      "aggs": {
        "clicks": {
          "sum": {
            "field": "activeAds.clicked"
           }
        },
         "views": {
          "sum": {
            "field": "activeAds.viewed"
          }
        },
        "total_time_viewed": {
          "sum": {
            "field": "activeAds.timeViewed"
          }
        }
      }
    }
  }
}

Anyone have any idea on what I'm doing wrong?


(Colin Goodheart-Smithe) #2

What is the error or problem you are finding when running the 'blocks' aggregation? What do you expect to get instead?


(Jan van Vlimmeren) #3

I'm not getting an error but the aggregation is returning the total sum of all clicks in all blocks instead of specifically for the block in the 'parent' aggregation. The block "crosspromo__crosspromo-header" should have a different number of clicks than "regio__regio-sidebar":

{
"took": 23427,
"timed_out": false,
"_shards": {
    "total": 42,
    "successful": 42,
    "failed": 0
},
"hits": {
    "total": 1557373,
    "max_score": 0.0,
    "hits": []
},
"aggregations": {
    "per_block": {
        "doc_count_error_upper_bound": 1555695,
        "sum_other_doc_count": 223013931,
        "buckets": [{
            "key": "crosspromo__crosspromo-header",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "regio__regio-sidebar",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-4",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-5",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-6",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-7",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-8",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-news-head-9",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-promo",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }, {
            "key": "she__she-sidebar-1",
            "doc_count": 1557021,
            "clicks": {
                "value": 890242.0
            }
        }]
    }
  }
}

(Colin Goodheart-Smithe) #4

Aggregations bucket documents at each level so the per_block terms aggregation in your request is putting documents (the whole document) into buckets based on the value of the block.block and then the clicks aggregation will evaluate the sum of all the blocks.click values across all the documents that fall into each bucket.

If you need to perform aggregations on the inner objects you may want to look into Nested Objects Type and use the Nested Aggregation to concentrate parts of your aggregation request on the inner objects rather than the document as a whole.

Another option would be to index the blocks as individual documents, possibly copying the url, referrer, etc into each of those documents if you need the information on the per block level.


(Jan van Vlimmeren) #5

Aggregations bucket documents at each level so the per_block terms aggregation in your request is putting documents (the whole document) into buckets based on the value of the block.block and then the clicks aggregation will evaluate the sum of all the blocks.click values across all the documents that fall into each bucket.

That's just it, it's not evaluating the sum of all documents that fall into each bucket. It's doing the sum of all documents period. I'm trying to recreate what I'm doing for the "activeAds" where I am getting correct results. The documents have the same structure so by just changing the names of the fields, I should be getting similar results but I don't.


(Colin Goodheart-Smithe) #6

The example you posted at the top is a single document. I am going to take this document as an example. The document has multiple blocks so the document will "fall" into multiple buckets (it will be associated with each bucket that it has a block.block value for. So in the example document it will be associated with 6 buckets. Within each bucket the sum aggregation will take the document and sum up the values of blocks.clicks in the document. That means that for the example document in each bucket the document will contribute 6 values to the sum aggregation since there are 6 values for block.click in the document. If the blocks.blocks values overlap a lot between your documents (i.e. most of your documents contain the same blocks.blocks values, then the majority of your documents will fall into every bucket and the sum aggregation will sum up to the same value.

This is why you should separate the block objects, either by indexing them as separate documents or by indexing them as Nested Objects so you can use the Nested Aggregation to ensure they are assigned separately to buckets rather than all together in one combine document.


(system) #7