The two approaches for rolling up accurately are as follows:
- Examine all the data in your client code
- Use the aggs framework to summarise the data for you.
Option 1 involves using the scan/scroll API to stream the data, sorted by your chosen summary dimensions (e.g. hour/website), reducing it in your client code, and then writing the summaries to a new index using the bulk API (see [1]).
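The client-side reduce step might look something like this sketch (pure Python, field names like `website`, `hour` and `bytes` are just illustrative; in practice the hits would come from a scroll over your raw index):

```python
from itertools import groupby

def rollup(hits):
    """Reduce hits (already sorted by website, then hour) into one
    summary doc per (website, hour), ready for a bulk index request."""
    docs = []
    for (website, hour), group in groupby(
            hits, key=lambda h: (h["website"], h["hour"])):
        group = list(group)
        docs.append({
            "website": website,
            "hour": hour,
            "num_requests": len(group),
            "total_bytes": sum(h["bytes"] for h in group),
        })
    return docs
```

Because the stream is sorted on the grouping keys, you only ever hold one group in memory at a time, so this scales to arbitrarily large indices.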
Option 2 involves repeatedly calling the aggs framework for a subset of the data, e.g.
    for each website:
        aggs call to get daily stats for that website
This can mean a lot of calls if your grouping field is high-cardinality, so one way of breaking the work into a smaller set of requests is to adopt the hash/modulo approach outlined here [2].
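The hash/modulo idea boils down to assigning every term to one of N partitions with a stable hash, then issuing one aggs request per partition rather than one per term. A minimal sketch of the partitioning function (the partition count of 20 is just an example; tune it so each partition's terms fit comfortably in one response):

```python
import hashlib

NUM_PARTITIONS = 20  # illustrative; pick a value that suits your cardinality

def partition_of(term, num_partitions=NUM_PARTITIONS):
    """Stable partition assignment for a term (e.g. a website name).

    Uses a digest rather than Python's built-in hash(), which is
    salted per process and so not stable across runs.
    """
    return int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) % num_partitions
```

Each of the N requests then restricts the query to partition p (with a script filter, or with the terms agg's partitioning support in later Elasticsearch releases) and runs the daily-stats agg over just that slice, so every term is covered exactly once.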
Cheers
Mark
[1] https://www.elastic.co/elasticon/2015/sf/building-entity-centric-indexes and http://bit.ly/entcent
[2] Getting accurate cardinality for a field in single shard index