If you wanted to do the analytics offline, you could use the scroll API to stream all the data out of Elasticsearch and do your calculations on the client side.
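For example, a minimal sketch using the Python client's `helpers.scan` (which pages through the scroll API for you); the index name (`ticks`) and the field names are assumptions on my part:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Stream every matching document out via the scroll API, keeping only
# the fields needed for the offline gap analysis. Index and field
# names here are illustrative.
timestamps_by_pair = {}
for hit in helpers.scan(
    es,
    index="ticks",
    query={"query": {"match_all": {}}, "_source": ["ccPairs", "timestamp"]},
    size=1000,
):
    src = hit["_source"]
    timestamps_by_pair.setdefault(src["ccPairs"], []).append(src["timestamp"])

# ...sort each list and compute the gaps client-side...
```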
The problem with your current approach is that you may need to stream a lot of data from the shards to the coordinating node, because the number of timestamps per bucket could be large. You could mitigate this in two ways:
- When indexing documents, use `routing` to route documents with the same `ccPairs` value to the same shard. This way you are guaranteed that all timestamps for a given term bucket are on the same shard. You will still need to do complex processing, though, so you will likely still need the `scripted_metric` aggregation (see the sketches after this list).
- Have a secondary index where each document represents a `ccPairs` value and records whether there is a gap and at which timestamps. This involves a job that runs periodically, collects new data from the primary index, and merges it into the relevant documents in the secondary index. At query time you can then run normal aggregations to obtain the information you need, provided you structure the secondary-index documents to expose this data.
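For the routing option, the indexing call could look roughly like this; the index name and the sample document are made up, and I am assuming the 8.x Python client (which takes a `document=` argument):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {"ccPairs": "EUR/USD", "timestamp": "2024-01-01T00:00:00Z"}

# Use the ccPairs value as the routing key so that every document for a
# given pair is stored on the same shard. Queries that pass the same
# routing value can also be limited to that single shard.
es.index(index="ticks", routing=doc["ccPairs"], document=doc)
```

For the secondary-index option, the periodic job could, in outline, scan recently indexed documents, group their timestamps per `ccPairs` value, and upsert one summary document per pair. Again a rough sketch under the same assumptions (index names, field names, and the one-hour window are all hypothetical, and the actual merge logic is only indicated in comments):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Collect timestamps per ccPairs value from documents indexed since the
# last run (the watermark bookkeeping is omitted for brevity).
per_pair = {}
for hit in helpers.scan(
    es,
    index="ticks",
    query={"query": {"range": {"timestamp": {"gte": "now-1h"}}}},
):
    src = hit["_source"]
    per_pair.setdefault(src["ccPairs"], []).append(src["timestamp"])

# Upsert one summary document per ccPairs value into the secondary index.
for pair, stamps in per_pair.items():
    # ...merge `stamps` with what is already stored for this pair and
    # recompute the gap information before writing it back...
    es.index(index="tick-gaps", id=pair, document={
        "ccPairs": pair,
        "gapTimestamps": [],  # produced by the merge step sketched above
    })
```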
If your data volumes are small enough that your current approach (after you add a reduce script) works well, then you can stick with it, but it is worth keeping the options above in mind in case your data volume grows and performance starts to suffer.
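For completeness, here is roughly what the `scripted_metric` shape with a reduce script added could look like. The field names, the epoch-millis conversion, and the 60-second gap threshold are assumptions on my part, and `ccPairs` is assumed to be a `keyword` field:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Per-ccPairs gap detection: map/combine gather timestamps on each shard;
# the reduce script is the only place where results from all shards are
# visible at once, so the sorting and gap detection belong there.
body = {
    "size": 0,
    "aggs": {
        "by_pair": {
            "terms": {"field": "ccPairs"},
            "aggs": {
                "gaps": {
                    "scripted_metric": {
                        "params": {"max_gap_ms": 60000},
                        "init_script": "state.ts = []",
                        "map_script": (
                            "state.ts.add(doc['timestamp'].value"
                            ".toInstant().toEpochMilli())"
                        ),
                        "combine_script": "return state.ts",
                        "reduce_script": """
                            def all = [];
                            for (s in states) { if (s != null) all.addAll(s) }
                            Collections.sort(all);
                            def gaps = [];
                            for (int i = 1; i < all.size(); i++) {
                                if (all.get(i) - all.get(i - 1) > params.max_gap_ms) {
                                    gaps.add(all.get(i - 1));
                                }
                            }
                            return gaps;
                        """,
                    }
                }
            }
        }
    },
}

# body= works with the 7.x client; newer clients accept the same keys
# as keyword arguments instead.
resp = es.search(index="ticks", body=body)
```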