As it panned out, we've written a fairly nasty (speaking frankly) shell script to calculate date rounding for either current or backfill periods, build a single unified query per interval, and send the modified output back to logstash and ultimately to a separate (client-facing) ES cluster. We pull data with a top-level terms agg to split by client, then various sub aggs as required by the eventual visualisations.
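A rough sketch of those two moving parts is below; the hosts, index names, field names and agg sizes are placeholders rather than our real ones, and GNU date is assumed:

```sh
#!/usr/bin/env bash
# Round "now" (or a backfill timestamp passed as $1, in epoch seconds)
# down to the most recent 5-minute boundary.
INTERVAL=300
now=${1:-$(date -u +%s)}
end=$(( now / INTERVAL * INTERVAL ))
start=$(( end - INTERVAL ))

# One unified query for the interval: a top-level terms agg to split by client,
# plus the sub aggs the visualisations need. The terms size is deliberately
# larger than the dashboards will use (see the note on terms error below).
# The response then gets flattened into one event per client and piped on to logstash.
curl -s -H 'Content-Type: application/json' \
  -XPOST 'http://localhost:9200/raw-logs-*/_search' -d @- <<EOF
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": { "gte": ${start}000, "lt": ${end}000, "format": "epoch_millis" }
    }
  },
  "aggs": {
    "per_client": {
      "terms": { "field": "client", "size": 1000 },
      "aggs": {
        "requests":   { "sum": { "field": "count" } },
        "unique_ips": { "cardinality": { "field": "clientip" } }
      }
    }
  }
}
EOF
```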
Despite the ugliness of the approach, it's achieving what we hoped. Queries to the raw data are light, and response times against the new index have dropped, in some cases (on equivalent queries), from around 40s to around 50ms. The precomputed index is now about 50GB, compared to about 40TB for the raw data. It's still early and we're expecting these numbers to improve further before it goes live; for example, we'll try to get the index size down to something that can be stored entirely in RAM.
We've used the fingerprint filter in logstash to enable safe replay of logs and "at least once" delivery. Every 5m we query for the two previous 5m windows, potentially updating the penultimate window in the process.
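The relevant corner of the logstash config looks roughly like this (field names and the key are placeholders). The point is that a replayed interval produces the same document IDs, so re-running a window overwrites documents rather than duplicating them:

```
filter {
  fingerprint {
    # Whatever uniquely identifies one precomputed record:
    # assumed here to be the client plus the interval start.
    source              => ["client", "interval_start"]
    concatenate_sources => true
    method              => "SHA256"
    key                 => "any-stable-key"
    target              => "[@metadata][fingerprint]"
  }
}

output {
  elasticsearch {
    hosts       => ["clientfacing-es:9200"]
    index       => "precomputed-%{+YYYY.MM}"
    document_id => "%{[@metadata][fingerprint]}"   # same id on replay, so overwrite rather than duplicate
  }
}
```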
Terms aggs now suffer much the same accuracy problem as is encountered with sharding, even though we're using a single shard, but we attempt to reduce the error by pulling twice the number of terms per interval as we'll actually query (a ratio that will be subject to tuning later). Cardinality and sum values (counts in the source data become sums in the new data) seem to benefit more unambiguously from our approach: completely accurate, and far, far quicker.
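To illustrate the count-becomes-sum point, a dashboard query against the precomputed index just re-aggregates the stored per-interval sums, along these lines (made-up index and field names again):

```sh
# Requests per client over the last 24 hours, served from the precomputed
# 5-minute records: summing the stored per-interval sums gives the same total
# as counting the raw events, but over ~288 small documents per client.
curl -s -H 'Content-Type: application/json' \
  -XPOST 'http://clientfacing-es:9200/precomputed-*/_search?pretty' -d @- <<'EOF'
{
  "size": 0,
  "query": { "range": { "interval_start": { "gte": "now-24h/h" } } },
  "aggs": {
    "per_client": {
      "terms": { "field": "client", "size": 50 },
      "aggs": {
        "requests": { "sum": { "field": "requests" } }
      }
    }
  }
}
EOF
```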
Some queries over longer periods can still take a few seconds. It's likely we'll do a second round of precomputation to create daily 'summaries', so that a query of, say, 2 years hits ~700 records rather than ~200,000. In one concrete example, calculating the cardinality of client IPs over a year for a particular site (with a cardinality of 4 million) takes about 12 seconds. Previously, that would've brought our cluster to its knees! So it's an improvement, but precomputing daily summaries from the 5-minutely data would make it trivial. We haven't tested precomputation of hashes, but from what I've read it will make little difference for a numeric field.
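That second round would essentially be the same job pointed at the precomputed index rather than the raw one, something like this per day (placeholder names, as before):

```sh
# Roll yesterday's 5-minute records up into daily summaries.
# Sums of sums stay exact; anything cardinality-related still needs the
# underlying values (or precomputed hashes) carried through to the daily record.
curl -s -H 'Content-Type: application/json' \
  -XPOST 'http://clientfacing-es:9200/precomputed-*/_search?pretty' -d @- <<'EOF'
{
  "size": 0,
  "query": { "range": { "interval_start": { "gte": "now-1d/d", "lt": "now/d" } } },
  "aggs": {
    "per_client": {
      "terms": { "field": "client", "size": 1000 },
      "aggs": {
        "requests": { "sum": { "field": "requests" } }
      }
    }
  }
}
EOF
```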
We chose to store one record per client per interval, which appears to be more performant than the alternatives but does mean reindexing everything if we change the structure. Having a separate type for each kind of data we want would've been a lot more flexible, but we figure that being able to issue a single query for all the data, and therefore use a single filter, is worth it.
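For concreteness, a single precomputed record (one client, one 5-minute interval) might look something like the following. The field names are illustrative rather than our actual mapping, and whether the distinct client IPs belong in the record at all depends on how you want to serve cardinality at query time, so treat that field as a sketch:

```json
{
  "client":         "client-a",
  "interval_start": "2016-05-01T10:05:00Z",
  "requests":       18342,
  "clientips":      ["198.51.100.7", "203.0.113.21"]
}
```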