Pre-aggregate docs on indexing

Hi,

We are currently indexing around 1B docs/day, and searching over that data is becoming very inefficient.

A suggested solution was to index every document into its 1m, 1h and 1d intervals (aggregated), so that when searching we can direct searches to only the coarsest suitable interval and search fewer documents.

Basically it means that for every doc like this:

{
  "timestamp": "2017-09-25T12:02:25.000Z",
  "dimension": "value",
  "metric": 10
}

We will index 3 documents like this:

// per minute
{
  "timestamp": "2017-09-25T12:02:00.000Z",
  "dimension": "value",
  "metric": 10
}
// per hour
{
  "timestamp": "2017-09-25T12:00:00.000Z",
  "dimension": "value",
  "metric": 10
}
// per day
{
  "timestamp": "2017-09-25T00:00:00.000Z",
  "dimension": "value",
  "metric": 10
}

And for every subsequent document that falls into the same bucket, we want to only update the "metric" field and leave all the dimensions the same.

The way we are approaching this is by setting the id of each aggregated doc to the JSON representation of all of the document's dimensions, concatenated with the bucket timestamp.

We understand that there could be some conflicts when multiple processes are trying to update the same document, so we will set retry_on_conflict high enough in Logstash to help mitigate this.
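
For illustration, here is a minimal sketch of what one of those per-minute updates could look like on 6.0. The index name, the type name "doc" and the _id format (dimension value plus bucket timestamp) are placeholder assumptions, and it assumes "metric" should be summed:

// per-minute bucket: create it if missing, otherwise add to the metric
POST metrics-1m-2017.09.25/doc/value|2017-09-25T12:02:00.000Z/_update?retry_on_conflict=5
{
  "scripted_upsert": true,
  "script": {
    "source": "ctx._source.metric += params.metric",
    "params": { "metric": 10 }
  },
  "upsert": {
    "timestamp": "2017-09-25T12:02:00.000Z",
    "dimension": "value",
    "metric": 0
  }
}

With "scripted_upsert": true the script also runs for the first document of a bucket (starting from the "upsert" source), so the insert and the increment go through the same code path.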

Is this a good approach? Are there any downsides that we should think about now?

Other than doing the aggregation client side before sending to Elasticsearch, or taking the raw counts into index A and then aggregating into a new, separate index, I'd say so.

But what version are you on, and what do you mean by "becoming very inefficient when searching"?

This sounds quite inefficient to me. The reason for this is that updating a document in Elasticsearch is more expensive than indexing a new document, and it sounds like you would be replacing a single indexing operation with 3 updates.

Another approach could be to just index the data into a raw index like you do now, and then every so often have an external process that aggregates data written over the last period and writes it to the separate indices. This means that the indices that cover a specific interval will lag a little, but will result in considerably fewer updates.
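
As a rough illustration of that approach (index, field and interval names here are only placeholders, and it assumes the dimension fields are mapped as keyword), the periodic job could run a date_histogram over the last window of the raw index and then bulk-index the resulting buckets into the per-minute index:

// aggregate the last 5 minutes of raw data into per-minute buckets
POST raw-2017.09.25/_search
{
  "size": 0,
  "query": {
    "range": { "timestamp": { "gte": "now-5m/m", "lt": "now/m" } }
  },
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "timestamp", "interval": "1m" },
      "aggs": {
        "per_dimension": {
          "terms": { "field": "dimension", "size": 10000 },
          "aggs": {
            "metric_sum": { "sum": { "field": "metric" } }
          }
        }
      }
    }
  }
}

With several dimensions you would nest one terms aggregation per dimension, so the number of buckets (and the cost of the job) grows with every dimension you add.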

Hey @warkolm & @Christian_Dahlqvist,
Thanks for the replies.

Today we are indeed using the "index raw and aggregate with an external process" method.
It's working fine, but it has its own issues:

  1. If the raw data is lagging, we need to make sure we re-run the aggregation process.
  2. At some point the aggregation process becomes the bottleneck. Because we have multiple dimensions, we basically need to run an aggregation per interval per dimension combination, which creates a big table that we then need to index.
    That aggregation query is heavy on its own, and sometimes the aggregation job takes longer to run than the period it covers, which means its lag increases instead of decreasing.

I understand that doing it like this (with the updates) is more expensive; that's mostly the reason I asked this question in the first place.
The question is how much more expensive it is. 10%? 50%? As I'm using daily indices (maybe I can also use hourly indices for the per-hour aggregations), each update's lookup is over a relatively small index.

We are currently running on 2.4, but we have deployed 6.0-beta2 and are sending all traffic to both clusters so we can measure the performance and see what needs to be changed. (This question relates to 6.0-beta2 only.)

About the "very inefficient": I mean that queries spanning more than a few hours take a long time (more than 10s).
It's mostly due to the IOPS limit on Azure, which unfortunately we have no way to work around (other than adding more machines).

How many dimensions do you have? How and how often are you currently aggregating data? This can probably be done in a few different ways, so it may be possible to find a more efficient approach.

I suspect replacing a single index operation with 3 updates would result in more than 3 times the indexing load. Frequent updates should also result in more extensive merging. Benchmarking this should however be reasonably easy and would give a much more accurate answer.
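
One way to get that answer (names below are placeholders) is to replay the same sample of data through the _bulk API twice and compare throughput and merge activity: once as plain index actions, and once as the scripted updates (per-minute shown; per-hour and per-day are analogous):

// variant A: one index action per source doc
{ "index": { "_index": "raw-2017.09.25", "_type": "doc" } }
{ "timestamp": "2017-09-25T12:02:25.000Z", "dimension": "value", "metric": 10 }

// variant B: a scripted update per interval per source doc
{ "update": { "_index": "metrics-1m-2017.09.25", "_type": "doc", "_id": "value|2017-09-25T12:02:00.000Z" } }
{ "scripted_upsert": true, "script": { "source": "ctx._source.metric += params.metric", "params": { "metric": 10 } }, "upsert": { "timestamp": "2017-09-25T12:02:00.000Z", "dimension": "value", "metric": 0 } }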

At the moment we have 8 dimensions.
The aggregation actually helps reduce the number of docs to search quite a bit (~6m -> ~300k), so I expect search performance to also increase considerably.

I am still testing it out, trying to see where the bottleneck is (if any).
