Hi,
We are currently indexing around 1B docs/day, which is making searches very inefficient.
A suggested solution is to index every document into 1m, 1h, and 1d interval buckets (aggregated), so that searches can be directed to the coarsest interval that fits the query and scan far fewer documents (an example query is sketched after the documents below).
Basically it means that for every doc like this:
{
"timestamp": "2017-09-25T12:02:25.000Z",
"dimension": "value",
"metric": 10
}
We will index 3 documents like this:
// per minute
{
"timestamp": "2017-09-25T12:02:00.000Z",
"dimension": "value",
"metric": 10
}
// per hour
{
"timestamp": "2017-09-25T12:00:00.000Z",
"dimension": "value",
"metric": 10
}
// per day
{
"timestamp": "2017-09-25T00:00:00.000Z",
"dimension": "value",
"metric": 10
}
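That way a long time range only has to touch the daily documents. A rough sketch of the kind of query this enables, assuming a hypothetical daily index named "metrics-1d" (the index name, field names, and date range are placeholders, not our actual setup):

// sum the metric for one dimension value over a multi-week range,
// reading only the pre-aggregated daily docs
GET metrics-1d/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "dimension": "value" } },
        { "range": { "timestamp": { "gte": "2017-09-01", "lt": "2017-09-25" } } }
      ]
    }
  },
  "aggs": {
    "total_metric": { "sum": { "field": "metric" } }
  }
}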
Then, for every subsequent document that falls into the same interval, we want to update only the "metric"
field and leave all the dimensions the same.
The way we are approaching this is to set the _id
of each aggregated doc to the JSON representation of all of the document's dimensions, concatenated with the bucket timestamp.
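To make that concrete, here is a sketch of the update each aggregated doc would receive at the Elasticsearch level. The index/type names, the _id format, and the summing script are assumptions (this assumes the metric should be accumulated into the bucket rather than overwritten, and the exact script syntax depends on the ES version):

// _id example: {"dimension":"value"}|2017-09-25T12:00:00.000Z
// scripted upsert: create the bucket on first sight, otherwise add to its metric
POST metrics-1m/doc/<id>/_update
{
  "script": {
    "source": "ctx._source.metric += params.metric",
    "params": { "metric": 10 }
  },
  "upsert": {
    "timestamp": "2017-09-25T12:02:00.000Z",
    "dimension": "value",
    "metric": 10
  }
}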
We understand that there could be conflicts when multiple processes try to update the same document, so we will set retry_on_conflict
high enough in Logstash to help mitigate this.
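For reference, at the Elasticsearch level this corresponds to the retry_on_conflict parameter on each update; a single-request equivalent would look roughly like this (the value 5 is just a placeholder):

// same scripted-upsert body as above, retried up to 5 times on version conflicts
POST metrics-1m/doc/<id>/_update?retry_on_conflict=5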
Is this a good approach? Are there any downsides that we should think about now?