Time-based Elasticsearch index deletion

We are planning to use Elasticsearch for our log data.

We have 3 TB of data per day. The logs are generated in JSON format (they are a kind of summary), so we don't need Logstash and can index documents directly into ES. We also want to keep only one day of data to start with. I have looked at Curator, which deletes indices older than a given time frame. All the examples mention Logstash generating time-based indices, and that seems to be the key to deleting indices based on time.

How can I create such time-based indices (the same as Logstash does) without having to install Logstash, so that a search query returns aggregated results from all the indices and, when deleting, I can delete just the oldest one-hour index?

Logstash doesn't do anything "magical" with its indices, and creating and using time-based indices yourself is easy to do.

What most people do is set up their indices at the same granularity as their delete batch, because Elasticsearch is much more efficient at deleting whole indices than at something like deleting by query. So if you plan to delete 1 hour at a time, you might create indices that each encompass 1 hour of data. That is, when you index a document into Elasticsearch, you'd do something like

POST /myindexprefix-2018-02-16.0700/doc
{
   "field1": "value1",
  ...
}

Then, once myindexprefix-2018-02-16.0700 reaches the maximum age, you delete the whole index, which should be a very fast operation.
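As a rough sketch of the naming and expiry logic described above, the following Python computes the hourly index name for a document timestamp and picks out the indices that have aged past the retention window. The function names, the UTC convention, and the one-day retention default are assumptions made for illustration; only the myindexprefix- naming pattern comes from the example above.

```python
from datetime import datetime, timedelta, timezone

PREFIX = "myindexprefix-"        # prefix taken from the example above
SUFFIX_FORMAT = "%Y-%m-%d.%H00"  # produces suffixes such as 2018-02-16.0700


def index_name_for(ts: datetime) -> str:
    """Return the hourly index name for a timezone-aware timestamp (in UTC)."""
    return PREFIX + ts.astimezone(timezone.utc).strftime(SUFFIX_FORMAT)


def expired_indices(names, now, retention=timedelta(days=1)):
    """Return the index names whose whole hour lies outside the retention window."""
    expired = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # not one of our time-based indices
        try:
            start = datetime.strptime(name[len(PREFIX):], SUFFIX_FORMAT)
        except ValueError:
            continue  # suffix does not match the hourly pattern
        start = start.replace(tzinfo=timezone.utc)
        # The index covers [start, start + 1h); it is expired once that
        # whole hour is older than the retention window.
        if start + timedelta(hours=1) <= now - retention:
            expired.append(name)
    return expired
```

Each name returned by expired_indices would then be removed with a single DELETE /<index name> request, which is the whole-index delete described above.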

You can also query all of the indices at once by doing something like the following, given the previous example

GET /myindexprefix-*/_search
{
  ...
}

Elasticsearch will then resolve the * into all the indices matching the given pattern.

You can also get fancier if you want. Elasticsearch supports date math in index names in the URL, so you can take now, round it down to the nearest hour, and have that used automagically in the index name; the Elasticsearch documentation on date math in index names has examples.
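A date math index name has the form <static_name{date_math_expr{date_format}}> and its special characters have to be percent-encoded when it is used in a request path. The sketch below builds such a name for an hourly index and encodes it; the helper name and the yyyy-MM-dd.HH format are assumptions chosen for illustration (note that this format gives suffixes like 2018-02-16.07 rather than the .0700 used earlier), and date math is resolved in UTC unless a time zone is given.

```python
from urllib.parse import quote


def date_math_index(prefix: str, rounding: str = "h",
                    date_format: str = "yyyy-MM-dd.HH") -> str:
    """Build a percent-encoded date math index name for use in a URL path."""
    # Unencoded form, e.g. <myindexprefix-{now/h{yyyy-MM-dd.HH}}>
    expression = "<" + prefix + "{now/" + rounding + "{" + date_format + "}}>"
    # safe="" makes quote() encode "/" as well as "<", ">", "{" and "}"
    return quote(expression, safe="")


print(date_math_index("myindexprefix-"))
```

The encoded string goes where the literal index name went before, for example in POST /<encoded name>/doc.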

I will say that with hourly indices, you're going to want to be careful about total shard count. With hourly indices, you'd be creating 24 * (number of shards per index) * (number of unique prefixes) * (number of days of retention) shards. With hourly indices, even just a few different prefixes and a few weeks or months of retention can make that number very large, and very large shard counts can cause problems. This is one of the reasons why many people choose daily, weekly, or monthly index names instead of hourly ones. As long as you have a short retention period, not many prefixes, and a low shard count per index, you should be fine, but you may want to reconsider if that changes.
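To make the arithmetic concrete, here is the formula above as a small Python function. The function name and the sample numbers (5 shards per index, 3 prefixes, 30 days) are made up for illustration, and replicas are not counted, just as in the formula.

```python
def total_shards(shards_per_index: int, prefixes: int, retention_days: int,
                 indices_per_day: int = 24) -> int:
    """Shards held at steady state; indices_per_day is 24 for hourly indices."""
    return indices_per_day * shards_per_index * prefixes * retention_days


# Hourly indices: 24 * 5 * 3 * 30
print(total_shards(5, 3, 30))                      # 10800
# Same setup with daily indices: 1 * 5 * 3 * 30
print(total_shards(5, 3, 30, indices_per_day=1))   # 450
# One day of retention, one prefix, one shard per index
print(total_shards(1, 1, 1))                       # 24
```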

