How to achieve optimal index sizes in a hot/warm/cold architecture

We have been using hot/warm/cold data nodes for some time. We index a large number of sources into daily, weekly, or monthly indexes with Logstash, trying to avoid small indexes. From https://www.elastic.co/webinars/optimizing-storage-efficiency-in-elasticsearch I learned that hot nodes should not hold longer-term data (monthly indexes, for example). But how can this be achieved: daily indexes on hot nodes, monthly indexes on warm nodes? AFAIK it requires reindexing operations using Curator, with some kind of pattern for:

  • getting somerandomname-2018.12.04, somerandomname-2018.12.05, ...
  • reindexing them into somerandomname-2018.12 while keeping that index somehow hidden from users (so users do not see the data twice)
  • activating somerandomname-2018.12
  • removing the daily indexes
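
The first two steps above can be sketched in Python. This is only an illustration of the grouping logic, assuming every daily index ends in a -YYYY.MM.DD suffix; the helper names are made up for this example, and it only builds the request bodies for the _reindex API without sending them. It does not address hiding the monthly index from users while the copy runs.

```python
import re
from collections import defaultdict

# Matches names ending in a daily date suffix, e.g. "somerandomname-2018.12.04".
DAILY = re.compile(r"^(?P<prefix>.+)-(?P<month>\d{4}\.\d{2})\.\d{2}$")


def group_daily_by_month(index_names):
    """Map each monthly target name to the sorted daily indexes that feed it."""
    groups = defaultdict(list)
    for name in index_names:
        m = DAILY.match(name)
        if m:  # skip anything that is not a daily index
            groups[f"{m['prefix']}-{m['month']}"].append(name)
    return {target: sorted(sources) for target, sources in groups.items()}


def reindex_body(target, sources):
    """Request body for POST _reindex that copies all sources into target."""
    return {"source": {"index": sources}, "dest": {"index": target}}


names = [
    "somerandomname-2018.12.04",
    "somerandomname-2018.12.05",
    "somerandomname-2018.12",  # already monthly, so it is ignored
]
plan = group_daily_by_month(names)
bodies = [reindex_body(target, sources) for target, sources in plan.items()]
```

Because the grouping is derived from the index names themselves, nobody has to pick target names by hand, which matters with a large number of indexes.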

How can this be achieved? I did not find any docs about handling this, especially for a large number of indexes. Manually choosing index names is not possible (which also means rollover is unusable).

It sounds like you are creating a lot of indices with relatively low daily volumes, which means it will take time for them to reach a size suitable for longer term retention. Is this correct? How many types of indices are you creating? How much data is written to each per day?

If this is the case, it probably makes sense to consider consolidating indices, since reindexing, as you pointed out, can be expensive.

Yes, but it depends. Indexes are named like "appcluster1-winevent-filebeat-6.5.3-2018.12.05", so I can apply RBAC to them by name: roleXY can access indexes matching appcluster1-*. It also means I can set a different retention if I know that, for example, appcluster1 logs are required for a year.

This also means that some indexes receive gigabytes of data per day and others only kilobytes. For the small indexes I currently use a YEAR.MONTH pattern.

What kind of consolidation could I do?

This is a scenario where document-level security is very useful. I have seen users basically group data into indices by retention period and then define roles based on fields in the data. This is efficient and gives great flexibility around optimising shard sizes.
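
As a rough sketch of what such a role could look like, the snippet below builds a role body with a document-level security query. The index pattern "logs-retain-1y-*" and the field name "cluster" are hypothetical, chosen only for illustration; the body would be sent to the security role API of your Elasticsearch version.

```python
import json


def dls_role(index_patterns, field, values):
    """Build a role body that grants read access only to documents whose
    `field` matches one of `values` (document-level security)."""
    return {
        "indices": [
            {
                "names": index_patterns,
                "privileges": ["read"],
                # DLS filter: only matching documents are visible to the role.
                "query": {"terms": {field: values}},
            }
        ]
    }


# Hypothetical shared index grouped by retention, filtered on a hypothetical field.
role = dls_role(["logs-retain-1y-*"], "cluster", ["appcluster1"])
print(json.dumps(role, indent=2))
```

With this layout the index name encodes only the retention period, and the access rule that used to live in the index name moves into a field on each document.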

Without this, I am not sure I can think of any easy solution if you have lots of varying streams that need fine-grained access controls.

I thought about it, but I am very worried about performance here. Currently our indexes consume 20 TB of disk space (summed over all nodes). A typical user is a member of a few roles, which grant access to a very limited number of indexes. I have seen users sometimes use the "*" pattern in Kibana without a big performance penalty, as Elasticsearch reads only a couple of indexes.

Wouldn't document-level security mean that Elasticsearch has to process all the indexes for each query (probably just in-memory data, as permissions would be controlled by keyword fields)? In that case it would read 2000 indexes instead of 10, for example.

You can consolidate into a number of shared indices based on retention and types of data. Exactly how to best do this will depend on your data. Doing so means it is likely that more and larger shards will be queried, but at the same time a lot of data will be filtered out and not processed.

I am afraid the index pattern we are using is the best possible in our case. For "appcluster1-winevent-filebeat-6.5.3-2018.12.05" we have teams that need access to all appcluster1-* indexes, teams that need access to all -winevent- indexes, teams that need access to appcluster1-* but not the -winevent- indexes, and so on, across a large amount of data.

But this also leads to having a number of small indexes and to keeping monthly data on hot data nodes.
So now I am looking for ways to keep older data in bigger shards. What I can do now is force merge all the indexes on cold data nodes, since the cluster has started to be slow due to the large number of open segments on cold data nodes. But the next step is to somehow make the shards bigger, because force merging solves "just" the segments issue.

It is quite hard to solve this properly: the solution you mentioned would slow down queries because the documents in the shared indexes have to be processed, while on the other hand having a very large number of small indexes also slows things down.

From a performance perspective it would probably be best to have indexes with 20-40 GB per shard on hot nodes, up to 60 GB per shard on warm nodes, and maybe bigger indexes on cold nodes.
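
To make those targets concrete, here is a small arithmetic sketch. The 300 GB index size is an invented example, and the per-tier targets are just the upper bounds I mentioned above, not official recommendations.

```python
import math


def primary_shards(index_size_gb, target_shard_gb):
    """Smallest primary shard count that keeps each shard at or below the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


# (tier, assumed upper bound for shard size in GB)
for tier, target in [("hot", 40), ("warm", 60)]:
    print(tier, primary_shards(300, target))
```

An index that receives only kilobytes per day always ends up with a single tiny shard under this calculation, which is why the small indexes are the real problem here.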
