How to achieve optimal size of indexes for hot/warm/cold architecture

vb4t · December 22, 2018, 12:50am

We have been using hot/warm/cold data nodes for some time. We index a large number of sources into daily, weekly, or monthly indexes with Logstash, trying to avoid small indexes. From https://www.elastic.co/webinars/optimizing-storage-efficiency-in-elasticsearch I gathered that hot nodes should not hold longer-term indexes (monthly ones, for example). But how can this be achieved, with daily indexes on hot nodes and monthly indexes on warm nodes? AFAIK it requires reindexing operations using Curator with some kind of naming pattern for:

getting somerandomname-2018.12.04, somerandomname-2018.12.05, ...
reindexing them into somerandomname-2018.12 while somehow keeping that index hidden from users (so users do not see the data twice)
activating somerandomname-2018.12
removing the daily indexes

How can this be achieved? I did not find any docs about handling this, especially for a large number of indexes. Choosing index names manually is not feasible, which also rules out rollover.
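
For illustration, the steps listed above could be planned with a small script. This is only a sketch under assumptions that are not part of the original question: it assumes daily names end in -YYYY.MM.DD, and that users search through an alias (here hypothetically named after the index prefix) rather than through the index names directly, so the monthly index stays invisible until the alias is switched. The function name monthly_plan is made up. The bodies follow the shape of the _reindex and _aliases APIs; nothing is sent to a cluster.

```python
import re
from collections import defaultdict

# Assumed naming scheme: "<prefix>-YYYY.MM.DD", e.g. "somerandomname-2018.12.04".
DAILY = re.compile(r"^(?P<prefix>.+)-(?P<month>\d{4}\.\d{2})\.\d{2}$")


def monthly_plan(index_names):
    """Group daily indexes by monthly target and build the request bodies."""
    groups = defaultdict(list)
    for name in index_names:
        m = DAILY.match(name)
        if m:  # anything that is not a daily index is ignored
            groups[(m["prefix"], m["month"])].append(name)

    plan = {}
    for (prefix, month), sources in sorted(groups.items()):
        sources.sort()
        target = f"{prefix}-{month}"
        # Hypothetical alias that users query instead of the raw index names.
        alias = prefix
        actions = [{"add": {"index": target, "alias": alias}}]
        actions += [{"remove_index": {"index": s}} for s in sources]
        plan[target] = {
            # Body for POST _reindex: copy all daily indexes into the monthly one.
            "reindex": {"source": {"index": sources}, "dest": {"index": target}},
            # Body for POST _aliases: expose the monthly index and drop the
            # daily ones in a single atomic call.
            "aliases": {"actions": actions},
        }
    return plan
```

Because the alias switch and the removal of the daily indexes happen in one _aliases call, searches through the alias should not see the data twice. If users search by index pattern instead of an alias, this sketch does not hide the monthly index while it is being filled.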

Christian_Dahlqvist · December 22, 2018, 1:02am

It sounds like you are creating a lot of indices with relatively low daily volumes, which means it will take time for them to reach a size suitable for longer term retention. Is this correct? How many types of indices are you creating? How much data is written to each per day?

If this is the case, it probably makes sense to consider consolidating indices, since reindexing, as pointed out, can be expensive.

vb4t · December 22, 2018, 1:10am

Yes, but it depends. Indexes are named like "appcluster1-winevent-filebeat-6.5.3-2018.12.05", so I can set RBAC on them by name: roleXY can access indexes matching appcluster1-*, for example. It also means I can set a different retention if I know that, say, appcluster1 logs are required for a year.

This also means that some indexes receive gigabytes of data per day and others only kilobytes. For the small indexes I am currently using a YEAR.MONTH pattern.

vb4t · December 22, 2018, 1:11am

What kind of consolidation could I do?

Christian_Dahlqvist · December 22, 2018, 1:20am

This is a scenario where document-level security is very useful. I have seen users basically group data into indices by retention period and then define roles based on fields in the data. This is efficient and gives great flexibility around optimising shard sizes.
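
As a rough illustration of roles based on fields in the data, a role body with a document-level security query might be built like this. It is a sketch, not a recommended layout: the shared index pattern logs-1y-* and the keyword field cluster are hypothetical, and the helper name dls_role is made up. The body follows the general shape accepted by the security role API (an indices entry with names, privileges, and query).

```python
import json


def dls_role(index_patterns, field, value):
    """Build a role body that limits reads to documents where field == value."""
    return {
        "indices": [
            {
                "names": index_patterns,
                "privileges": ["read"],
                # Document-level security: only documents matching this
                # query are visible to users holding the role.
                "query": json.dumps({"term": {field: value}}),
            }
        ]
    }


# Hypothetical example: a team that may only read appcluster1 documents
# from the shared one-year-retention indexes.
role = dls_role(["logs-1y-*"], "cluster", "appcluster1")
print(json.dumps(role, indent=2))
```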

Without this, I am not sure I can think of any easy solution if you have lots of varying streams that need fine-grained access controls.

vb4t · December 22, 2018, 1:35am

I thought about it, but I am very worried about performance here. Currently our indexes consume 20 TB of disk space (total across all nodes). A typical user is a member of a few roles, which grant access to a very limited number of indexes. I have seen users sometimes use the "*" pattern in Kibana without big performance penalties, as Elasticsearch reads only a couple of indexes.

Wouldn't document-level security mean that Elasticsearch has to process all the indexes for each query (probably just in-memory data, as permissions would be controlled by keyword fields)? In that case it would read 2000 indexes instead of 10, for example.

Christian_Dahlqvist · December 22, 2018, 1:44am

You can consolidate into a number of shared indices based on retention and types of data. Exactly how to best do this will depend on your data. Doing so means it is likely that more and larger shards will be queried, but at the same time a lot of data will be filtered out and not processed.

vb4t · December 22, 2018, 1:57am

I am afraid the index pattern we are using is the best possible in our case. With names like "appcluster1-winevent-filebeat-6.5.3-2018.12.05", we have teams that need access to all appcluster1-* indexes, teams that need access to all -winevent- indexes, teams that need access to appcluster1-* indexes except the -winevent- ones, and so on, over a large amount of data.

But this also leads to having a number of small indexes and to keeping monthly data on hot data nodes.
So now I am looking for ways to keep older data in bigger shards. What I can do now is force merge all the indexes on cold data nodes, since the cluster has started to be slow, apparently because of the large number of open segments on those nodes. But the next step is to somehow make the shards bigger, because force merging solves "just" the segments issue.
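
As a minimal sketch of the force merge step, the helper below only builds the request line for the _forcemerge API with max_num_segments set; it does not contact a cluster. The helper name forcemerge_request is made up, and merging down to one segment is an assumption that suits indexes which are no longer written to.

```python
from urllib.parse import quote


def forcemerge_request(index_names, max_num_segments=1):
    """Return (method, path) for a force merge of the given indexes."""
    # Index names and wildcard patterns are comma-separated in the path.
    target = ",".join(quote(name, safe="*") for name in index_names)
    return "POST", f"/{target}/_forcemerge?max_num_segments={max_num_segments}"


method, path = forcemerge_request(["appcluster1-winevent-filebeat-6.5.3-2018.11.*"])
print(method, path)
```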

It is quite hard to solve this properly: the solution you mentioned would slow down queries, because the documents in the shared indexes need to be filtered, while on the other hand a very large number of small indexes also slows things down.

vb4t · December 22, 2018, 2:02am

From a performance standpoint, it would probably be best to have 20-40 GB per shard on hot nodes, up to 60 GB per shard on warm nodes, and maybe bigger indexes on cold nodes.
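
To make the arithmetic behind such targets concrete, here is a small sketch. The thresholds (switch to weekly when a shard takes 7 or more days to fill, to monthly at 30 or more) and the function names are hypothetical choices for illustration; the 30 GB default is simply the middle of the 20-40 GB range above.

```python
import math


def primary_shards(index_size_gb, target_shard_gb):
    """Number of primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def suggest_period(daily_gb, target_shard_gb=30):
    """Pick an index period so that a single shard gets close to the target size."""
    if daily_gb <= 0:
        raise ValueError("daily_gb must be positive")
    days_to_fill = target_shard_gb / daily_gb
    if days_to_fill < 7:
        return "daily"
    if days_to_fill < 30:
        return "weekly"
    return "monthly"


print(primary_shards(100, 40))  # a 100 GB index with 40 GB shards
print(suggest_period(10))       # a source writing 10 GB per day
print(suggest_period(0.001))    # a source writing about 1 MB per day
```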

system · January 19, 2019, 2:02am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
