I am quite new to the ELK stack, so I am trying to understand the concepts and best practices for a new project I am responsible for.
I need some advice and an overview of the indexing strategy for indexing big data.
I have 80 TB of historical data to index into ELK at the beginning, and then around 50 GB of new data daily. What is the correct approach in terms of resource planning and indexing? As far as I understand, shard size is very critical in this kind of planning. I am planning to use 50 GB shards, which means I need 1,600 shards for the 80 TB of data, plus more for the 50 GB of daily data. Looked at this way, I need huge resources. I saw that warm and cold tiers are used for infrequently accessed data to keep resource usage efficient.
If anyone can suggest how to manage this much data in a resource-friendly way, I would really appreciate it.
There are many different approaches people will suggest, and you have to decide based on your own testing and practice.
There are quite a few questions to ask yourself:

- How many nodes will you have in your cluster?
- How often is this data queried?
- How much storage will each host have?
- Is there a way to know which data is old and which is new? etc.
For example, if you know that 90% of the 80 TB is old and will hardly ever be accessed, you can place it in warm storage with a larger shard size, and put the more frequently accessed data on faster storage with smaller shards, and more of them.
One index can be divided into multiple shards; the rule of thumb from Elastic is generally a 30-50 GB shard size.
What I have is an index with four shards, each 25 GB. Once the data is 30 days old and people are no longer scanning it much, we use ILM to move it to warm storage and shrink it to two shards (each 50 GB now).
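A setup like that can be sketched as an ILM policy (a rough example, assuming a recent 7.x/8.x cluster; the policy name, sizes, and the `data: warm` attribute are illustrative and need to match your own node attributes and index templates):

```json
PUT _ilm/policy/my-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "25gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 2 },
          "allocate": { "require": { "data": "warm" } }
        }
      }
    }
  }
}
```

The rollover keeps hot shards small for write throughput, and the shrink in the warm phase halves the shard count once the data is read-mostly.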
I was thinking to use warm and cold shards too. So as a first step i will understand how much of my big data will be frequently used. Onu more question, is there any limitation in warm and cold shard size ? I mean is it possible to give 500 GB for a warm or cold shard ? Is there any potential risks or leaks to use that much sized shards ?
The size of the shards, the data, and the queries run determine the minimum latency you can achieve. I do not think there is much benefit in going to jumbo shards. I would keep the shard size for warm and cold tiers at around 50 GB and make sure to force merge down to a single segment. With much larger shards you move around very large chunks of data whenever the cluster needs to rebalance, which can cause problems.
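The force merge step can also be automated as part of the warm phase of an ILM policy (a sketch; the policy name is illustrative, and `max_num_segments: 1` merges each shard down to a single segment as described above):

```json
PUT _ilm/policy/warm-forcemerge-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      }
    }
  }
}
```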
I wouldn't go beyond a 50 GB shard size.
Remember that too many shards will cause problems as well.
There is a limit of 1,000 shards per node; beyond that you will start seeing issues.
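If I am not mistaken, that limit corresponds to the `cluster.max_shards_per_node` setting, which defaults to 1,000 per non-frozen data node. It can be raised as a cluster setting, but hitting it is usually a sign to rethink the sharding strategy rather than something to configure around:

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 1500
  }
}
```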
I found an option named "mmapfs". As far as I understand, it provides memory-mapped, disk-based storage, so does it lower my shard count?
For illustration:
If I index 80 TB of data with shards of 50 GB or less, I will need at least 1,600 shards.
If I need 1.5 vCPUs per shard (as recommended), that means 2,400 vCPUs, and the same again for replicas brings it to 4,800 vCPUs.
Also, 1,600 shards / 25 = 64 GB of RAM.
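That back-of-the-envelope arithmetic can be written out like this (the 1.5 vCPUs per shard and 25 shards per GB of RAM ratios are the figures quoted above, used here as assumptions rather than official recommendations):

```python
# Rough sizing estimate using the figures from this thread.
# Assumptions: 1.5 vCPUs per shard and 25 shards per GB of RAM,
# as quoted above -- validate these against your own benchmarks.

total_data_gb = 80 * 1000        # 80 TB of historical data (decimal units)
shard_size_gb = 50               # target primary shard size

primary_shards = total_data_gb // shard_size_gb    # 1600 primary shards
vcpu_primaries = primary_shards * 1.5              # 2400 vCPUs
vcpu_with_replica = vcpu_primaries * 2             # 4800 vCPUs with replicas
ram_gb = primary_shards / 25                       # 64 GB of RAM

print(primary_shards, vcpu_primaries, vcpu_with_replica, ram_gb)
```

Note this only counts the primaries; as pointed out below, configuring a replica doubles the shard count as well.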
What difference would it make if I indexed the data with the mmapfs option?
If we make the simplified assumption that the amount of raw data will translate directly to size on disk (it can be smaller or larger depending on how you map your data and which index settings you use), you generally want to have a replica configured for most shards, which would instead lead to 3,200 shards. A lot of improvements have gone into recent versions to make handling of large numbers of shards more efficient, so this number is not in itself a problem.
Where does this recommendation come from? It does not sound accurate for this type of use case.
I would recommend sticking with the default settings for this. I believe this is an expert setting that will not affect the sizing.
OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.
(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)