I am quite new to the ELK stack, so I am trying to understand the concepts and best practices for a new project I am responsible for.
I need some advice and an overview of the indexing strategy for indexing big data.
I have 80 TB of historical data to index into ELK at the beginning, and then around 50 GB of new data daily. What is the correct approach in terms of resource planning and indexing? As far as I understand, shard size is very critical in this kind of planning. I am planning to use 50 GB shards, which means I need 1,600 shards for the 80 TB of data, plus more for the 50 GB of daily data. Looked at this way, I need huge resources. I saw that warm and cold tiers are used for infrequently accessed data to keep resource usage efficient.
If anyone can suggest how to manage this much data in a resource-friendly way, I would really appreciate it.
There are many different approaches people will suggest, and you have to decide based on your own testing and practice.
There are quite a few questions to ask yourself:

- How many nodes will you have in your cluster?
- How often is this data queried?
- How much storage will each host have?
- Is there a way to know which data is old and which is new? etc.
For example, if you know that 90% of the 80 TB is old and will hardly ever be accessed, you can place it in warm storage with a larger shard size, and put the more frequently accessed data on faster storage with smaller shards, and more of them.
One index can be divided into multiple shards; the rule of thumb from Elastic is generally a 30-50 GB shard size.
What I have is an index with four shards, each 25 GB. Once the data is 30 days old and people are no longer scanning it much, we use ILM to move it to warm storage and shrink it to two shards (each 50 GB now).
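A setup like that can be sketched as an ILM policy (a rough example, assuming a recent 7.x/8.x cluster; the policy name, sizes, and the `data: warm` attribute are illustrative and need to match your own node attributes and index templates):

```json
PUT _ilm/policy/my-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "25gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 2 },
          "allocate": { "require": { "data": "warm" } }
        }
      }
    }
  }
}
```

The rollover keeps hot shards small for write throughput, and the shrink in the warm phase halves the shard count once the data is read-mostly.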
I was thinking to use warm and cold shards too. So as a first step i will understand how much of my big data will be frequently used. Onu more question, is there any limitation in warm and cold shard size ? I mean is it possible to give 500 GB for a warm or cold shard ? Is there any potential risks or leaks to use that much sized shards ?
The size of the shards, the data, and the queries run determine the minimum latency you can achieve. I do not think there is much benefit in going to jumbo shards. I would keep the shard size for warm and cold tiers at around 50 GB and make sure to force merge down to a single segment. With much larger shards you move around very large chunks of data whenever the cluster needs to rebalance, which can cause problems.
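The force merge step can also be automated as part of the warm phase of an ILM policy (a sketch; the policy name is illustrative, and `max_num_segments: 1` merges each shard down to a single segment as described above):

```json
PUT _ilm/policy/warm-forcemerge-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      }
    }
  }
}
```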
I wouldn't go beyond a 50 GB shard size.
Remember that too many shards will cause problems as well.
There is a limit of 1,000 shards per node; beyond that you will start seeing issues.
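If I am not mistaken, that limit corresponds to the `cluster.max_shards_per_node` setting, which defaults to 1,000 per non-frozen data node. It can be raised as a cluster setting, but hitting it is usually a sign to rethink the sharding strategy rather than something to configure around:

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 1500
  }
}
```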
I found an option named "mmapfs". As far as I understand, it provides memory-mapped, disk-based storage, so does it lower my shard count?
For illustration:
If I index 80 TB of data with shards of 50 GB or less, I will need at least 1,600 shards.
If I need 1.5 vCPUs per shard (as recommended), that means 2,400 vCPUs, and the same again for replicas brings it to 4,800 vCPUs.
Also, 1,600 shards / 25 = 64 GB of RAM.
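That back-of-the-envelope arithmetic can be written out like this (the 1.5 vCPUs per shard and 25 shards per GB of RAM ratios are the figures quoted above, used here as assumptions rather than official recommendations):

```python
# Rough sizing estimate using the figures from this thread.
# Assumptions: 1.5 vCPUs per shard and 25 shards per GB of RAM,
# as quoted above -- validate these against your own benchmarks.

total_data_gb = 80 * 1000        # 80 TB of historical data (decimal units)
shard_size_gb = 50               # target primary shard size

primary_shards = total_data_gb // shard_size_gb    # 1600 primary shards
vcpu_primaries = primary_shards * 1.5              # 2400 vCPUs
vcpu_with_replica = vcpu_primaries * 2             # 4800 vCPUs with replicas
ram_gb = primary_shards / 25                       # 64 GB of RAM

print(primary_shards, vcpu_primaries, vcpu_with_replica, ram_gb)
```

Note this only counts the primaries; as pointed out below, configuring a replica doubles the shard count as well.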
What difference would it make if I indexed the data with the mmapfs option?
If we make the simplified assumption that the amount of raw data will translate directly to size on disk (it can be smaller or larger depending on how you map your data and which index settings you use), you generally want to have a replica configured for most shards, which would instead lead to 3,200 shards. A lot of improvements have gone into recent versions to make handling of large numbers of shards more efficient, so this number is not in itself a problem.
Where does this recommendation come from? It does not sound accurate for this type of use case.
I would recommend sticking with the default settings for this. I believe this is an expert setting that will not affect the sizing.
OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.
(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)