I am quite new to Elasticsearch.
We are migrating data from a licensed product to Elastic.
But the amount of data is huge: it's about 100 TB/month.
And we have to index data for 10 years. That is 120 months at about 100 TB each, so roughly 12 petabytes of data in total.
The primary objective of indexing the data is to search it.
Each record has a date field, and every search specifies a date or a date range.
I am planning to create one index per day using ILM, so that the index size is manageable (approx. 3 TB).
Also, with indices of around that size, I can create shards of around 20-50 GB (I will do performance testing to settle the shard size).
And since the searches specify a date or date range, I can run a targeted search against just those indices instead of searching against huge indices.
But that would create around 3650 indices. Is that many indices recommended in Elasticsearch, or should I reduce the number to about 120 (10 years × 12 months) by creating one index per month instead of one per day?
But that way the size of each index would be huge (100 TB).
Any suggestions are welcome.
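To make the "targeted search" idea above concrete, here is a minimal sketch that expands a date range into the daily index names a search would target. The `mydata` prefix and the `prefix-YYYY.MM.DD` naming scheme are assumptions for illustration only; nothing in the post fixes a naming convention.

```python
from datetime import date, timedelta


def daily_indices(prefix, start, end):
    """Return one index name per day in [start, end], inclusive.

    The "<prefix>-YYYY.MM.DD" naming scheme is a hypothetical convention.
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    return [f"{prefix}-{(start + timedelta(days=i)):%Y.%m.%d}" for i in range(days)]


if __name__ == "__main__":
    names = daily_indices("mydata", date(2024, 1, 30), date(2024, 2, 2))
    # The comma-joined list can be used as the index part of a _search request path.
    print(",".join(names))
```

For long ranges the joined list gets unwieldy, so a wildcard such as `mydata-2024.01.*` is usually a more practical target than hundreds of explicit names.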
3650 indices will then have around 46 shards each.
The "problem" is much the same in the end: how do you hold 168,000 shards in your cluster?
There are many possible strategies.
If you have to keep all the data on hot (and expensive) nodes, then with a rule of thumb of 20 shards per GB of heap you will need around 8,400 GB of heap (or 16,800 GB if you want replication). With 30 GB of heap per data node, that will be 280 data nodes (or 560 nodes with replicas). Quite a lot...
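The arithmetic above can be reproduced with a short sketch. The defaults (20 shards per GB of heap, 30 GB of heap per data node) are the rule-of-thumb figures from this reply, not hard limits, and the function name is made up for illustration.

```python
import math


def estimate_cluster(indices, shards_per_index, replicas=0,
                     shards_per_heap_gb=20, heap_gb_per_node=30):
    """Estimate heap and data-node count from a shards-per-GB-of-heap rule of thumb."""
    primary_shards = indices * shards_per_index
    total_shards = primary_shards * (1 + replicas)  # each replica is a full extra copy
    heap_gb = total_shards / shards_per_heap_gb
    data_nodes = math.ceil(heap_gb / heap_gb_per_node)
    return {"total_shards": total_shards, "heap_gb": heap_gb, "data_nodes": data_nodes}


if __name__ == "__main__":
    print(estimate_cluster(3650, 46))              # primaries only
    print(estimate_cluster(3650, 46, replicas=1))  # one replica per primary
```

With 3650 indices of 46 shards each, this gives 167,900 primary shards (rounded to 168,000 above) and 280 data nodes, or 560 with one replica.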
If you don't have to keep all this data available immediately, I'd suggest looking at searchable snapshots. They are available with trial and enterprise licenses. I'd encourage you to speak with Elastic about that, so a solution architect can tell you how many nodes you would need with that strategy, but my guess is that it's much less than 560 data nodes!
Hi @dadoonet,
Thanks for responding. I do understand that eventually everything comes down to the number of shards, which would be the same in either case.
But we would see the difference when we perform a search: we can pass the index (or range of indices) to search against, and since a daily index contains fewer shards than a monthly index, the search response would be faster. That is my thought; I might be wrong here.
Note that with data streams (and ILM), you should not really have to think about it. Shards are "smart" enough that you don't need to plan for this.
Once you have reached the threshold of, let's say, 50gb per shard, the rollover policy will create a new backing index (with its new shards) for you.
All shards are "aware" of the first and last dates of the events they hold, so it's easy for Elasticsearch to skip them if they don't contain data for the requested range.
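As a rough illustration of the rollover threshold and date-range search described above, the sketch below builds the two request bodies as plain Python dictionaries. It assumes a policy with only a hot phase, a 50gb `max_primary_shard_size` rollover condition, and the `@timestamp` field that data streams use; the function names and example dates are hypothetical, and a real policy would likely add more phases.

```python
import json


def build_rollover_policy(max_primary_shard_size="50gb"):
    """Body for PUT _ilm/policy/<name>: roll over when a primary shard reaches the size."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_primary_shard_size": max_primary_shard_size}
                    }
                }
            }
        }
    }


def build_date_range_query(start, end, field="@timestamp"):
    """Body for a _search request restricted to [start, end] on a date field."""
    return {"query": {"range": {field: {"gte": start, "lte": end}}}}


if __name__ == "__main__":
    print(json.dumps(build_rollover_policy(), indent=2))
    print(json.dumps(build_date_range_query("2024-01-01", "2024-01-31"), indent=2))
```

The range query can be sent against the data stream name as a whole; backing indices whose date range does not overlap the query can be skipped by Elasticsearch, as described above.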
Hi @dadoonet, thanks for the suggestion. It's really helpful.
Also, if more than 1 TB of data is indexed per day, is it a good idea to have separate index nodes (node.data=true, so those nodes hold only the primary shards) and separate search nodes (node.data=false, so those nodes hold only the replica shards)?
That way searches would have affinity towards the "search nodes", which should be faster because indexing happens on the "index nodes".
Let me know if my understanding is correct.
At index time, primary and replica shards do largely the same amount of work, so what you are suggesting is not possible. You also cannot control where primary shards reside, as Elasticsearch can change this due to issues in the cluster or relocations.