Shard Sizing question

Hello Group,
I am currently sizing my production cluster and have some questions. Please help me out with some pointers. Based on my research, a shard should not exceed 50 GB to perform optimally. Below is my scenario:

  1. I have 5 nodes, each with a 100 GB SSD and 16 GB RAM
  2. We will have about 450 GB of logs to process each month
  3. We don't want to store these documents for long, as the raw files will be available in cold storage and can be re-indexed ad hoc if we ever need them
    Based on these criteria, I am thinking of the following:
  4. Create an index with 10 shards and 1 replica -- this means I need a minimum of about 1 TB of storage (450 * 2), correct?
  5. Allow 8 GB RAM for heap on each node, making the total heap size 40 GB -- is this good enough, or will it create problems for GC?
  6. Create an ILM policy to roll over after 1 month and delete the old index

Please let me know if these are good enough to begin with, or whether there are other things I need to consider.
thanks
rags

Given that you have relatively little disk per node and a reasonably short retention period, I would recommend using daily indices, as that allows you to keep a rolling window of X days sized to fit the storage you have. As long as you are using a single daily index, you can probably set up 5 primary shards to get an even write distribution, even though 1 would likely be sufficient.

If you have a replica, data will be stored twice for high availability, so you may need more disk space and/or a shorter retention period. Make sure you follow these guidelines.
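To put rough numbers on this, here is a minimal Python sketch of the disk arithmetic. The node count, disk size, monthly volume and replica count come from the thread; the 30-day month, the assumption that the indexed size roughly equals the raw log size, and the 85% usable-disk fraction are my own assumptions, and `retention_days` is just a hypothetical helper name.

```python
import math

# Figures from the thread
NODES = 5
DISK_PER_NODE_GB = 100
MONTHLY_LOGS_GB = 450
REPLICAS = 1

# Assumptions (not from the thread): 30-day month, on-disk index size roughly
# equal to raw log size, and only 85% of each disk treated as usable.
DAYS_PER_MONTH = 30
USABLE_FRACTION = 0.85


def retention_days(nodes, disk_per_node_gb, monthly_gb, replicas,
                   days_per_month=DAYS_PER_MONTH, usable=USABLE_FRACTION):
    """Return how many whole daily indices fit in the usable cluster disk."""
    usable_gb = nodes * disk_per_node_gb * usable
    # Each replica stores one more full copy of every day's data.
    daily_gb = monthly_gb / days_per_month * (1 + replicas)
    return math.floor(usable_gb / daily_gb)


monthly_with_replica_gb = MONTHLY_LOGS_GB * (1 + REPLICAS)
total_disk_gb = NODES * DISK_PER_NODE_GB

print(f"One month with replica: {monthly_with_replica_gb} GB")
print(f"Raw cluster disk:       {total_disk_gb} GB")
print(f"Daily indices that fit: "
      f"{retention_days(NODES, DISK_PER_NODE_GB, MONTHLY_LOGS_GB, REPLICAS)}")
```

Under these assumptions a full month with one replica does not fit on the cluster, which is why a shorter rolling window of daily indices is the easier thing to tune.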

I do not understand this. Can you please explain?

That is the recommended best practice assuming nothing else is running on these nodes.

As mentioned, use daily indices instead, as that gives you better flexibility. Note that data is removed by deleting complete indices, so if you used monthly indices you would remove a month's worth of data at a time.
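As a rough illustration, a daily rollover with deletion after a fixed number of days could be expressed as an ILM policy body like the one built below. This is only a sketch: the 14-day retention is an illustrative value, `build_ilm_policy` is a hypothetical helper, and the policy name you would use with `PUT _ilm/policy/<name>` is up to you.

```python
import json


def build_ilm_policy(retention_days, max_age="1d"):
    """Build an ILM policy body: roll over daily, delete after retention_days."""
    return {
        "policy": {
            "phases": {
                # Hot phase: start a new index once the current one is max_age old.
                "hot": {
                    "actions": {"rollover": {"max_age": max_age}}
                },
                # Delete phase: remove the whole index once it is old enough.
                # After a rollover, min_age is counted from the rollover time.
                "delete": {
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


# 14 days is an assumed value; size it to the disk you actually have.
policy = build_ilm_policy(retention_days=14)
print(json.dumps(policy, indent=2))
```

Because deletion happens one index at a time, the retention window here shrinks or grows in one-day steps rather than a month at a time.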

Thanks Christian for the info. What I meant by 450*2 was that if I use one replica and 10 primary shards, that's a total of 20 shards, correct? I use 10 shards to keep the shard size around 40 GB, as my total is 450 GB per month, and adding a replica for each primary means that I need double the 450 GB, which is close to 1 TB!
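For reference, the shard arithmetic can be sketched in a few lines of Python. This assumes the indexed size is roughly the raw log size and a 30-day month (both assumptions); `shard_layout` is a hypothetical helper. It compares the monthly 10-shard layout with a daily 5-shard layout.

```python
def shard_layout(index_size_gb, primaries, replicas):
    """Return (total shards, size of one primary shard in GB, total disk in GB)."""
    total_shards = primaries * (1 + replicas)
    primary_shard_gb = index_size_gb / primaries
    total_gb = index_size_gb * (1 + replicas)
    return total_shards, primary_shard_gb, total_gb


# Monthly index: 450 GB, 10 primaries, 1 replica (figures from the thread)
monthly = shard_layout(450, 10, 1)

# Daily index: 450 GB spread over an assumed 30 days, 5 primaries, 1 replica
daily = shard_layout(450 / 30, 5, 1)

print("monthly:", monthly)
print("daily:  ", daily)
```

With these inputs the monthly layout gives 20 shards of 45 GB each and 900 GB on disk, while a daily index gives 10 shards of about 3 GB each and 30 GB on disk per day.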

Also, the reason I initially planned for one large index was to keep the number of indices low, as I read somewhere on the blog that a large number of indices (and associated shards) can affect performance during queries, merges, and other behind-the-scenes Lucene processes. Let me know if I shouldn't worry too much about the number of indices, given my short retention timeframe.
Also, to provide more info about these logs: they are harvested from IoT devices and have a specific mapping in the index.
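To get a feel for how many shards each layout keeps open, here is a small Python sketch. The 14-day retention is an assumed value, not something settled in the thread, and `open_shards` is a hypothetical helper.

```python
def open_shards(indices_retained, primaries, replicas, nodes):
    """Return (total shards in the cluster, average shards per node)."""
    total = indices_retained * primaries * (1 + replicas)
    return total, total / nodes


NODES = 5  # from the thread

# One monthly index with 10 primaries and 1 replica
print("monthly, 10 primaries:", open_shards(1, 10, 1, NODES))

# Assumed 14 daily indices with 5 primaries and 1 replica
print("daily, 5 primaries:   ", open_shards(14, 5, 1, NODES))

# Assumed 14 daily indices with a single primary and 1 replica
print("daily, 1 primary:     ", open_shards(14, 1, 1, NODES))
```

Under these assumptions the daily layouts hold a few dozen to around a hundred and forty small shards across the cluster, versus twenty large ones for the monthly index.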
