I need help with the production scenario below.
Note: the cluster runs on AWS Elasticsearch Service with 2 nodes.
The indices are hourly (24 per day), each with 5 primary shards and 1 replica, running on 2 nodes with 4GB of heap space on each node.
The existing template defines some fields with the Text datatype.
- Change the index rotation period to OneDay instead of OneHour
- Change to 2 primary shards and 1 replica per index instead of 5:1
- Change the datatype of a few fields from Text to Keyword
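The shard and datatype changes above can be captured in a single index template. A minimal sketch, assuming a legacy `_template` endpoint (newer clusters use `_index_template` instead, and pre-7.x clusters also expect a mapping type name); `logs-*`, `user_id`, and `status` are placeholder names for the actual index pattern and the Text fields being switched to Keyword:

```json
PUT _template/logs_daily
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "status":  { "type": "keyword" }
    }
  }
}
```

The template only applies at index creation time, so existing hourly indices keep their old mappings and shard counts until their data is reindexed into new indices.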
As per my understanding this approach should address the issue, but I need some expert advice on it:
- Modify the rotation period to OneDay in the Kinesis Firehose delivery stream's Elasticsearch destination configuration
- Create a new template with the updated datatypes, which will apply to new indices
- Restore a snapshot to back up the old indices into new ones (there are approx. 20 lacs docs)
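The restore step can be sketched with the snapshot restore API; the repository name, snapshot name, and index patterns below are placeholders:

```json
POST _snapshot/my_repo/snapshot_1/_restore
{
  "indices": "logs-2019-01-01-*",
  "rename_pattern": "logs-(.+)",
  "rename_replacement": "restored-logs-$1"
}
```

Note that a restore recreates each index with its original mappings and shard counts, which is exactly why the datatype conflict described below arises: the restored indices will not pick up the new template.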
Now my question: if I restore the snapshot into new indices created just to accommodate the data from the old hourly indices, there will be a data conflict, since the old datatypes will not be in line with the new indices' mappings.
In that case, what should I do? Do I need to reindex the data before I restore, and is that a feasible option for 20 lacs of documents?
That seems excessive. How large are these indices?
You will need to reindex to move from hourly to daily.
I'm not sure what 20lacs is.
How large are the indices currently? How long do you intend to keep the data in the cluster? If your retention period is relatively long it is possible that even 2 primary shards per day is excessive, but please read this blog post for some practical guidance.
Yes, I mentioned that in my approach: I am changing the hourly indices to daily ones. But when you say reindex, I cannot reshard the cluster while reindexing, right?
As I mentioned, I will be creating a new template to modify a few of the datatypes, and that requires reindexing. What I am not confident about is whether reindexing that many documents (approx. 20 lacs) across so many indices might hamper the cluster state. Won't it?
Can you suggest something for this?
Is 20 lacs equivalent to 2,000,000 documents? If so, that does not sound like a lot of data, so you might be able to use weekly or maybe even monthly indices rather than daily, at least as long as your retention period is longer than that. I do not see any problem reindexing that amount of data.
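For the reindex itself, a minimal sketch (index names are placeholders; the daily destination index should be created after the new template is in place, so it picks up the Keyword mappings and the 2:1 shard settings):

```json
POST _reindex?wait_for_completion=false
{
  "source": { "index": "logs-2019-01-01-*" },
  "dest":   { "index": "logs-2019-01-01" }
}
```

With `wait_for_completion=false` the reindex runs as a background task that can be monitored via the task management API, which is a reasonable approach for a few million documents.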
Yes, the data is not that big now, but it will grow to 30-40GB within a few weeks as the number of users ramps up.
I have calculated the data growth: it will be 30-40GB of data per index if I change to daily indices.
With 1 replica that will be around 40 + 40 = 80GB of data stored every day. That's the reason I chose 2 primary shards per index.
My retention period will be 6 months. The index size is small at the moment, but I have calculated the data growth, and after a few weeks each daily index will hold around 30-40GB of data across 2 primary shards.
So each primary shard will hold roughly 15-20GB of data, which is acceptable according to Elasticsearch guidance.
I read the blog but could not find anything better than this. Can you suggest something more practical given the data growth?
Also, my current problem is reindexing the existing data to change the datatype of a few fields; before I restore, I need to be in a position to avoid the data conflict issue.
I hope I was able to explain my problem. Let me know if you need more information.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.