We run a very simple Docker-based single-node setup of Elastic 5.4.x to aggregate some logfiles. We have the luxury problem (which probably a lot of people have at some point) that Elastic works out of the box without any problems and without the need to configure anything early on. However, this means we have a real problem now: we have gathered one and a half years of logfiles (at the moment slightly above one billion records) in:
daily indices
5 shards each
You can imagine the consequences
So what did I do: I read about the reindex API and started reindexing all the older daily indices into monthly ones. I started with 2017-MM-* as a whole, but the number of documents was too high and I ran into queue errors. So instead I wrote a script which does the reindexing for each day of a month (pretty much YYYY-MM-01 .. 29/30/31 -> YYYY-MM). It wasn't really fast, but it worked.
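For reference, the loop was roughly like the sketch below (Python just for illustration; the `logs-` prefix and the exact index naming are placeholders, not our real names):

```python
# Rough sketch of the per-day reindex loop, not the exact script.
# Assumptions: ES 5.4 on localhost:9200, daily indices named like
# "logs-2017-01-01", monthly targets like "logs-2017-01".
import calendar
import requests

ES = "http://localhost:9200"

def reindex_month(year, month):
    days = calendar.monthrange(year, month)[1]
    dest = "logs-{:04d}-{:02d}".format(year, month)
    for day in range(1, days + 1):
        source = "logs-{:04d}-{:02d}-{:02d}".format(year, month, day)
        # skip days that have no index (gaps in the logs)
        if requests.head(ES + "/" + source).status_code == 404:
            continue
        body = {"source": {"index": source}, "dest": {"index": dest}}
        # wait_for_completion=true keeps only one reindex running at a time,
        # which is gentler on the bulk queue of a single node
        r = requests.post(ES + "/_reindex", json=body,
                          params={"wait_for_completion": "true"})
        r.raise_for_status()
        print(source, "->", dest, "docs:", r.json().get("total"))

reindex_month(2017, 1)
```

Running the days sequentially like this keeps only one reindex in flight, which is easier on a single node than reindexing a whole month at once.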
At least for 2017. But when I arrived at 2018-01 I ran into a lot of trouble with garbage collection. We have a server metric for this: it literally never triggered for the 2017 records (about 500 million records, 1,140 shards). It is a real pain for 2018 (also about 500 million records and ~1,000 shards at the moment). Nothing really changed besides increasing the number of cores (from 6 to 18, now reduced to 12 again). Reindexing becomes slower and slower, to the point where it took about an hour to complete 3 million records.
What is causing this trouble? I tried two things:
sleeping for some minutes between the days
setting the refresh_interval to -1, i.e. disabling refresh entirely (see the sketch after this list)
Neither helped.
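For the second point, this is a sketch of what the settings call looks like (the index name is a placeholder; restoring the default afterwards is included for completeness):

```python
# Sketch: disable refresh on the monthly target index while reindexing into it,
# then restore the default. The index name is a placeholder.
import requests

ES = "http://localhost:9200"
index = "logs-2018-01"

# refresh_interval: -1 turns periodic refreshes off entirely
requests.put(ES + "/" + index + "/_settings",
             json={"index": {"refresh_interval": "-1"}}).raise_for_status()

# ... run the reindexing for that month here ...

# back to the default of 1s, plus one explicit refresh to make the docs searchable
requests.put(ES + "/" + index + "/_settings",
             json={"index": {"refresh_interval": "1s"}}).raise_for_status()
requests.post(ES + "/" + index + "/_refresh").raise_for_status()
```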
The hardware is a virtual machine with 12 cores at the moment (it was 6 when everything still worked, but that resulted in very heavy load, so I increased the number of cores) and 20 GB RAM.
Yeah, I know we have too many shards; that's the reason for the reindexing.
But as I wrote: I already reduced the number of shards from more than 2,000 to below 1,000 (and it will be even lower once most of 2018 is finally reindexed into monthly indices).
What bugs me is the fact that our server could reindex 2017 (when there were more than 2,000 shards) without any trouble but can't do it right now with fewer than 1,000.
I can't tell, but 1,000 is still a lot for a single node IMO.
GC cycles are probably a good indicator of that. Each shard is a Lucene instance and consumes resources.
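To see how bad it actually is, the node stats API exposes heap usage and GC counts; a minimal check (assuming the node is reachable on localhost:9200) could look like this:

```python
# Quick look at heap pressure and GC activity on the node (localhost assumed)
import requests

stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()
for node in stats["nodes"].values():
    jvm = node["jvm"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print(node["name"],
          "heap used:", jvm["mem"]["heap_used_percent"], "%,",
          "old GC:", old_gc["collection_count"], "collections /",
          old_gc["collection_time_in_millis"], "ms total")
```

As a rule of thumb, heap usage constantly around 75% or higher with steadily climbing old-GC time means the node simply doesn't have enough heap for the number of shards it holds.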
Do you need to keep all those old indices around?
What is the typical size of a single shard?
What is the maximum number of shards you would run on a simple single-node setup?
Our plan is to keep the old indices for about a year. We need them that long.
We will implement cron-based aggregation (reindexing of daily indices into monthly ones) for the previous month, maybe on the 10th of each month for safety (so we still have daily indices for some days back), plus deletion of very old monthly indices (older than 12 months).
Keeping the default of 5 shards/index, this would be around 55 shards for the monthly indices of one year, plus the daily indices for the current month and the few days kept from the previous one, so probably (~40*5 = 200) ~255 shards in total.
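A rough sketch of that monthly job (the `logs-` prefix and naming are placeholders again, and `reindex_month()` is the per-day loop sketched earlier in the thread):

```python
# Sketch of the monthly maintenance job (run via cron, e.g. on the 10th).
# Index names with the "logs-" prefix are placeholders for our real naming.
import datetime
import requests

ES = "http://localhost:9200"

def doc_count(pattern):
    r = requests.get(ES + "/" + pattern + "/_count")
    return r.json()["count"] if r.status_code == 200 else 0

today = datetime.date.today()
first = today.replace(day=1)
prev = first - datetime.timedelta(days=1)   # any day of the previous month

# 1) roll the previous month's daily indices up into one monthly index
reindex_month(prev.year, prev.month)

# 2) only delete the daily indices once the monthly index holds the same docs
daily_pattern = "logs-{:04d}-{:02d}-*".format(prev.year, prev.month)
monthly = "logs-{:04d}-{:02d}".format(prev.year, prev.month)
if doc_count(monthly) >= doc_count(daily_pattern):
    requests.delete(ES + "/" + daily_pattern)

# 3) drop the monthly index that just fell outside the 12-month retention window
m = first.year * 12 + (first.month - 1) - 13   # 13 months back
year, month = divmod(m, 12)
old = "logs-{:04d}-{:02d}".format(year, month + 1)
if requests.head(ES + "/" + old).status_code == 200:
    requests.delete(ES + "/" + old)
```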
How can I find out the typical size of a shard? We currently have ~500 GB of data (on the filesystem). I already know that 5 shards per index is overkill for this amount of data, but if our server can handle it without trouble I would like to keep it this way.
This and all those questions are covered in the link I provided earlier.
Anyway, a typical shard size for daily indices like logs is between 25 GB and 50 GB, depending on your testing.
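For the shard-size question specifically, the _cat/shards API lists the on-disk size of every shard; a quick way to see the biggest ones (localhost assumed):

```python
# List every shard with its on-disk size, largest first (localhost assumed)
import requests

resp = requests.get("http://localhost:9200/_cat/shards",
                    params={"v": "true",
                            "h": "index,shard,prirep,store",
                            "s": "store:desc",
                            "bytes": "gb"})
print(resp.text)
```

The `store` column is the size per shard; with ~500 GB spread over ~1,000 shards the average will be far below the 25-50 GB range mentioned above, which is another sign there are too many shards.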