We are trying to reindex one of our live indexes, which contains 9.6 million docs and is 48 GB in size.
Elasticsearch is running on a single machine with 36 CPU cores and 60 GB of RAM. I gave Elasticsearch a 30 GB heap (set in the jvm.options file).
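For reference, the heap settings in jvm.options look like this (just the two heap lines; the rest of the file is unchanged defaults):

```
-Xms30g
-Xmx30g
```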
We first tried this on a backup of the machine, and it worked perfectly: it took around 30 minutes to reindex all the data.
We then ran the same reindex on our production machine, and it started out great: it processed about 7 million documents in just under 30 minutes. But then it slowed down so much that a single batch took 10 minutes, and eventually Elasticsearch stopped returning results and the reindex appeared to stall. The task was still listed as active, though.
After cancelling the task I could see the new index had about 8 million docs, so it was pretty close; unfortunately it is not usable in that state.
Some more info:
- We are running ES 5.3.0
- The new index has refresh_interval set to -1 to speed up indexing.
- The reindex runs in batches of 5000 documents.
- The machine uses about 450% CPU (100% = 1 core) and about 9 GB of RAM during the reindex.
- I'm seeing a lot of these warnings:
"[2018-03-06T08:34:58,699][WARN ][o.e.m.j.JvmGcMonitorService] [BcXdUDQ] [gc] overhead, spent [1.1s] collecting in the last [1.7s]"
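To show exactly what we're sending, here is a sketch of the two request bodies described above (disabling refresh on the new index, then reindexing in batches of 5000). The index names `myindex` and `myindex_v2` are placeholders, not our real names:

```python
import json

# PUT /myindex_v2/_settings -- disable refresh on the new index
# ("myindex_v2" is a placeholder name)
settings_body = {"index": {"refresh_interval": "-1"}}

# POST /_reindex -- copy docs from the old index in batches of 5000
reindex_body = {
    "source": {"index": "myindex", "size": 5000},
    "dest": {"index": "myindex_v2"},
}

print(json.dumps(settings_body))
print(json.dumps(reindex_body))
```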
I suppose this GC overhead is the issue? Is there any way I can prevent these pauses and get the reindex to finish? The machine should be more than powerful enough to handle this, I think.