We have a small ES cluster which we upgraded from 1.7.0 to 2.3.1 a few days ago, and now the system can't keep up with the bulk inserts we throw at it. I know there are similar threads in this forum, but nothing suggested in them has helped.
An idea of what we're doing:
- 2,000,000,000 documents in the cluster
- about 200,000,000 inserts per day, into one index per day (~400 GB, 18 shards, 1 replica; see the template sketch after this list)
- Google Compute Engine with remote spinning disks (!)
- only bulk inserts
- very low query volume
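For reference, the per-day index setup boils down to something like this. This is a minimal sketch using the Python client; the host, template name, and index pattern are placeholders, not our real names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node-1:9200"])  # placeholder host

# Template applied to each day's index (e.g. logs-2016.04.20).
es.indices.put_template(
    name="daily-logs",         # placeholder template name
    body={
        "template": "logs-*",  # placeholder index pattern
        "settings": {
            "number_of_shards": 18,
            "number_of_replicas": 1,
        },
    },
)
```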
To our surprise, this setup held up fine on 1.7, even with the remote spinning disks in a cloud setup, and we have run it for about two years now (adding machines as needed).
Since the upgrade to 2.3.1 the cluster can't keep up with the bulk inserts. I've tried playing with the bulk size, and that didn't make a difference.
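Roughly what each bulk client does, stripped down to a sketch (the hosts, index name, document type, and document shape are placeholders; `chunk_size` is the knob I varied between a few hundred and a few thousand):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# The client round-robins requests across the listed nodes.
es = Elasticsearch(["es-node-1:9200", "es-node-2:9200", "es-node-3:9200"])

def actions(docs, index):
    for doc in docs:
        yield {"_index": index, "_type": "event", "_source": doc}  # placeholder type

docs = ({"field": i} for i in range(100000))  # dummy documents
bulk(es, actions(docs, "logs-2016.04.20"), chunk_size=1000)
```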
- adding new machines doesn't really help. Looking at the load graphs, I see that it's usually a single machine sitting at 90% for an hour or so while the others run lower (~60%); the hot spot then moves to another machine, seemingly at random (see the node-stats sketch after this list).
- /_stats/store is really slow: it takes between 15 and 60 seconds to respond (Kopf polls this endpoint; see the timing sketch after this list).
- iostat, vmstat and similar tools don't show anything obvious. Disk I/O is well below what we're allowed, and iowait ('wa') is mostly zero.
- there are a dozen bulk clients, all inserting into all machines at the same time, so it isn't the clients overloading a single machine
- error logs are empty
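A quick poller like the one below makes the rotating hot node easy to see. The host is a placeholder, and the `os` field names match the 2.x node-stats format:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node-1:9200"])  # placeholder host

while True:
    stats = es.nodes.stats(metric="os")
    for node in stats["nodes"].values():
        os_stats = node.get("os", {})
        print(node["name"],
              "cpu%:", os_stats.get("cpu_percent"),
              "load:", os_stats.get("load_average"))
    print("---")
    time.sleep(60)
```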
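And the slow store stats can be reproduced with a plain timed call (placeholder host), equivalent to hitting /_stats/store directly:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node-1:9200"], timeout=120)  # placeholder host

start = time.time()
es.indices.stats(metric="store")  # same endpoint Kopf polls
print("store stats took %.1fs" % (time.time() - start))
```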
Does anyone have an idea of what we can try? I've been going through this forum, the GitHub issues, and the docs for the last few days, and nothing has made any improvement.
(edits: fixed docs per day estimate)