Hi @Srinath_C
> We did reduce the bulk size; we brought it down to a max of 5MB as per suggestions on this list, and also found it contributing to the overall stability, especially in reducing the number of bulk request rejections.
OK, I'd perhaps reduce each bulk request size further. Just for some background: you send a bulk request to the coordinating node, which splits it into a bulk request for each involved shard. Each shard-level bulk is executed serially.
If, instead, you send more, smaller bulks, you end up with more shard-level bulk requests, each one doing its job serially, but running in parallel with the others. Each bulk request also takes up less memory (because it's smaller), both for the request and for the response.
If you have a high bulk queue size, you're just using up lots of memory to queue up bulks, when you should really handle that application-side. If you're getting bulk rejections, it means that ES isn't keeping up, and you should back off in your application and retry later.
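A minimal sketch of that pattern with the official Python client (not from your setup; the index name, chunk size, and back-off numbers are all illustrative assumptions):

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

def index_in_small_bulks(docs, index, chunk_size=500, max_retries=5):
    """Send many small bulks; on rejection (HTTP 429), back off and retry."""
    for start in range(0, len(docs), chunk_size):
        pending = docs[start:start + chunk_size]
        delay = 1
        for _ in range(max_retries):
            # Bulk body: one action line followed by one source line per doc.
            body = []
            for doc in pending:
                body.append({"index": {}})
                body.append(doc)
            # doc_type is needed on the 1.x-era clients
            resp = es.bulk(index=index, doc_type="doc", body=body)
            if not resp["errors"]:
                break
            # Keep only the docs whose shard-level bulk was rejected.
            pending = [doc for item, doc in zip(resp["items"], pending)
                       if item["index"].get("status") == 429]
            if not pending:
                break  # other failures need real error handling, not retries
            time.sleep(delay)
            delay *= 2  # exponential back-off: give ES time to catch up
```

The bundled `elasticsearch.helpers.bulk`/`streaming_bulk` helpers can also do the chunking for you via their `chunk_size` parameter.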
Reducing the index buffer size from 50% to 20% also frees up 30% of your heap space, which means that ES has more room to handle memory spikes from merging/querying/whatever.
> I guess in the case of SSDs it makes sense to retain:
Definitely set `indices.store.throttle.type: none`.
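For reference, here's what both of those settings would look like in `elasticsearch.yml` (a sketch; the setting names are the standard ones, the values are the numbers discussed above):

```yaml
# elasticsearch.yml -- illustrative values from the discussion above;
# index_buffer_size is static (needs a node restart), while the
# throttle type can also be changed live via the cluster settings API.
indices.memory.index_buffer_size: 20%
indices.store.throttle.type: none
```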
I would, however, delete all the merge settings. These are expert settings and are not easy to reason about; I also don't think they're the source of your problem. Setting `index.merge.policy.max_merged_segment` to `2gb` is just going to result in more segments, which will slow down search. If merges are not keeping up, then Elasticsearch will throttle indexing down to one thread, which should cause your application to back off and allow things to catch up.
Honestly, just sorting out your memory issues will cause a massive improvement because the JVM won't be spending lots of time trying to find a few extra bytes.
You may also want to look at changing the Linux I/O scheduler from CFQ to `noop`. While CFQ is supposed to do the right thing, SSDs are not always detected correctly by the OS, and quite often you can get better SSD throughput with the `noop` scheduler.
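The change is per block device and takes effect immediately; a sketch, assuming the data disk is `sda`:

```sh
# The active scheduler is shown in brackets; "sda" is an assumption,
# substitute your data disk.
cat /sys/block/sda/queue/scheduler
echo noop | sudo tee /sys/block/sda/queue/scheduler
```

Note this doesn't survive a reboot; adding `elevator=noop` to the kernel boot parameters makes it persistent.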
> We haven't actually experimented with using an index for every day,
Definitely worth doing. Expiring documents means marking them as deleted, then doing a merge to remove the deleted documents. It's probably not as bad as it could be, because you're likely dropping whole segments, but an index per day will definitely be more efficient.
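The pattern is simple enough to sketch (the `logs-YYYY.MM.DD` naming, the 30-day retention, and the client are all illustrative assumptions):

```python
from datetime import datetime, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])
today = datetime.utcnow()

# Write to a dated index instead of one big index.
es.index(index="logs-" + today.strftime("%Y.%m.%d"),
         doc_type="doc",
         body={"msg": "hello", "ts": today.isoformat()})

# Expire old data by dropping the whole index: a cheap operation that
# just removes files on disk, with no per-document deletes and no merge.
# Assumes this runs once a day; ignore 404 if the index is already gone.
cutoff = today - timedelta(days=30)
es.indices.delete(index="logs-" + cutoff.strftime("%Y.%m.%d"),
                  ignore=404)
```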