Hello all,
I'm following up on this comment on GitHub and the suggestion that followed it.
I have a cluster of 8 blades, each with 32 GB of RAM and Xeon processors.
I receive, from a Kafka topic, a high volume of documents that need to be inserted (or updated, if they already exist) into an index that is rotated weekly. I use a custom document ID for this purpose.
The documents are deduplicated over a 24-hour window, which means I receive at most one update per document per day.
The document rate is quite high: at the beginning of the week I can handle around 10K bulk updates per second. The cluster handles this load quite easily, even though I use a hot-warm architecture and dedicate only 3 blades to the hot weekly index.
Even though I use the bulk update API, this pattern effectively collapses into a bulk insert at the start of the week, because none of the documents exist in the index yet.
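For concreteness, my ingestion pattern looks roughly like this (a simplified sketch using the Python client; the index and field names are placeholders, not my actual ones):

```python
# Simplified sketch of the ingestion pattern: bulk update actions with
# doc_as_upsert, so a document is created if missing and partially
# updated otherwise. "weekly-index" and "custom_id" are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def bulk_upsert(docs, index="weekly-index"):
    actions = (
        {
            "_op_type": "update",      # update action instead of a plain index
            "_index": index,
            "_id": doc["custom_id"],   # custom document ID from the source data
            "doc": doc,
            "doc_as_upsert": True,     # insert when the document does not exist yet
        }
        for doc in docs
    )
    return helpers.bulk(es, actions)
```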
After exactly 24 hours, i.e. as soon as I start receiving new data for existing documents, the indexing rate drops to 400-500 operations per second, rendering the cluster unusable for this purpose. CPU usage skyrockets and iowait on the indexing blades goes very high.
I have tried changing the number of shards and the number of nodes dedicated to hot indexing, and I have tried forcemerging the index before inserting more documents, but nothing comes close to the performance I need.
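(The forcemerge step was a call along these lines, reusing the client from the sketch above; the max_num_segments value is just illustrative:)

```python
# Force-merge the hot index down to fewer segments before the next
# wave of updates arrives; max_num_segments=1 is an illustrative value.
es.indices.forcemerge(index="weekly-index", max_num_segments=1)
```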
I don't really know how to address this problem, which seems to have been an issue since ES 5.x.
How can I work around it?
I have read that I could try using index operations instead of updates, prepending a GET (or mget) for each document that I need to update.
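If I understand the suggestion correctly, it would look something like this (an untested sketch, reusing the client from above; the exact mget signature may differ between client versions):

```python
# Untested sketch of the "get then index" variant: fetch the existing
# versions with mget, merge in application code, then send plain index
# actions (which overwrite the document) instead of update actions.
def bulk_get_then_index(docs, index="weekly-index"):
    ids = [doc["custom_id"] for doc in docs]
    # mget returns a "docs" list with found=True/False for each requested ID
    existing = {
        d["_id"]: d["_source"]
        for d in es.mget(index=index, body={"ids": ids})["docs"]
        if d.get("found")
    }
    actions = (
        {
            "_op_type": "index",   # full overwrite instead of a partial update
            "_index": index,
            "_id": doc["custom_id"],
            # merge the old source with the new fields in application code
            "_source": {**existing.get(doc["custom_id"], {}), **doc},
        }
        for doc in docs
    )
    return helpers.bulk(es, actions)
```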
Do any of you experts have any suggestions?
Thanks for your time!