Hi Elasticsearch team, I have an application that generates ~1 million documents per second, but most of them are duplicates. The total number of unique documents is below 50 million. We have to use our own ID as the document ID (currently a unique uint64).
Currently, I'm using BulkProcessor against a cluster of 3 nodes (1 master, 2 data nodes). Each node has ~64 GB of RAM and 20 cores.
I observe that when the index is empty, BulkProcessor performs well, but as the index fills up, insert performance degrades very quickly, perhaps due to the cost of looking up duplicate IDs. So I just want to confirm:
- Is the ID lookup the reason for the performance degradation?
- If yes, is there any strategy to achieve high performance in my case?
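
To make the second question concrete, here is a minimal sketch of one strategy that might apply: dropping already-seen IDs on the client before they are handed to BulkProcessor. It assumes that a repeated ID always carries an identical document, so a repeat never needs to be re-indexed. `SeenIdFilter` and its methods are hypothetical names, not part of any Elasticsearch API, and the uint64 IDs are assumed to be stored bit-for-bit in a Java `long`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: remembers which document IDs were already sent. */
final class SeenIdFilter {
    // Boxed Long keys are used for simplicity; this is an assumption of the
    // sketch, not a recommendation for 50 million entries.
    private final Set<Long> seen = new HashSet<>();

    /** Returns true only the first time a given ID is offered. */
    boolean firstTime(long id) {
        return seen.add(id);
    }

    /** Keeps the first occurrence of each ID, preserving input order. */
    List<Long> filterNew(long[] ids) {
        List<Long> fresh = new ArrayList<>();
        for (long id : ids) {
            if (firstTime(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }
}
```

Only the IDs returned by `filterNew` would then be turned into index requests. With ~50 million unique IDs, a `HashSet<Long>` would likely need several GB of heap, so a primitive-long set or a probabilistic filter may be a better fit; I have not measured either.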