Hi Elasticsearch team, I have an application that generates ~1 million documents per second, but most of them are duplicates. The total number of unique documents is below 50 million. We have to use our own ID as the doc ID (currently a unique uint64).
Currently, I'm using BulkProcessor against a cluster of 3 nodes (1 master, 2 data nodes). Each node has ~64 GB of RAM and 20 cores.
I observe that when the index is empty, BulkProcessor does well, but as the index fills up, insert performance degrades very quickly, perhaps due to the cost of looking up duplicate IDs. So I just want to confirm:
Is the performance degradation caused by ID lookups?
If yes, is there any strategy to get high performance in my case?
Then you will be updating the same document frequently, potentially multiple times per bulk request. Whenever Elasticsearch identifies that a document to be updated is still in the transaction log, it will trigger a refresh, which is an expensive operation. This will reduce indexing throughput dramatically and cause a lot of disk I/O.
Elasticsearch is not, and as far as I know cannot be, optimised for this use case, so I would recommend removing as many duplicates as possible before indexing into Elasticsearch.
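As a minimal sketch of what removing duplicates before indexing could look like, the helper below collapses a batch so that each ID appears only once before it is handed to BulkProcessor. The class and method names (`BatchDeduplicator`, `Doc`, `dedupe`) are hypothetical and not part of any Elasticsearch client, and it assumes the latest version of a document should win. The uint64 ID is carried in a Java `long`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Elasticsearch client.
final class BatchDeduplicator {

    // Minimal document holder: the application-assigned ID plus the JSON source.
    static final class Doc {
        final long id;      // uint64 ID stored in a signed long (same 64 bits)
        final String json;

        Doc(long id, String json) {
            this.id = id;
            this.json = json;
        }
    }

    // Collapses a batch so that each ID appears once.
    // Assumption: when an ID repeats, the most recent document wins.
    static List<Doc> dedupe(List<Doc> batch) {
        Map<Long, Doc> latestById = new LinkedHashMap<>();
        for (Doc doc : batch) {
            latestById.put(doc.id, doc); // a later duplicate overwrites the earlier one
        }
        return new ArrayList<>(latestById.values());
    }
}
```

This only removes duplicates inside a single batch, so the same ID can still arrive again in a later bulk request; it reduces the number of repeated updates rather than eliminating them.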
In scenario 1, bulk processing is very stable and fast. In scenario 2, it's fast for the first few minutes and extremely slow after that.
Actually, we're implementing a kind of "already indexed" cache, but managing the cache along with TTLs (our data has a TTL) is quite annoying, especially in a distributed environment.
Also, there are still moments when the cache and Elasticsearch are not aligned with each other, so some already-indexed data will still be reindexed. So I just want to ask whether there are better settings for this case.
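For reference, an "already indexed" cache of the kind described above could look roughly like the sketch below. It is a per-process, in-memory illustration only: the class name `RecentlyIndexedCache`, its methods, and the expiry policy are all assumptions, and it does not address the distributed case. The current time is passed in explicitly so the behaviour is easy to check.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory cache of recently indexed IDs; names and policy are assumptions.
final class RecentlyIndexedCache {
    // Maps document ID -> time (in millis) at which the cache entry expires.
    private final ConcurrentHashMap<Long, Long> expiryById = new ConcurrentHashMap<>();
    private final long ttlMillis;

    RecentlyIndexedCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns true if the ID is unknown or its entry has expired, and records it
    // as indexed in the same atomic step. Returns false if it is still cached.
    boolean markIfAbsent(long id, long nowMillis) {
        boolean[] claimed = {false};
        expiryById.compute(id, (key, expiry) -> {
            if (expiry == null || expiry <= nowMillis) {
                claimed[0] = true;
                return nowMillis + ttlMillis; // start a fresh TTL window
            }
            return expiry; // still valid, keep the existing expiry
        });
        return claimed[0];
    }

    // Drops expired entries; intended to be called periodically.
    void evictExpired(long nowMillis) {
        expiryById.values().removeIf(expiry -> expiry <= nowMillis);
    }

    int size() {
        return expiryById.size();
    }
}
```

A caller would send a document to BulkProcessor only when `markIfAbsent` returns true. If the cache TTL is kept shorter than the data TTL, a cache miss only costs a redundant index operation with the same ID, which is the misalignment described above rather than a correctness problem.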
Adding some more information: we're using Elasticsearch to manage the tags of a time series database. As you know, time series data arrives at a very fast pace, but the metadata itself does not change much.