Elasticsearch index performance with mostly duplicate documents

huydx · October 25, 2018, 5:51am

Hi Elasticsearch team, I have an application that generates ~1 million documents per second, but most of the documents are identical. The total number of unique documents is below 50 million. We have to use our own ID as the document ID (currently a unique uint64).

Currently, I'm using BulkProcessor against a cluster of 3 nodes (1 master, 2 data nodes). Each node has ~64 GB of RAM and 20 cores.

I observe that when the index is empty, BulkProcessor performs well, but as the index fills up, insert performance degrades very quickly, perhaps due to the cost of looking up duplicate IDs. So I just want to confirm:

Is the performance degradation caused by ID lookups?
If yes, is there any strategy to get high performance in my case?

Christian_Dahlqvist · October 25, 2018, 6:03am

Which version of Elasticsearch are you using? Are you using custom document IDs in order to avoid duplicates?

huydx · October 25, 2018, 6:07am

We're using the latest version, 6.4.2. And yes, we set the document ID manually to avoid duplicates.

Christian_Dahlqvist · October 25, 2018, 6:11am

Then you will be updating the same document frequently, potentially multiple times per bulk request. Whenever Elasticsearch identifies that a document to be updated is still in the transaction log, it will trigger a refresh, which is an expensive operation. This will reduce indexing throughput dramatically and cause a lot of disk I/O.

Elasticsearch is not, and can as far as I know not be, optimised for this use case, so I would recommend trying to remove as many duplicates as possible before indexing into Elasticsearch.

huydx · October 25, 2018, 7:16am

@Christian_Dahlqvist Thanks for the detailed information. Actually, regarding

"Then you will be updating the same document frequently"

I set the op type to CREATE, which AFAIK means the request is rejected if the document already exists:

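// Build an index request that uses our own ID; CREATE is meant to reject IDs that already exist.
final IndexRequestBuilder idb = connection
        .index(index, MY_INDEX)
        .setId(id)
        .setSource(source)
        .setOpType(OpType.CREATE);
// Queue the request so that BulkProcessor can batch it with others.
bulkProcessor.add(idb.request());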

So even when using CREATE, does it still trigger something expensive?

Christian_Dahlqvist · October 25, 2018, 8:02am

I do not know. Have you tested whether it makes a difference?

I still suspect comparing against an in-memory map and dropping duplicates before indexing would be much faster and more efficient.
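
As a rough illustration of the in-memory check suggested above, here is a minimal Java sketch of a bounded set of recently seen IDs. It is not code from this thread: the class name `SeenIdFilter`, the method `firstSeen`, and the `maxEntries` limit are all hypothetical, and the least-recently-used eviction is an assumption about what a reasonable bound might look like.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: a bounded, LRU-evicting set of recently seen document IDs. */
final class SeenIdFilter {
    private final Map<Long, Boolean> seen;

    SeenIdFilter(final int maxEntries) {
        // accessOrder = true keeps entries in least-recently-used order,
        // so removeEldestEntry evicts the ID that was touched longest ago.
        this.seen = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns true the first time an ID is offered, false while it is still remembered. */
    synchronized boolean firstSeen(long id) {
        // put returns null only when the key was not present before.
        return seen.put(id, Boolean.TRUE) == null;
    }
}
```

With a filter like this, a request would only be handed to the bulk processor when `firstSeen(id)` returns true. An evicted ID will be sent again, so the filter reduces duplicates rather than eliminating them.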

huydx · October 25, 2018, 8:20am

Yes, I just tested 2 scenarios:

1. Indexing from the start into an empty index
2. Indexing into a nearly full index

In scenario 1, bulk processing is very stable and fast. In scenario 2, it's fast for the first few minutes and extremely slow afterwards.

Actually, we're implementing a kind of "already indexed" cache, but managing the cache along with a TTL (our data has a TTL) is quite annoying, especially in a distributed environment.
Also, there are still moments when the cache and Elasticsearch are not aligned with each other, so already-indexed data will still be reindexed. So I just want to ask whether there are better settings for this case.
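
For a single process, the "already indexed" cache with a TTL described above might look roughly like the following Java sketch. Everything in it is hypothetical (the `TtlSeenCache` class, its method names, and the injectable clock), and it does not address the distributed case or the window in which the cache and Elasticsearch disagree.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical single-process "already indexed" cache whose entries expire after a TTL. */
final class TtlSeenCache {
    private final ConcurrentHashMap<Long, Long> expiryById = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so that tests can control time

    TtlSeenCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns true if the ID is unknown or its entry has expired, and records it again. */
    boolean shouldIndex(long id) {
        final long now = clock.getAsLong();
        final boolean[] fresh = {false};
        // compute runs atomically per key, so two threads cannot both claim the same ID.
        expiryById.compute(id, (key, expiry) -> {
            if (expiry == null || expiry <= now) {
                fresh[0] = true;
                return now + ttlMillis;
            }
            return expiry;
        });
        return fresh[0];
    }

    /** Drops expired entries; intended to be called periodically to bound memory use. */
    void purgeExpired() {
        final long now = clock.getAsLong();
        expiryById.values().removeIf(expiry -> expiry <= now);
    }

    int size() {
        return expiryById.size();
    }
}
```

In production the clock could be `System::currentTimeMillis`. Choosing a cache TTL shorter than the data's TTL would make the cache err on the side of re-sending documents, which should then be rejected if the CREATE op type behaves as described earlier in the thread.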

Adding some more information: we're using Elasticsearch to manage tags for a time series database. As you already know, time series data arrives at a very fast pace, but the metadata itself does not change much.

Thanks for the support anyway :).

system · November 22, 2018, 8:20am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
