Elasticsearch index performance with mostly duplicate documents

Hi Elasticsearch team, I have an application that generates ~1 million documents per second, but most of the documents are duplicates. The total number of unique documents is below 50 million. We have to use our own IDs as document IDs (currently unique uint64 values).

Currently, I'm using BulkProcessor against a cluster of 3 nodes (1 master, 2 data nodes). Each node has ~64 GB RAM and 20 cores.

I observe that when the index is empty, BulkProcessor does well, but as the index fills up, insert performance degrades very fast, perhaps due to the cost of looking up duplicate IDs. So I just want to confirm:

  • Is the performance degradation caused by ID lookups?
  • If so, is there any strategy to get high performance in my case?

Which version of Elasticsearch are you using? Are you using custom document IDs in order to avoid duplicates?

We're using the latest, 6.4.2. And yes, we set document IDs manually to avoid duplicates.

Then you will be updating the same document frequently, potentially multiple times per bulk request. Whenever Elasticsearch identifies that a document to be updated is still in the transaction log, it will trigger a refresh, which is an expensive operation. This will reduce indexing throughput dramatically and cause a lot of disk I/O.

Elasticsearch is not, and as far as I know cannot be, optimised for this use case, so I would recommend removing as many duplicates as possible before indexing into Elasticsearch.

@Christian_Dahlqvist Thanks for the detailed information. Actually,

Then you will be updating the same document frequently

I set OP_TYPE to CREATE, which AFAIK means the request is rejected if the document already exists:

 final IndexRequestBuilder idb = connection
                            .prepareIndex(MY_INDEX, DOC_TYPE, docId)
                            .setOpType(DocWriteRequest.OpType.CREATE);
So even when using CREATE, does it still trigger something expensive?

I do not know. Have you tested whether it makes a difference?

I still suspect comparing against an in-memory map and dropping duplicates before indexing would be much faster and more efficient.
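For what it's worth, a minimal sketch of that kind of pre-index dedup filter in Java (class and method names here are made up; in a real pipeline this would sit in front of the BulkProcessor):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Drops documents whose ID has already been seen, so only the first
// occurrence of each ID is handed to the bulk indexer.
public class DedupFilter {
    private final Set<Long> seenIds = ConcurrentHashMap.newKeySet();

    /** Returns true if the ID is new and the document should be indexed. */
    public boolean shouldIndex(long docId) {
        return seenIds.add(docId); // add() returns false if already present
    }

    public static void main(String[] args) {
        DedupFilter filter = new DedupFilter();
        System.out.println(filter.shouldIndex(42L)); // true: first time seen
        System.out.println(filter.shouldIndex(42L)); // false: duplicate, drop it
    }
}
```

Note that with ~50 million unique IDs, a set of boxed `Long`s costs a few GB of heap; a primitive-long set or a Bloom filter would be considerably cheaper if memory matters.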

Yes, I just tested two scenarios:

    1. Index from scratch into an empty index
    2. Index into a nearly full index

In scenario 1, the bulk processing is very stable and fast. In scenario 2, it's fast for the first few minutes and extremely slow afterwards.

Actually, we're already implementing a kind of "already indexed" cache, but managing the cache along with a TTL (our data has a TTL) is quite annoying (especially in a distributed environment).
Also, there are still moments when the cache and Elasticsearch are not aligned with each other, so we will still end up reindexing already-indexed data. So I just want to ask whether there are better settings for that case.
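One way to fold the TTL into the cache itself is to store an expiry time per ID, so an ID becomes indexable again once its previous copy is presumed expired in Elasticsearch. A rough single-node sketch, assuming a made-up 60-second TTL and passing the clock in explicitly:

```java
import java.util.concurrent.ConcurrentHashMap;

// "Already indexed" cache whose entries expire after a TTL, so a document
// is allowed through again once its previous copy is presumed gone.
public class TtlDedupCache {
    private final ConcurrentHashMap<Long, Long> expiryById = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlDedupCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true if the document should be (re)indexed at time nowMillis. */
    public boolean shouldIndex(long docId, long nowMillis) {
        final boolean[] index = {false};
        expiryById.compute(docId, (id, expiry) -> {
            if (expiry == null || expiry <= nowMillis) {
                index[0] = true;              // absent or expired: index it
                return nowMillis + ttlMillis; // remember it until the TTL runs out
            }
            return expiry;                    // still fresh: drop as duplicate
        });
        return index[0];
    }

    public static void main(String[] args) {
        TtlDedupCache cache = new TtlDedupCache(60_000); // 60 s TTL (made-up value)
        System.out.println(cache.shouldIndex(7L, 0));      // true: first sight
        System.out.println(cache.shouldIndex(7L, 30_000)); // false: still cached
        System.out.println(cache.shouldIndex(7L, 60_000)); // true: TTL expired
    }
}
```

Expired entries would still need periodic pruning to bound memory, and this does nothing for the cache/Elasticsearch misalignment you mention (e.g. after a restart), which is why keeping OP_TYPE CREATE as a backstop seems sensible regardless.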

Adding some more information: we're using Elasticsearch to manage the tags of a time series database. As you know, TSDB data arrives at a very fast pace, but the metadata itself does not change much.

Thanks for the support anyway :).

