Bad bulk performance with self-generated id

ginger · October 10, 2017, 10:18am

Hi, all.

We recently started using ES to store monitoring data, and we rely on self-generated ids to remove duplicate data. Our ES configuration is as follows:

3 nodes (24 cores, 128GB, 3TB SSD)
-Xms30g -Xmx30g
indices.memory.index_buffer_size: 15%
index.store.throttle.type: none

After running for 24h, bulk performance is about 5x slower than with auto-generated ids.

We have adjusted our id to: (timestamp/1000) + md5(monitor object fields) + (timestamp%1000), referring to the following blog:
Choosing a fast unique identifier (UUID) for Lucene
I would add that the timestamp is split mainly for storage reasons. Without the split, disk usage rose by 25%.
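The id layout described above can be sketched roughly as follows. This is a minimal Python sketch, not the actual code in use: the function name `make_doc_id`, the way the fields are canonicalized before hashing, and the zero-padded decimal encoding of the timestamp parts are all assumptions made for illustration.

```python
import hashlib


def make_doc_id(timestamp_ms: int, fields: dict) -> str:
    """Build an id as: seconds prefix + md5(fields) + milliseconds suffix."""
    # Split the millisecond timestamp into whole seconds and the remainder.
    seconds, millis = divmod(timestamp_ms, 1000)
    # Sort the keys so the same fields always produce the same hash
    # (hypothetical canonical form; any stable serialization would do).
    canonical = "|".join(f"{key}={fields[key]}" for key in sorted(fields))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    # Ids created within the same second share a common prefix.
    return f"{seconds:010d}{digest}{millis:03d}"


if __name__ == "__main__":
    print(make_doc_id(1507630680123, {"host": "web-1", "metric": "cpu"}))
```

The same event always maps to the same id, so re-sending it overwrites the existing document instead of creating a duplicate.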

According to jstack and jvmtop, most of the CPU time is spent in doc id lookups. We are trying to generate larger initial segments and speed up the merge process, so that fewer segments need to be searched on each lookup.

I hope I have explained my question clearly. Any help is appreciated.

Christian_Dahlqvist · October 10, 2017, 10:46am

Have you plotted indexing throughput as a function of shard/index size? Is your data arriving in near real-time so that timestamps are largely sequential? How large a portion of your data ends up being updates?

ginger · October 10, 2017, 11:29am

The data is mainly monitoring data, which arrives continuously.
Only a very small portion will be updated.

Christian_Dahlqvist · October 10, 2017, 11:32am

How large are the shards after the 24 hours?

ginger · October 10, 2017, 11:33am

We have 27 indices. Each of the big indices has multiple shards of about 10GB each; each of the small indices has 3 shards.

ginger · October 11, 2017, 5:48am

@Christian_Dahlqvist Any better idea?

Christian_Dahlqvist · October 11, 2017, 6:14am

What is the output of the cluster stats API? What is the size of your bulk requests?

Also, which Elasticsearch version are you using?

Mark_Harwood · October 11, 2017, 6:52am

If I understand the scenario correctly what you may be seeing is the added cost of doing a read on a growing pile of docs with every write. If you supply the ids there's no way for you to tell us "trust me, this doc doesn't exist". We'll always have to check that id is not present already.
If you update rarely then maybe use auto generated ids and add a "my_id" field and use update by query on that for those rare scenarios where you need to update. It'll be slower to update and you lose the only-one guarantee but the inserts should be faster.

ginger · October 11, 2017, 8:07am

Each bulk request contains 2k events. The cluster state is not available; what information do you want?

ginger · October 11, 2017, 8:16am

We depend on the id to remove duplicate data, so we always have to query by the id.

Mark_Harwood · October 11, 2017, 8:28am

You can have a field called "my_id" (or whatever) and query on that.
Pros :
Elasticsearch couldn't care less about what you put in there so writes need not be slowed down with uniqueness checks for my_id values.
Cons :
Elasticsearch couldn't care less about what you put in there

It is possible to run a query for all docs with my_id:foo and either delete them or update them in order to perform an update. The downside of this is that unlike the elasticsearch-managed id field:

By default there is no fast-routing that knows which shard foo is on - all shards must be searched.
There are no guarantees that the index doesn't contain 2 docs with my_id:foo if your client app inserts the same data twice (forgetting to delete or update any prior doc).

(Perhaps worth pointing out this technique resolved a performance issue for a user indexing all-the-tweets-in-the-world using the original tweet id. That was an earlier version of elasticsearch and id lookups may have improved since but a design with no lookups will always be faster than one that requires them.)
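The approach described in this post could look roughly like the sketch below. It only builds request bodies and sends nothing; the helper names `bulk_body` and `by_id_query` are hypothetical, and it assumes `my_id` is mapped as a keyword (not analyzed) field so that a term query matches the exact value.

```python
import json


def bulk_body(docs):
    """Build a bulk request body whose index actions carry no _id."""
    lines = []
    for doc in docs:
        # Empty action metadata: Elasticsearch auto-generates the _id,
        # so no per-document lookup of an existing id is needed.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"


def by_id_query(value, id_field="my_id"):
    """Build a query body matching every doc whose id field equals value."""
    return {"query": {"term": {id_field: value}}}


if __name__ == "__main__":
    print(bulk_body([{"my_id": "foo", "value": 1}]), end="")
    print(json.dumps(by_id_query("foo")))
```

The bulk body would be posted to the index's `_bulk` endpoint, and the query body to `_delete_by_query` or `_update_by_query` (the latter also needs a script) for the rare updates. As noted above, that query hits all shards, and nothing prevents two docs from sharing the same `my_id`.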

Christian_Dahlqvist · October 11, 2017, 8:37am

I am not asking for the cluster state, just the cluster stats (statistics) to get an overview of the cluster.

Do you have monitoring installed?
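For reference, cluster stats and cluster state are two separate endpoints, `_cluster/stats` and `_cluster/state`. A minimal Python sketch of fetching the stats, assuming a node reachable at `http://localhost:9200` without authentication (the helper names are hypothetical):

```python
import json
from urllib.request import urlopen


def cluster_url(kind, base_url="http://localhost:9200"):
    """Return the URL of the cluster 'stats' or 'state' endpoint."""
    if kind not in ("stats", "state"):
        raise ValueError("kind must be 'stats' or 'state'")
    return f"{base_url.rstrip('/')}/_cluster/{kind}"


def fetch_cluster_stats(base_url="http://localhost:9200"):
    """GET the cluster stats and return the parsed JSON response."""
    with urlopen(cluster_url("stats", base_url)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URLs; call fetch_cluster_stats() against a live node.
    print(cluster_url("stats"))
    print(cluster_url("state"))
```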

dadoonet · October 11, 2017, 8:46am

The cluster state is not available

Why?

ginger · October 11, 2017, 9:48am

Thanks for the reply. We just stopped the test today. I will reproduce the environment later and supply the cluster state.

dadoonet · October 11, 2017, 10:00am

Not needed as @Christian_Dahlqvist said but I was just curious about why you said that.

ginger · October 12, 2017, 2:40am

BTW, any advice about:

How to generate larger initial segments?
How to merge floor segments first?

dadoonet · October 12, 2017, 4:35am

I’d not try to change elasticsearch behavior. Why do you think defaults are not good?

