Bad bulk performance with self-generated id

Hi, all.

We recently started using ES to store monitoring data and depend on self-generated IDs to remove duplicate data. Our ES configuration is as follows:

3 nodes (24 cores, 128 GB RAM, 3 TB SSD each)
-Xms30g -Xmx30g
indices.memory.index_buffer_size: 15%
index.store.throttle.type: none

After running for 24 hours, bulk indexing is about 5x slower than with auto-generated IDs.

We have adjusted our ID to: (timestamp/1000) + md5(monitor object fields) + (timestamp%1000), referring to the following blog:
Choosing a fast unique identifier (UUID) for Lucene
I would add that splitting the timestamp is mainly a storage consideration; without it, disk usage rose by about 25%.
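For reference, a rough sketch of how we build the ID (Python; the field values are just an example):

```
# Sketch of the ID layout: (timestamp/1000) + md5(fields) + (timestamp%1000).
import hashlib

def make_id(timestamp_ms, monitor_fields):
    seconds = timestamp_ms // 1000   # coarse, mostly-sequential prefix
    millis = timestamp_ms % 1000     # remainder appended at the end
    digest = hashlib.md5("|".join(monitor_fields).encode("utf-8")).hexdigest()
    return "{}{}{:03d}".format(seconds, digest, millis)

# Two identical monitor events produce the same ID, so the second write
# overwrites the first; that is how we deduplicate.
print(make_id(1489543212345, ["host-01", "cpu.load", "dc1"]))
```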

According to jstack and jvmtop, CPU is mainly consumed by doc ID lookups. We are trying to generate larger initial segments and speed up the merge process, so that fewer segments need to be checked during each lookup.

I hope I have explained my question clearly; any help is appreciated.

Have you plotted indexing throughput as a function of shard/index size? Does your data arrive in near real time, so that timestamps are largely sequential? How large a portion of your data ends up being updates?

The data is mainly monitoring data, which arrives continuously.
Only a very small portion will be updated.

How large are the shards after the 24 hours?

We have 27 indices. The big indices each have multiple shards of about 10 GB; the small indices each have 3 shards.

@Christian_Dahlqvist Any better idea?

What is the output of the cluster stats API? What is the size of your bulk requests?

Also, which Elasticsearch version are you using?

If I understand the scenario correctly, what you may be seeing is the added cost of doing a read on a growing pile of docs with every write. If you supply the IDs, there's no way for you to tell us "trust me, this doc doesn't exist"; we always have to check that the ID is not already present.
If you update rarely, then maybe use auto-generated IDs, add a "my_id" field, and use update-by-query on it for those rare scenarios where you need to update. Updates will be slower and you lose the only-one guarantee, but the inserts should be faster.
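A minimal sketch of that approach, assuming an index named monitor-data on localhost and a version where the _update_by_query API is available (exact script syntax varies by version; this is 5.x-style Painless):

```
import requests

ES = "http://localhost:9200"

# Normal write path: let Elasticsearch auto-generate _id and keep your own
# key in a regular "my_id" field (no uniqueness check on writes).
doc = {"my_id": "1489543212-abc123", "host": "host-01", "value": 0.42}
requests.post("{}/monitor-data/event".format(ES), json=doc)

# Rare update path: rewrite every doc that carries that my_id.
update = {
    "query": {"term": {"my_id": "1489543212-abc123"}},
    "script": {"inline": "ctx._source.value = params.value",
               "params": {"value": 0.99}},
}
requests.post("{}/monitor-data/_update_by_query".format(ES), json=update)
```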

Each bulk request is 2k events. The cluster state is not available; what information do you want?

We depend on the ID to remove duplicate data, so we always have to query by the ID.
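For reference, a simplified sketch of what one of our bulk requests looks like (index and type names are placeholders; we send roughly 2000 events per request):

```
import json
import requests

ES = "http://localhost:9200"

def bulk_index(events):
    lines = []
    for event in events:
        # Supplying _id is what forces Elasticsearch to check whether the ID
        # already exists, which is the lookup cost discussed in this thread.
        lines.append(json.dumps({"index": {"_index": "monitor-data",
                                           "_type": "event",
                                           "_id": event["id"]}}))
        lines.append(json.dumps(event["doc"]))
    body = "\n".join(lines) + "\n"
    return requests.post("{}/_bulk".format(ES), data=body,
                         headers={"Content-Type": "application/x-ndjson"})
```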

You can have a field called "my_id" (or whatever) and query on that.
Pros:
Elasticsearch couldn't care less about what you put in there, so writes need not be slowed down with uniqueness checks for my_id values.
Cons:
Elasticsearch couldn't care less about what you put in there.

To perform an update, it is possible to run a query for all docs with my_id:foo and either delete them or update them (a sketch follows after the list below). The downside of this is that, unlike the Elasticsearch-managed id field:

  1. By default there is no fast-routing that knows which shard foo is on - all shards must be searched.
  2. There are no guarantees that the index doesn't contain 2 docs with my_id:foo if your client app inserts the same data twice (forgetting to delete or update any prior doc).
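A sketch of that workflow, with a placeholder index name and my_id value (note that _delete_by_query is built in from 5.x; on earlier versions it was a plugin):

```
import requests

ES = "http://localhost:9200"
query = {"query": {"term": {"my_id": "foo"}}}

# 1. Find any existing copies; with no routing shortcut this hits every shard.
hits = requests.post("{}/monitor-data/_search".format(ES), json=query).json()

# 2. Remove them before writing the replacement document.
requests.post("{}/monitor-data/_delete_by_query".format(ES), json=query)

# 3. Index the new version with an auto-generated _id.
requests.post("{}/monitor-data/event".format(ES),
              json={"my_id": "foo", "value": 0.99})
```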

(Perhaps worth pointing out that this technique resolved a performance issue for a user indexing all-the-tweets-in-the-world using the original tweet id. That was an earlier version of Elasticsearch and ID lookups may have improved since, but a design with no lookups will always be faster than one that requires them.)

I am not asking for the cluster state, just cluster stats (statistics), to get an overview of the cluster.
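For example, something like this would be enough to get the overview (assuming the node is reachable on localhost:9200):

```
import requests

# _cluster/stats gives node counts, heap usage, index/shard/segment counts
# and store size in one response.
stats = requests.get("http://localhost:9200/_cluster/stats").json()
print(stats["indices"]["count"], "indices,",
      stats["indices"]["shards"]["total"], "shards,",
      stats["indices"]["segments"]["count"], "segments")
```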

Do you have monitoring installed?

The cluster state is not available

Why?

Thanks for the reply. We just stopped the test today; I will reproduce the environment later and supply the cluster state.

Not needed as @Christian_Dahlqvist said but I was just curious about why you said that.

BTW, any advice about the following? (A sketch of the settings we were considering is below.)

  1. How to generate larger initial segments?
  2. How to merge small (floor) segments first?
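What we had in mind is roughly the following (standard dynamic index settings; the index name is a placeholder and we have not verified this is the right approach):

```
import requests

settings = {
    "index": {
        # Refresh less often while bulk loading, so each flushed segment is
        # larger (fewer initial segments to look through).
        "refresh_interval": "30s",
        # Treat segments smaller than this as equal when picking merges,
        # so tiny segments get merged away sooner.
        "merge.policy.floor_segment": "16mb",
    }
}
requests.put("http://localhost:9200/monitor-data/_settings", json=settings)
```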

I'd not try to change Elasticsearch's behavior. Why do you think the defaults are not good?
