After running for 24 hours, bulk indexing with our own ids is about 5x slower than with auto-generated ids.
We have changed our id to (timestamp/1000) + md5(monitor object fields) + (timestamp%1000), following the blog post "Choosing a fast unique identifier (UUID) for Lucene".
I would add that the timestamp is split mainly for storage reasons. Without the split, disk usage rose by 25%.
According to jstack and jvmtop, most of the CPU time is spent in doc id lookups. We are trying to generate larger initial segments and speed up merging, so that fewer segments need to be searched on each lookup.
I hope I have explained my question clearly; any help is appreciated.
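For anyone following along, here is a rough sketch of how an id of that shape could be built. It assumes the timestamp is in milliseconds, that the parts are concatenated as strings, and that the fields are hashed in sorted key order; the name make_doc_id and the field serialization are made up for illustration and are not taken from the original poster's code.

```python
import hashlib


def make_doc_id(timestamp_ms: int, fields: dict) -> str:
    """Build an id shaped like (timestamp/1000) + md5(fields) + (timestamp%1000).

    Assumptions (not from the original post): timestamp_ms is epoch
    milliseconds, and the fields are serialized as sorted key=value pairs.
    """
    seconds, millis = divmod(timestamp_ms, 1000)
    # Sort the keys so the same fields always hash to the same digest.
    canonical = "|".join(f"{key}={fields[key]}" for key in sorted(fields))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    # Seconds go first so ids created close together share a common prefix;
    # the millisecond remainder goes last, zero-padded to three digits.
    return f"{seconds}{digest}{millis:03d}"
```

With the seconds part leading, ids generated around the same time share a prefix, which is presumably what makes the split form cheaper to store than putting the full millisecond timestamp up front.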
Have you plotted indexing throughput as a function of shard/index size? Is your data arriving in near real-time so that timestamps are largely sequential? How large a portion of your data ends up being updates?
If I understand the scenario correctly what you may be seeing is the added cost of doing a read on a growing pile of docs with every write. If you supply the ids there's no way for you to tell us "trust me, this doc doesn't exist". We'll always have to check that id is not present already.
If you update rarely then maybe use auto generated ids and add a "my_id" field and use update by query on that for those rare scenarios where you need to update. It'll be slower to update and you lose the only-one guarantee but the inserts should be faster.
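To make the difference concrete, here is a small sketch that builds a bulk request body either way. It assumes a recent Elasticsearch version where the bulk action line needs no type; the function name build_bulk_body and the index name are hypothetical. With explicit_ids left off, the action line carries no _id, so Elasticsearch generates one and your own identifier travels only as the ordinary "my_id" field in the document.

```python
import json


def build_bulk_body(docs, index_name, explicit_ids=False):
    """Return an NDJSON bulk body for the given docs.

    Each doc is assumed to carry its own identifier in a "my_id" field.
    explicit_ids=False -> no _id in the action line (auto-generated ids).
    explicit_ids=True  -> my_id is also sent as _id (forces an id lookup).
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name}}
        if explicit_ids:
            action["index"]["_id"] = doc["my_id"]
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The returned string would be sent to the _bulk endpoint with a Content-Type of application/x-ndjson.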
You can have a field called "my_id" (or whatever) and query on that.
Pros:
Elasticsearch couldn't care less about what you put in there, so writes need not be slowed down with uniqueness checks for my_id values.
Cons:
Elasticsearch couldn't care less about what you put in there.
It is possible to run a query for all docs with my_id:foo and either delete them or update them in order to perform an update. The downside of this is that unlike the elasticsearch-managed id field:
By default there is no fast-routing that knows which shard foo is on - all shards must be searched.
There are no guarantees that the index doesn't contain 2 docs with my_id:foo if your client app inserts the same data twice (forgetting to delete or update any prior doc).
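As a rough illustration of the update path, the sketch below builds a request body for the update by query API. It assumes my_id is mapped as a keyword field (so a term query matches it exactly) and that a Painless script copying values into ctx._source is acceptable; the function name build_update_by_query is hypothetical, and older Elasticsearch versions used "inline" rather than "source" for the script text.

```python
def build_update_by_query(my_id_value, changes):
    """Return a body for POST /<index>/_update_by_query.

    Every doc whose my_id equals my_id_value gets the given field changes.
    Assumes my_id is a keyword field; this is not enforced here.
    """
    return {
        "query": {"term": {"my_id": my_id_value}},
        "script": {
            "lang": "painless",
            # ctx._source is a map, so putAll overwrites the listed fields.
            "source": "ctx._source.putAll(params.changes)",
            "params": {"changes": changes},
        },
    }
```

Because the request carries no routing information, it runs against every shard of the index, and it will update every matching doc if duplicates of my_id:foo exist.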
(Perhaps worth pointing out this technique resolved a performance issue for a user indexing all-the-tweets-in-the-world using the original tweet id. That was an earlier version of elasticsearch and id lookups may have improved since but a design with no lookups will always be faster than one that requires them.)