We recently started using Elasticsearch (ES) to store monitoring data, and we rely on self-generated document IDs to remove duplicates. Our ES setup is as follows:
3 nodes (24 cores, 128 GB RAM, 3 TB SSD)
After running for 24 hours, bulk indexing is about 5x slower than it is with auto-generated IDs.
We have changed our ID to (timestamp/1000) + md5(monitor object fields) + (timestamp%1000), based on the following blog post:
Choosing a fast unique identifier (UUID) for Lucene
I would add that the timestamp is split mainly for storage reasons. Without the split, disk usage rose by 25%.
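To make the ID layout concrete, here is a minimal Python sketch of how such an ID could be built. It is an illustration only, not our production code: it assumes the timestamp is in epoch milliseconds, that the parts are concatenated as strings, and that the millisecond remainder is zero-padded to three digits. The function name `make_doc_id` and the way the fields are serialized before hashing are hypothetical.

```python
import hashlib


def make_doc_id(timestamp_ms: int, fields: dict) -> str:
    """Build an ID as (timestamp/1000) + md5(fields) + (timestamp%1000).

    Assumptions (not stated in the post): timestamp_ms is epoch
    milliseconds, and the three parts are joined as plain strings.
    """
    # Split the timestamp into whole seconds and the millisecond remainder.
    seconds, millis = divmod(timestamp_ms, 1000)

    # Serialize the fields in sorted key order so the same monitored
    # object always hashes to the same digest (hypothetical format).
    canonical = "|".join(f"{key}={fields[key]}" for key in sorted(fields))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()

    # Seconds go first so that IDs created close together in time share a
    # prefix; the zero-padded milliseconds go last (padding is an assumption).
    return f"{seconds}{digest}{millis:03d}"


example_id = make_doc_id(1500000000123, {"host": "web-01", "metric": "cpu"})
print(example_id)
```

The same monitored object reported twice with the same timestamp yields the same ID, which is what lets ES overwrite the duplicate instead of indexing it again.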
According to jstack and jvmtop, most of the CPU time is spent on doc ID lookups. We are trying to generate larger initial segments and speed up merging, so that fewer segments have to be checked on each lookup.
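For reference, the kind of index settings involved can be sketched as below. This is only a rough illustration: the helper name `bulk_heavy_index_settings` is hypothetical and the values are placeholders, not settings we have validated. The setting names themselves (`index.refresh_interval`, `index.translog.flush_threshold_size`, `index.merge.scheduler.max_thread_count`) are standard ES index settings. The resulting body would be sent to the index's `_settings` endpoint.

```python
import json


def bulk_heavy_index_settings(refresh_interval: str = "30s",
                              flush_threshold: str = "1gb",
                              merge_threads: int = 4) -> dict:
    """Return an index settings body aimed at fewer, larger segments.

    All default values are placeholders for illustration only.
    """
    return {
        "index": {
            # Refreshing less often lets more documents accumulate in the
            # indexing buffer before a new segment is written.
            "refresh_interval": refresh_interval,
            # A larger translog threshold means less frequent flushes.
            "translog": {"flush_threshold_size": flush_threshold},
            # More merge threads can help merging keep up on SSDs.
            "merge": {"scheduler": {"max_thread_count": merge_threads}},
        }
    }


print(json.dumps(bulk_heavy_index_settings(), indent=2))
```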
I hope I have explained my question clearly. Any help is appreciated.