@dadoonet: Sure, bulk requests would help, but we actually receive a high volume of updates, so incremental indexing matters. For nightly indexing, bulk indexing is the right option, as you pointed out. Thanks for mentioning the 12k docs per second figure; I now have something to compare against.
@jasontedor: About the fsync of translog.
As I mentioned earlier, on ES 1.4.5 with default config options I got ~1ms per doc.
With ES 2.3.3 and the default config (index.translog.durability = request) I got ~3ms per doc.
With ES 2.3.3 and index.translog.durability = async I got ~1.6ms per doc.
So with async the times did come down, but not to the level of ES 1.4.5.
Also, I will post the hot threads from my bulk ingestion as soon as I can.
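In case it matters, this is roughly how async durability was switched on for the 2.3.3 run; a minimal sketch assuming the 2.x Java TransportClient, with "my_index" and "client" as placeholders for my actual index and client:

import org.elasticsearch.common.settings.Settings;

// Flip the (dynamic) per-index translog durability setting to async.
// index.translog.sync_interval is left at its default here.
client.admin().indices().prepareUpdateSettings("my_index")
        .setSettings(Settings.settingsBuilder()
                .put("index.translog.durability", "async"))
        .get();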
Yes, it probably would help but then it's not an apples-to-apples comparison. I really want to understand why the performance drop is so steep here, 3x seems too high to be explained by the per-request fsyncs alone. It could be, I don't know, maybe the disks are really slow. But I want to make sure that there is nothing else here so we should just keep things constant.
You have nothing to compare with because you don't know the size of the documents, how many nodes there are, how many client threads are writing, what the underlying hardware is, whether or not there are dynamic mapping updates occurring, and many other extremely relevant variables.
Right, and we would expect that to be the case if there's more to the performance drop here than just the per-request fsyncs.
I also want to know why, with everything the same except the ES version, the indexing time becomes 3x. I want to fix it; otherwise I will have to stay on the old ES 1.7.1 instead of 2.3.2.
@tingking23 I wonder if this is due to some difference in how we index things. Could you go back to your 1.x version and create a new index with all the mappings, but without indexing any documents into it? Then upgrade it to 2.x and run your indexing test. I really want to know whether we changed something in mappings that makes these numbers go haywire; we do a lot of different things based on the version the index was created with.
Are you setting refresh_interval = -1 while indexing?
Also, please confirm whether index.translog.durability = async has actually been applied, or whether there was some issue while setting it. I can't think of anything else for now that would explain why you are not seeing the gain.
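If you want to double-check from the Java client itself, something like this could show what the index is actually using (just a sketch; "my_index" is a placeholder, and a null value means the setting was never overridden):

import org.elasticsearch.action.admin.indices.settings.get.GetSettingsResponse;

GetSettingsResponse resp = client.admin().indices().prepareGetSettings("my_index").get();
// null here means the index is still on the defaults
// (refresh every 1s, durability = request).
System.out.println("index.refresh_interval    = " + resp.getSetting("my_index", "index.refresh_interval"));
System.out.println("index.translog.durability = " + resp.getSetting("my_index", "index.translog.durability"));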
Hot threads while bulk indexing the real data won't be possible for now, as we are currently heavy on incremental indexing and a 5-6 hour nightly indexing job is acceptable at the moment.
However, I can write bulk indexing code for dummy data and post the hot threads from that run; I believe that should be fine.
Hot threads are a point-in-time snapshot. How many samples of hot_threads responses would be good for you?
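For reference, I was planning to capture them roughly like this; just a sketch, and the sample count, thread count and spacing are guesses on my part, not anything you suggested:

import org.elasticsearch.action.admin.cluster.node.hotthreads.NodeHotThreads;
import org.elasticsearch.action.admin.cluster.node.hotthreads.NodesHotThreadsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Grab a few hot-thread snapshots while the dummy bulk job is running.
static void sampleHotThreads(Client client) throws InterruptedException {
    for (int sample = 0; sample < 5; sample++) {
        NodesHotThreadsResponse resp = client.admin().cluster()
                .prepareNodesHotThreads()
                .setThreads(3)                                // top 3 hottest threads per node
                .setInterval(TimeValue.timeValueMillis(500))  // sampling interval used by the API
                .get();
        for (NodeHotThreads node : resp.getNodes()) {
            System.out.println(node.getHotThreads());
        }
        Thread.sleep(10_000);                                 // space the snapshots ~10s apart
    }
}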
This is my code to create the index:
String json = "{\"settings\":{\"refresh_interval\":\"-1\",\"number_of_replicas\":\"0\"},\"mappings\":{\""
        + pro.getProperty(Conf.ES_TYPE)
        + "\":{\"date_detection\":false,\"_all\":{\"enabled\":false},\"properties\":{\"t02_cust_relation\":{\"properties\":{\"cust_relation\":{\"type\":\"nested\",\"properties\":{\"r_id_nbr\":{\"type\":\"string\"},\"r_phone_nbr\":{\"type\":\"string\"},\"r_source\":{\"type\":\"string\"},\"r_time\":{\"type\":\"string\"},\"r_type\":{\"type\":\"string\"}}}}},\"t01_cust_base\":{\"properties\":{\"cust_id_nbr\":{\"type\":\"string\",\"index\":\"not_analyzed\"}}}}}}}";
I tried it as well, but it does not work.
I changed from the bulk API to BulkProcessor and the time came down to 1.5h, but it is still slower than ES 1.7.1; the old ES 1.7.1 took almost 1h.
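This is roughly what the BulkProcessor change looks like on my side; a sketch only, and the batch size / concurrency numbers are just the values I tried, not tuned recommendations:

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
    @Override
    public void beforeBulk(long executionId, BulkRequest request) { }

    @Override
    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());  // log partial failures
        }
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        failure.printStackTrace();                               // whole bulk failed
    }
})
        .setBulkActions(5000)                                    // flush every 5000 docs
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))      // ...or every 5 MB
        .setConcurrentRequests(2)                                // allow 2 bulks in flight
        .setFlushInterval(TimeValue.timeValueSeconds(5))
        .build();
// documents are then added with bulkProcessor.add(indexRequest) and the
// processor is drained with bulkProcessor.awaitClose(...) when the job ends.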