How can I increase Elasticsearch's indexing speed? Bulk can't achieve it!


(LDA) #1

I use bulkIndex or BulkProcessor to index data, with 5 threads. My data set is one million items, about 700 MB, and indexing takes two hours.
I think that's too long. How can I improve it? How can I index faster?


(Christian Dahlqvist) #2

In order for anyone to be able to help you, you will need to provide more details. How are you performing your bulk indexing? What is the size of your bulk requests? What is the size and structure of your documents? What does your cluster look like? What version of Elasticsearch are you using? How many indices and shards are you indexing into?


(LDA) #3

Sorry!
My Elasticsearch version is 1.7.
My cluster has two nodes, and I query the data from Oracle and iterate over it.
This is my code:
public static void proceIndex(Client client, List<Perio> plist) { // Perio: the document POJO
    BulkProcessor bulkProcessor = BulkProcessor.builder(client,
            new BulkProcessor.Listener() {
                long time1 = System.currentTimeMillis();
                long time2 = 0;

                @Override
                public void beforeBulk(long executionId, BulkRequest request) {
                    System.out.println("before bulk-----");
                    time1 = System.currentTimeMillis();
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request,
                        BulkResponse response) {
                    System.out.println("afterBulk bulk-----");
                    time2 = System.currentTimeMillis();
                    System.out.println("Indexed " + request.numberOfActions()
                            + " documents in " + (time2 - time1) / 1000 + " s----");
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request,
                        Throwable failure) {
                    System.out.println("afterBulk failed: " + failure);
                }
            })
            .setBulkActions(10000)
            .setBulkSize(new ByteSizeValue(50, ByteSizeUnit.MB))
            .setFlushInterval(TimeValue.timeValueSeconds(5))
            .setConcurrentRequests(10)
            .build();

    for (int i = 0; i < plist.size(); i++) {
        IndexRequest index = new IndexRequest("dfinder_perio", "perio");
        bulkProcessor.add(index.source(JsonUtil.toJson(plist.get(i)))
                .id(plist.get(i).getArticleId()));
    }

    // Flush buffered requests before returning, otherwise the last batch may never be sent.
    bulkProcessor.flush();
}

(Christian Dahlqvist) #4

I would recommend lowering the size of the bulk requests. Large bulk requests do not necessarily result in improved performance. Try setting it to a few thousand documents and a maximum size of around 5MB.

When you are indexing, what does the cluster look like? Are you saturating disk IO or possibly CPU? Do you see a lot of garbage collection occurring in the Elasticsearch logs? One trick to improve indexing performance for a temporary bulk load is to set the number of replicas to 0 during indexing and then increase it again once indexing has completed. This results in reduced load on the cluster during indexing at the expense of durability, but this can be a good tradeoff.
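A tuned builder along these lines would look as follows. This is only a sketch: the values are illustrative starting points, not prescriptions, and `client` and `listener` are assumed to be set up as in the code posted above.

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

public class TunedBulk {
    // Sketch: a BulkProcessor tuned per the advice above (smaller, fewer requests).
    public static BulkProcessor build(Client client, BulkProcessor.Listener listener) {
        return BulkProcessor.builder(client, listener)
                .setBulkActions(2000)                               // a few thousand docs per request
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // ~5 MB per request
                .setConcurrentRequests(2)                           // fewer requests in flight at once
                .build();
    }
}
```

With 10 concurrent requests of 50 MB each, up to 500 MB can be in flight at once, which can pressure the heap on a small cluster; smaller batches keep that bounded.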


(LDA) #5

Thanks!
I tried to set replicas to 0, but it does not really seem to work.
My yml config looks like this:
#index.analysis.analyzer.default.type : "ik"
cluster.name: elasticsearch
node.name: "node2"
transport.tcp.port: 9302
"number_of_replicas": "0"
"index.refresh_interval": "-1"


(Christian Dahlqvist) #6

Do not change the default values in the Elasticsearch config file to achieve this. Instead update the index settings through the API. Change replicas to 0 for the index prior to starting the bulk job and then set it back to the default value once the job has finished.
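With the 1.x Java client that could look like the sketch below, assuming the index name `dfinder_perio` from the code above and a default of 1 replica (adjust to your actual default):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

// Before the bulk job: drop replicas to 0 for the target index.
client.admin().indices().prepareUpdateSettings("dfinder_perio")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.number_of_replicas", 0))
        .get();

// ... run the bulk indexing job ...

// After the bulk job: restore replicas (1 assumed as the default here).
client.admin().indices().prepareUpdateSettings("dfinder_perio")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.number_of_replicas", 1))
        .get();
```

This changes the live index settings only; the values in `elasticsearch.yml` stay at their defaults for any new indices.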


(LDA) #7

Thanks for your advice. I just updated the bulkActions, bulkSize, and number_of_replicas settings. Now while indexing, the size shown in the head page changes when I refresh the page, but the doc count stays at 0.
Is there a reason for this? Is it only shown once processing ends?

thanks!


(LDA) #8

And the logs in the console show that the job is indexing:
before bulk-----
current page: 163
afterBulk bulk-----
Total indexed: , elapsed: 0 s----
before bulk-----
afterBulk bulk-----
Total indexed: , elapsed: 0 s----


(LDA) #9

Thanks! I updated this, but it does not seem any better.


(Christian Dahlqvist) #10

How is the cluster looking during indexing? Is there anything in the logs? How does CPU and disk IO look?

If performance does not increase when setting replicas to 0 and there is no identifiable factor limiting performance, I would recommend performing a separate indexing benchmark with similar documents, e.g. using Logstash with a file input, to see what the limit of the cluster is and make sure it is not the source system that is limiting throughput.
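A benchmark config for that era of Logstash (1.5/2.x, matching Elasticsearch 1.7) might look like the sketch below. The file path, index name, and host are placeholders; it assumes one JSON document per line in the input file.

```
input {
  file {
    path => "/tmp/perio-sample.json"   # placeholder: a file of sample documents, one JSON object per line
    start_position => "beginning"
  }
}
filter {
  json { source => "message" }         # parse each line into fields
}
output {
  elasticsearch {
    host  => "localhost"               # placeholder: one of the cluster nodes
    index => "dfinder_perio_benchmark" # placeholder: a throwaway benchmark index
  }
}
```

If this pipeline indexes much faster than the Java job, the bottleneck is on the Oracle/extraction side rather than in the cluster.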


(LDA) #11

Maybe it is the for loop: it queries the data from the Oracle DB. I select it page by page with a page size of 5000 items, and my total data is
26,000,000 items. Selecting one page takes 100 seconds.
Do you have any ideas for dealing with this?


(Christian Dahlqvist) #12

If Oracle is the bottleneck I am afraid I will not be able to help.


(system) #13