Bulk Indexing performance questions

  1. I am trying to send bulk requests of 10k documents of 1.5 KB each, using the Python elasticsearch client, from 20 separate client threads to a single ES node. I am seeing that my requests time out with the following error:
    ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'xx.xx.xx.xx', port=9200): Read timed out. (read timeout=30))

So I changed the client's default timeout value to 30s. Requests still time out, only now after a long time. A rough sketch of the call I am making is below, after the questions.
a. Is there any deterministic way of calculating the bulk size if my requests are of the same size throughout?
b. Are there any metrics or stats that can indicate an upcoming bulk index failure? Can I query them via an API? Which cluster parameters are correlated with this?
c. I was checking the thread pool queue size when the bulk index request failed. The queue size was around 300, while my queue is set to 1000. I had assumed the bulk request failed because the queue was full, but that was not the case. Does one bulk request (of X docs) occupy one spot in the queue?

  2. In order to improve search performance, I am planning to use routing on fields. I don't see any way to set a routing value in Kibana. What is the best way to use routing with Kibana? Also, I noticed that Kibana always uses the search API. In this case, does routing actually matter (since all shards have to be queried for every request from Kibana)? Please correct me if I am wrong.
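
For reference, here is roughly what each of my indexing threads is doing (a minimal sketch using elasticsearch-py's bulk helper; the host, index name, and document contents are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

# Placeholder host; each of the 20 client threads builds its own client like this.
es = Elasticsearch(["http://xx.xx.xx.xx:9200"], timeout=30)

def generate_actions(docs, index_name="my-index"):
    # Wrap ~1.5 KB documents as bulk actions; index name and doc type are placeholders.
    for doc in docs:
        yield {"_index": index_name, "_type": "doc", "_source": doc}

# Roughly 10k documents of ~1.5 KB each, sent as a single bulk request.
docs = ({"n": i, "payload": "x" * 1500} for i in range(10000))
helpers.bulk(es, generate_actions(docs), chunk_size=10000, request_timeout=30)
```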

As each bulk request is around 15MB, I suspect you are overloading the cluster with that many concurrent requests as it is only a single node. I would recommend trying with smaller bulk sizes, e.g. 1000-2000 documents per request, and/or reducing the number of concurrent indexing threads. Once you have reached a level where you can index without timeouts you can start slowly increasing parameters until you find an optimum. There is no point throwing that much data at Elasticsearch if it is not able to process it in a timely manner. You can naturally also scale up and/or out your cluster to make it more performant.
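
If you are using the elasticsearch-py helpers, one way to experiment with this (just a sketch with illustrative parameter values, not a drop-in for your code) is parallel_bulk, which lets you control both the bulk size and the number of client-side indexing threads:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://xx.xx.xx.xx:9200"], timeout=60)  # placeholder host

def actions():
    # Placeholder generator standing in for your ~1.5 KB documents.
    for i in range(100000):
        yield {"_index": "my-index", "_type": "doc", "_source": {"n": i, "payload": "x" * 1500}}

# Start with fewer concurrent threads and smaller bulks, then increase gradually.
for ok, item in helpers.parallel_bulk(es, actions(), thread_count=4, chunk_size=1000,
                                      raise_on_error=False):
    if not ok:
        print("failed:", item)
```

You can then raise chunk_size and thread_count step by step while watching for timeouts and rejections.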

Not sure I understand what you are looking for. Can you please clarify?

Have you got X-Pack monitoring installed? This will allow you to see what goes on in the cluster.

@Christian_Dahlqvist

Is there any deterministic way of calculating the bulk size if my requests are of the same size throughout?

I meant: can I calculate the optimal bulk size with a formula (using parameters like queue size, etc.) if my request size is always the same? Since no parameters are changed dynamically during my benchmark run, I don't understand why requests start timing out after 1-2 hours.

Have you got X-Pack monitoring installed? This will allow you to see what goes on in the cluster.

Yes, I have X-Pack installed, but I couldn't find any metrics in the displayed graphs that clearly show why requests are timing out. When I run my application, I want to detect in advance whether a bulk request will time out or whether I am reaching the limits of ES. Can you help me find metrics or stats that can alert me to this situation so that I can take action (reduce the load)?

Are the index memory and segment count metrics important here?
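
For example, I was thinking of polling something like the nodes stats API from my application and backing off when the bulk thread pool queue grows or rejections appear (a rough sketch; the thresholds are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://xx.xx.xx.xx:9200"])  # placeholder host

def bulk_pressure(client):
    """Return (max queue depth, total rejections) across nodes for the bulk thread pool."""
    stats = client.nodes.stats(metric="thread_pool")
    max_queue, rejected = 0, 0
    for node in stats["nodes"].values():
        pools = node["thread_pool"]
        # The pool is called "bulk" on the version I am running; newer versions call it "write".
        pool = pools.get("bulk") or pools.get("write", {})
        max_queue = max(max_queue, pool.get("queue", 0))
        rejected += pool.get("rejected", 0)
    return max_queue, rejected

queue, rejected = bulk_pressure(es)
if queue > 500 or rejected > 0:  # made-up thresholds
    print("back off: queue=%d rejected=%d" % (queue, rejected))
```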

Benchmarking is the best way to determine this. Is there anything in the logs, e.g. related to GC or merging activity, around the time it slows down? Are you supplying your own document IDs or are you allowing Elasticsearch to assign them?
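
By benchmarking I mean something as simple as indexing a fixed batch of representative documents at different bulk sizes and measuring throughput, e.g. (a sketch with placeholder host, index, and document contents):

```python
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://xx.xx.xx.xx:9200"], timeout=120)  # placeholder host

def sample_docs(n):
    # Placeholder documents of roughly the size you index in production.
    return [{"_index": "bulk-bench", "_type": "doc", "_source": {"payload": "x" * 1500}} for _ in range(n)]

docs = sample_docs(50000)
for chunk_size in (500, 1000, 2000, 5000, 10000):
    start = time.time()
    helpers.bulk(es, docs, chunk_size=chunk_size, request_timeout=120)
    rate = len(docs) / (time.time() - start)
    print("chunk_size=%5d -> %.0f docs/s" % (chunk_size, rate))
```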

@Christian_Dahlqvist
No, I couldn't find anything specific. In the monitoring display, total index memory is shown as 143 MB, while the total segment count per node is around 1000. The machine has a 32 GB heap and an 8-core CPU.

Forgot to add: I am not supplying my own document IDs; I am allowing Elasticsearch to assign them.

How many shards are you actively indexing into? What kind of storage do you have? What does disk I/O and iowait look like?

@Christian_Dahlqvist I am writing into 10 indices, each with 5 shards. I have a local SSD for which dd shows around 300 MB/s.

Most of the time, iowait is negligible, with around 90% CPU utilization:

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              93.44    0.00    3.78    0.00    0.00    2.77

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    nvme0n1           0.00     0.00 1391.00    0.00  5692.00     0.00     8.18     0.19    0.14    0.14    0.00   0.02   2.40

Sometimes iowait increases to 30% when disk utilization is at 100% (I see a lot of reads per second during that time):

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              42.07    0.00    2.27   20.53    0.00   35.14

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    nvme0n1           0.00     0.00  593.00 2264.00  2744.00 248156.00   175.64   428.00  180.39   80.03  206.68   0.35 100.00

Is it because of merges happening in the background?
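
If it helps with debugging, I assume I can check merge activity with something like the index stats API (sketch; host is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://xx.xx.xx.xx:9200"])  # placeholder host

# Merge stats across all indices; "current" > 0 means merges are running right now.
merges = es.indices.stats(metric="merges")["_all"]["total"]["merges"]
print("merges in progress:", merges["current"])
print("total merges:", merges["total"])
print("time spent merging (ms):", merges["total_time_in_millis"])
print("throttled time (ms):", merges["total_throttled_time_in_millis"])
```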
