Troubleshoot "low" Logstash -> ES indexing rate

Hello,

I have recently been having indexing issues with Elasticsearch 5.1.

In Logstash 5.2, I see a lot of errors like this:

[2017-03-06T15:39:12,413][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$6@11b38fce on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4a73e0e4[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 329787066]]"})
[2017-03-06T15:39:12,414][ERROR][logstash.outputs.elasticsearch] Retrying individual actions

When I increase the number of pipeline workers (from 1 to 10 or 40), nothing changes.
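
For reference, here is roughly how I am setting those values when testing (Logstash 5.x command-line flags; the config path is just my local one):

# -w sets pipeline.workers, -b sets pipeline.batch.size
bin/logstash -f /etc/logstash/conf.d/pipeline.conf -w 10 -b 125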

I don't see any issue with the hardware:

top - 15:53:38 up 34 days,  4:08,  3 users,  load average: 2.05, 1.63, 1.50
Tasks: 468 total,   2 running, 466 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us,  0.8 sy,  0.0 ni, 94.6 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 32644560 total,  2012804 free, 19934180 used, 10697576 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11892140 avail Mem

iotop shows me that my I/O usage is 7%.

iostat:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
      10.80    0.00    1.41    0.13    0.00   87.67

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             502.00         0.00      8872.00          0       8872
sdb             204.00         0.00      2280.00          0       2280
dm-0              0.00         0.00         0.00          0          0
dm-1              0.00         0.00         0.00          0          0
dm-2           1703.00         0.00      8872.00          0       8872
dm-3            521.00         0.00      2280.00          0       2280

JVM memory usage is only around 40%.

Since I'm not limited by I/O, CPU, RAM or JVM memory, I can't see where the bottleneck is. My indexing latency is ~0.30 ms (0.20 ms when I'm not sending documents). Furthermore, my index rate shows "holes" every minute to minute and a half (values at 0 docs/s).

I've read this page (https://www.elastic.co/guide/en/elasticsearch/reference/5.1/tune-for-indexing-speed.html) but I still can't find the bottleneck.

I could of course increase the queue size in ES, but I've read that it's not a solution for sustained throughput, as it only delays the issue rather than fixing it.
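
For what it's worth, the setting I am thinking of is, if I read the docs correctly, the bulk queue in elasticsearch.yml (the value below is only an illustration, I have not applied it):

# elasticsearch.yml - static setting, requires a restart of the node
thread_pool.bulk.queue_size: 200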

Do you have any idea how I can troubleshoot this?

Regards,
Grégoire

The problem is at the ES end. That's what this part of the error log means:

response code: 429 ({"type"=>"es_rejected_execution_exception"

ES is rejecting Logstash's submissions. Increasing workers on the Logstash end is just going to make it worse... you're hitting ES even harder :slight_smile:

Are you monitoring ES? What do the ES logs say? How are your ES nodes doing? Is the cluster healthy?

Depending on what your ES metrics tell you, you have a few options. Most likely adding more ES nodes.
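
If you can curl one of the nodes, the cat thread pool API should show the rejections directly (host name is a placeholder):

# bulk thread pool activity and rejection count per node
curl -s 'http://host:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected'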

Hello,

Yes, I'm using Marvel/Cerebro to monitor ES and I don't see anything in the logs.

JVM memory never exceeds 75%, the CPU is barely used (~5%), most of my RAM is still free and there are no I/O issues. Cluster health is green with 55% JVM heap usage.

I really don't see what the issue could be here. What would slow ES down? Is there a documents-per-second threshold above which it's better to add more nodes?

Regards,
Grégoire

What does your Elasticsearch output plugin config look like? Do you have more than one Elasticsearch output configured?

Hello,

I have only one:

elasticsearch {
    index => "indice-name-%{+YYYY.MM.dd}"
    hosts => ["http://host:9200"]
}

Whether I send the requests to one host or to all of them (I have a 10-node cluster) does not change the indexing rate.
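
The multi-host variant I tested looks roughly like this (host names are placeholders):

elasticsearch {
    index => "indice-name-%{+YYYY.MM.dd}"
    hosts => ["http://host1:9200", "http://host2:9200", "http://host3:9200"]
}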

Regards,
Grégoire

Are you running LS on ES? How many LS instances are you running? How many data ES nodes, client nodes, master nodes?

Hello,

I have only one Logstash instance, which runs on separate hardware. I have 10 ES data nodes, 2 of which are also master nodes, plus 1 remote dedicated master node and 6 client nodes (for Kibana).

Regards,
Grégoire

This sounds very odd. Are all the nodes in the cluster located in the same data centre? Why do you have a remote master node? How many shards are you indexing into? What is the specification of your data nodes?

Hello,

Indeed, all the nodes are in the same data centre, on the same LAN but in two different racks. I'm indexing into 20 primary shards (40 shards in total with replicas); each node has two disks. I also have ~300 other shards in other indices, but I don't index into them.

Some specs for each of the 10 data nodes (they are all identical):

32 GB RAM; 16 GB heap; 2 disks of 450 GB; 40 CPU cores. I don't know the exact IOPS capacity, but as I'm only using about 5% of it... I'm pretty sure the bottleneck is not there.

One of the most surprising things is that there are holes in the indexing rate graph, as if for several seconds there was no indexing at all.

With an indexing latency of 0.3 ms and batches of 125 events, I would expect to be able to index ~416,000 events/s (batch_size / latency in seconds).

Where in the indexing process is ES just waiting?

I tried changing index.refresh_interval to 2s, without any success.

Regards,
Grégoire

Unless you are indexing massive amounts of data every day, it sounds like you have oversharded quite a bit. I would recommend trying to reduce the number of shards to e.g. 5. With 1 replica configured, that should let all nodes participate in indexing. At the same time, increase the internal batch size in Logstash to e.g. 1000. This will cause larger batches to be indexed, which is generally more efficient.
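
A minimal sketch of what that could look like, assuming your daily indices match indice-name-* and that their settings come from an index template (template name, pattern and host are placeholders):

# template applied to new daily indices: 5 primary shards, 1 replica
curl -XPUT 'http://host:9200/_template/indice-name' -d '{
  "template": "indice-name-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'

# then restart Logstash with a larger batch size, e.g.
# bin/logstash -f pipeline.conf -b 1000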

I would also like to know what indexing rate you are seeing in your monitoring, and whether there are any other processes indexing into the cluster that could be helping to fill up the queues.

That is a lot of shards. I am surprised that there are no telling messages in your ES logs. I would expect one or more ES nodes to hit long GCs and possibly OOM.

Reducing your index.refresh_interval is making things worse :wink:

It tells ES to do more I/O. Try increasing it instead; I have mine set to 60s during indexing. And, as suggested by @Christian_Dahlqvist, increase your batch size.
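
refresh_interval is a dynamic index setting, so you can change it on the index you are currently writing to without a restart, for example (index name is a placeholder):

# raise the refresh interval on the current daily index while bulk indexing
curl -XPUT 'http://host:9200/indice-name-2017.03.06/_settings' -d '{
  "index": { "refresh_interval": "60s" }
}'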

Hello,

I was not clear: on each node, I have 2 primary shards + 2 replica shards (1+1 per disk) for each index. Only the latest index receives writes, so only 2 primary shards are indexing on each node.

In total, that gives me 20+20 indexing shards, plus a number of non-indexing shards that depends on my data retention.

I'm currently running tests with batch_size increased to 1000.

If I remember correctly, the index.refresh_interval default is 1s, so I increased it to 2s.

Regards,
Grégoire

I also have ~300 other shards in other indices, but I don't index into them.

Are those non-indexing shards closed? If not, they will still use resources. Hence our comments that there are too many shards.

I remember refresh_interval used to default to 5s (perhaps I'm remembering wrong), which is why I thought you had lowered it. Regardless, 2s is too low IMHO. The recommendation is to set it to -1 to disable refreshes, or to a fairly high value, in order to reduce I/O while doing bulk indexing.
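
For what it's worth, closing and later reopening an old daily index looks like this (index name is just an example):

# close an old index to free its resources
curl -XPOST 'http://host:9200/indice-name-2017.02.20/_close'

# reopen it later if you need to search it again
curl -XPOST 'http://host:9200/indice-name-2017.02.20/_open'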

Hello,

I'm currently running some benchmarks.

If I understood correctly, closing an index prevents both reads and writes on it, right? That's not really an option for me: I have one index per day and I always want to be able to read data from 10 days ago, even if the shards of those indices will never be updated again.

By the way, regarding the regular holes in the Marvel graphs in Kibana: could that be an issue with Marvel/Kibana, or do I really have ~10 seconds with 0 updates every minute? What could cause this behaviour?

Regards,
Grégoire

Hello,

After some benchmarks I have the following conclusions:

  1. When I increase the batch size in Logstash, I'm actually CPU-bound on that machine. So I ran some benchmarks on Logstash and found that a batch size of 50 was very efficient, allowing me to process 22k events/s (without sending them to Elasticsearch).

  2. In order to fix the errors, I took a look at the thread pool configuration and made the following changes (see the sketch after this list):

  • I increased the processors setting to 40 (I've read in the docs that it's limited to 32 by default because of some max-processes limits, but I'm not concerned by those limits).
  • I increased the bulk queue size to 500 (100 would probably have been enough).
  3. As a batch size of 50 caused too much I/O (I have between 1 and 2% of my CPU time in iowait), I increased it to 250, which seems to be the right balance between Logstash performance and ES performance, and gives me a little more than 16,000 events/s (before all these changes, I was limited to 12,000-13,000 events/s).
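
Concretely, the changes from point 2 look like this in elasticsearch.yml on each data node (both are static settings, so the nodes needed a restart):

# elasticsearch.yml
processors: 40                    # lift the default cap of 32 detected processors
thread_pool.bulk.queue_size: 500  # larger bulk queue to absorb bursts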

Finally, 1 or 2% iowait is not that much, and I don't use much CPU either. Would it make sense to add more threads to the bulk thread pool, or is it not recommended at all to exceed the number of cores?

Thank you,
Regards,
Grégoire

Hello,

I figured out that one of my issues with Elasticsearch could be due to unnecessary I/O. I index ~32,000 events/s (2 Logstash instances at 16,000 events/s each), each event being 0.5 kB (I recently reduced it), which means less than 130 Mbit/s. As I have 10 nodes and 20 indexing shards (+20 replicas), I expect a few dozen Mbit/s of network traffic on Elasticsearch.

For disk writes, I expect: (130 Mbit/s / 10) * 4 = 52 Mbit/s, i.e. ~6.5 MB/s per node.

What I actually measure is between 30 and 40 Mbit/s of network usage and ~30 MB/s of disk writes (iotop).

I took these measurements on a single node.

I am aware there is unavoidable overhead because of sharding and the cluster, but 80% overhead seems like a lot, so I guess I have misconfigured something. I struggle to understand where this high disk-write usage comes from.

Is there something fundamental I did not understand, or a misconfiguration?

Note: swapping is obviously disabled.
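
In case it helps with the diagnosis, this is what I plan to look at next; if I understand the node stats API correctly, the merges, refresh and translog sections should show where the extra disk writes come from (host is a placeholder):

# per-node index-level stats, including merges, refresh and translog activity
curl -s 'http://host:9200/_nodes/stats/indices?pretty'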

Thank you,
Regards,
Grégoire

Have you tried changing your refresh rate? What do you get if you increase it by a lot, from 1s to 60s? What do you get if you disable it while indexing, e.g. set it to -1?

A short refresh interval means more refreshes, and hence more I/O.

Tin
