Timeline & Visualize Request Timeout after 30000ms in Kibana

Hey there,

I'm about 4 days new to ELK; very green to all this. Elasticsearch, Kibana and Logstash are all version 6.2.4, all running on one Debian 9.4 VM. I guess that means my cluster contains only a single node.

I've been experimenting with ElastiFlow to use this stack for netflow data collection and analysis. 24 hours in, I've collected over 57GB of flow data across more than 65 million flow records (documents?).

root@docker:/home/jlixfeld# curl 'localhost:9200/_cat/indices?v'
health status index                 uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana               GK1TSIs6SSen6HWo40SRTQ   1   0        218            4    141.9kb        141.9kb
yellow open   elastiflow-2018.04.22 NQkSZ44NQuKt5bw46KT1aw   2   1   40073930            0       20gb           20gb
yellow open   elastiflow-2018.04.21 L74S_otgTouE3lFnPW36Gg   2   1   29043943            0     14.5gb         14.5gb
root@docker:/home/jlixfeld#

ElastiFlow has all sorts of dashboards, and I can pretty reliably look at a 4-hour time range in any of them. Trying to select a 12-hour or longer time range, however, results in Kibana timeouts for Timeline and Visualize.

The VM has 4 vCPUs (2 x E5-2630 v4 @ 2.20GHz), and 64GB of memory allocated to it, and /var/lib/elasticsearch is on SSDs.

Load average seems kind of high when I try to run these dashboards, despite all the memory and the SSDs that I've thrown at this.

top - 13:39:33 up  1:02,  2 users,  load average: 7.86, 4.75, 4.49
Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
%Cpu(s): 64.8 us,  0.2 sy, 34.8 ni,  0.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65982164 total, 18950704 free, 36122940 used, 10908520 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 29090220 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  667 elastic+  20   0 70.166g 0.034t 2.117g S 251.8 55.9  61:06.38 java
  618 logstash  39  19 5065348 1.339g 330020 S 146.8  2.1 115:01.27 java
  637 kibana    20   0 1263816  97728  21140 S   0.3  0.1   0:16.71 node

I've read some things on tuning ES, and as near as I can tell, Debian/systemd already has most of the performance tuning defaults set. For the rest, here's what I've got:

#/etc/default/elasticsearch: ES_JAVA_OPTS="-Xms32766m -Xmx32766m -Des.enforce.bootstrap.checks=true"

#/usr/lib/systemd/system/elasticsearch.service: LimitNOFILE=65536

#/etc/systemd/system/elasticsearch.service.d/override.conf: LimitMEMLOCK=infinity

#/etc/fstab: #/dev/mapper/docker--vg-swap_1 none            swap    sw              0       0

I've done no other tuning or configuration anywhere.
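
For what it's worth, these are the sorts of checks I've found for confirming the heap and memory-lock settings actually take effect; the greps are just rough filters on the JSON output:

curl 'localhost:9200/_nodes/jvm?pretty' | grep heap_max
curl 'localhost:9200/_nodes/process?pretty' | grep mlockall
curl 'localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max'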

I'm not really too sure where to go from here to try and get it to perform better. The end game right now is to collect as much data as I can until I run out of space. I have more spinny-disk capacity than SSD, so if I can get away with putting this on spinny disks, I could collect 1.5TB; if I/O turns out to be contributing to the bottleneck, I can only hold onto about 750GB. That said, I have experimented with both spinny disks and SSDs, and I get the Kibana timeouts regardless.

This data is mostly unimportant (no need to be replicated somewhere else) and it won't be looked at very often. Maybe monthly. But it's important to be able to load dashboards reliably (and relatively quickly) when required.

Any insights?

Hi @jlixfeld,

If you open the Network tab in your browser's developer tools on the visualization page, you'll see a _msearch request. Copy the request payload and paste it into the Console (as the body for POST _msearch). Let me know if you still see the timeout. If so, I'd ping the ES team for support, as it would be something entirely on their end. If you don't see the same timeout, then it is probably something within Kibana.
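
For reference, the pasted request will look roughly like this in Console; the index pattern, query and aggregation below are only illustrative, and your dashboard's payload will be much larger:

POST _msearch
{"index":"elastiflow-*","ignore_unavailable":true}
{"size":0,"query":{"range":{"@timestamp":{"gte":"now-12h"}}},"aggs":{"flows":{"date_histogram":{"field":"@timestamp","interval":"10m"}}}}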

Thanks

It looks like CPU is quite saturated as you are hosting both Elasticsearch and Logstash on the same machine. Based on the Logstash CPU usage it also looks like you are doing a fair bit of indexing at the same time you are querying.

In addition to the data available through top, it would be interesting to see what disk I/O and iowait looks like when a long-running query is taking place.
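
Something along these lines while one of the long-running dashboard queries is in flight would be enough (the 5-second interval is arbitrary):

iostat -x 5     # per-device %util and await
vmstat 5        # the "wa" column shows iowait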

Hi there Chris,

I'm having trouble getting the info you're asking for. The _msearch request seems blank, but it's entirely likely I'm misinterpreting your ask; this is probably the 3rd time in my life that I've tried to do something with dev tools in a browser.

Maybe the screenshot will help? :confused:


Hi Christian,

There doesn't seem to be any I/O wait during a long-running search, looking at iostat and top. I'm trying to install Metricbeat to get some statistics around that.

Excellent. Just wanted to check. As you are using SSDs it should not generally be an issue.

Clearly your CPU is saturated. What is interesting is that it is Elasticsearch that is consuming so much CPU instead of Logstash. The Logstash codec that decodes flow records is fairly resource intensive and usually dominates CPU consumption on a single node install compared to the other components of the stack. So let's go through a few things...

  1. There is no way that you need even close to 32GB of heap for your data rates. It also looks like you are dangerously close to the 32GB heap barrier, above which issues can occur; most users never go higher than 31GB just to be safe. That said, even though the recommendation of half of RAM for the Elasticsearch heap persists, all of my experience suggests this is too much for most time-series use cases like flows. This is even more important when all components of the stack are on the same server. You can certainly go down to 24GB, and likely as far down as 16GB (see the example settings after this list). This frees up more RAM for the OS page cache, which keeps more data in memory and improves query performance.

  2. How much heap did you give to Logstash? For ElastiFlow, and given the RAM you have available, give it 2GB (both Xms and Xmx). Given the %MEM of Logstash, I am guessing you have it set much lower.

  3. Your I/O Wait is 0% so that is fine. Investing in SSDs was a wise choice. If you go to spinning disk you will need multiple disks. A single drive would be I/O bound at your flow rates. You need either 4xHDD RAID-0 or 8xHDD RAID-10 (if you need redundancy).

  4. In Kibana's advanced settings, turn off highlighting (doc_table:highlight). Its negative impact on Kibana performance can be quite significant.

  5. Given your high CPU utilization it wouldn't surprise me if you are dropping incoming data due to full network buffers. This can happen due to back pressure when Elasticsearch or Logstash can no longer handle the incoming data. At 100% CPU utilization this is undoubtedly the case. Run netstat -su to check for this.

  6. Finally, I suspect you will need to consider moving Logstash to its own VM. You can start by giving it 4 vCPUs and perhaps increase that to 8. There isn't any reason to go beyond 8 vCPUs; Logstash doesn't scale very well vertically. You will be better off adding additional instances and spreading the load across them.
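
To put rough numbers on points 1, 2 and 4, the changes would look something like the following. The paths are the Debian package defaults and the heap sizes are just the values discussed above, so adjust to your own install (and restart Elasticsearch and Logstash after changing the JVM settings):

# /etc/default/elasticsearch
ES_JAVA_OPTS="-Xms16g -Xmx16g -Des.enforce.bootstrap.checks=true"

# /etc/logstash/jvm.options
-Xms2g
-Xmx2g

# Kibana: Management > Advanced Settings
doc_table:highlight = false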

I hope that helps.

Rob (creator of ElastiFlow)

Robert Cowart (rob@koiossian.com)
www.koiossian.com
True Turnkey SOLUTIONS for the Elastic Stack

Hi Rob,

I'm surprised that you're surprised by this. This tells me that my understanding of how ELK works is totally back-asswards :slight_smile:

I assumed that it was totally normal for Elasticsearch to be using the majority of the CPU cycles, because Kibana was querying it to produce the dashboard data.

If I'm interpreting your point about Logstash correctly, it's because Logstash takes the original flows in and parses them into ElastiFlow records that you'd expect it to be the more CPU-intensive component, or are there other reasons why that would be so?

Given my observations about how Elasticsearch was consuming so much CPU, I started to tune based on those observations. That led me to throw a bunch of RAM at it and adjust the heap. The heap barrier was relative to some documentation I reviewed which suggested that Java 1.8 could safely handle a heap value of 32766 (now, for the life of me, I can't find that performance tuning doc I was referring to).

I dropped this back down to 16G.

Logstash heap was at its default settings. Based on my observations about where the load (mistakenly so, it seems?) was, and what I read about tuning, I didn't see any indication that there was a need to adjust the Logstash heap.

I adjusted it to 2G.

Yeah. Given that I was previously using spinny disks with nfsen, and it took forever to get any good results, SSDs were a given here.

Done (although it hasn't resolved the issue, I didn't expect it would, given the rest of this dialog).

Indeed:

root@docker:/etc/logstash/elastiflow/conf.d# netstat -su | egrep "packets received|packet receive errors|receive buffer errors"
    6935147 packets received
    1302525 packet receive errors
    1302525 receive buffer errors

I'll give this a shot, but the VMware license I have only permits 4 vCPUs per VM, so if 4 isn't enough, I'll have to create two instances and split the load between them.

Just to give you an idea from a customer project I recently deployed...

Elasticsearch CPU: 42.5%
Logstash CPU: 581.7%

They are ingesting 3,000 flows/sec at the moment, and should be able to reach 6,000-7,000/sec on the current server before we begin to expand the environment.

The server has 96GB of RAM, and 24GB dedicated to Elasticsearch JVM heap. This 24GB is arguably total overkill.

This chart is over 4 days and shows a beyond-healthy heap (you want to see that regular sawtooth pattern).

NOTE: This deployment is actually the "big brother" of ElastiFlow, which leverages a more sophisticated schema which allows for seamless integration with firewalls, access logs, IDS logs and other "connection-related" sources. It also does more processing and enrichment of the data. So it requires more resources than ElastiFlow.

Interesting. Have you noticed any patterns in your experience that might suggest a somewhat reasonable fps:cpu_clock_cycle ratio that could be useful for helping determine when one should add more logstash instances?

... Or maybe that's an impossible question, given the number of variables - I/O, CPU bus speed, memory bus speed, etc?

I'm not generally a server guy, so please be gentle :slight_smile:

The netflow codec on the udp input is what uses the most CPU. The guidelines provided by the developer are:

CPUs   Workers   Flow/s
   1         1     2300
   2         2     4300
   4         4     6700
   8         8     9100
  16        16    15000
  32        32    16000

You can see how flows per core goes down as the number of cores increases; at 32 cores you're basically wasting 16 of them. IMO the sweet spot is 4 (maybe sometimes 8) cores. That provides some concurrency while balancing performance against the deployment headache.
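
For reference, the worker count is set on Logstash's UDP input. A minimal sketch of an ElastiFlow-style input with 4 workers would look roughly like this; the port and queue size shown are common values, not necessarily what your config uses:

input {
  udp {
    port       => 2055
    workers    => 4
    queue_size => 4096
    codec      => netflow
  }
}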

To really maximize performance and eliminate drops at high event rates, you will probably have to tune the Linux kernel networking parameters as well. For example, the system I mention above was losing about 20% of incoming UDP packets until I applied some kernel tuning; afterward the loss was almost totally eliminated. Of course, that isn't even an issue to worry about until you get your CPU bottleneck solved.
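
As a rough starting point, the knobs involved are the kernel's socket receive buffer limits. Something along these lines in /etc/sysctl.conf (applied with sysctl -p) is typical, but the exact values depend on your flow rate and are only illustrative here:

# allow larger UDP receive buffers so bursts of flow packets aren't dropped
net.core.rmem_max = 33554432
net.core.rmem_default = 262144
net.core.netdev_max_backlog = 4096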
