Logstash elasticsearch output plugin takes extremely long time

Could someone help identify the problem and advise? Thanks in advance.
The following Logstash stats were obtained from the Logstash monitoring API:
curl -X GET http://127.0.0.1:9600/_node/stats
Cluster configuration is as follows:
1 coordinating node,
1 master node,
2 data nodes.
Logstash & Elasticsearch version 5.3.1.
CentOS 7 hosts.
Each node runs on a VMware VM with 2 vCPUs, 32GB RAM (16GB allocated to Elasticsearch), and a 1TB HDD.
Logstash runs on the coordinating node, and the elasticsearch output also points at the coordinating node.
Daily index, 5 shards, 1 replica.
Input is syslog over UDP from a Sophos firewall, 1400 events/second.
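For reference, the output section has roughly this shape (a minimal sketch; the hosts value and index name pattern are assumptions, only the structure matters):

output {
  elasticsearch {
    # points at the coordinating node that Logstash runs on (port 9200 assumed)
    hosts => ["127.0.0.1:9200"]
    # daily index; the exact name pattern here is a placeholder
    index => "syslog-%{+YYYY.MM.dd}"
  }
  # a stdout output is also present, as seen in the stats below
  stdout { }
}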
If I comment out the elasticsearch output plugin, all events are consumed.
But with the elasticsearch output plugin enabled, throughput degrades to 130 events/second.
The stats below show the elasticsearch output plugin with a duration of 1847441 ms, compared to the second-longest plugin, the kv filter, at 38864 ms.

{
"host": "ELK3-163281032",
"version": "5.6.3",
"http_address": "127.0.0.1:9600",
"id": "...f",
"name": "ELK3-163281032",
"jvm": {
"threads": {
"count": 25,
"peak_count": 26
},
"mem": {
"heap_used_percent": 11,
"heap_committed_in_bytes": 284299264,
"heap_max_in_bytes": 1056309248,
"heap_used_in_bytes": 120045568,
"non_heap_used_in_bytes": 96202840,
"non_heap_committed_in_bytes": 102518784,
"pools": {
"survivor": {
"peak_used_in_bytes": 17432576,
"used_in_bytes": 14982408,
"peak_max_in_bytes": 17432576,
"max_in_bytes": 17432576,
"committed_in_bytes": 17432576
},
"old": {
"peak_used_in_bytes": 106202536,
"used_in_bytes": 74550752,
"peak_max_in_bytes": 899284992,
"max_in_bytes": 899284992,
"committed_in_bytes": 127275008
},
"young": {
"peak_used_in_bytes": 139591680,
"used_in_bytes": 30512408,
"peak_max_in_bytes": 139591680,
"max_in_bytes": 139591680,
"committed_in_bytes": 139591680
}
}
},
"gc": {
"collectors": {
"old": {
"collection_time_in_millis": 2312,
"collection_count": 25
},
"young": {
"collection_time_in_millis": 5762,
"collection_count": 140
}
}
},
"uptime_in_millis": 501144
},
"process": {
"open_file_descriptors": 57,
"peak_open_file_descriptors": 60,
"max_file_descriptors": 4096,
"mem": {
"total_virtual_in_bytes": 3753644032
},
"cpu": {
"total_in_millis": 236320,
"percent": 37,
"load_average": {
"1m": 0.39,
"5m": 0.41,
"15m": 0.53
}
}
},
"pipeline": {
"events": {
"duration_in_millis": 1927823,
"in": 67748,
"out": 67246,
"filtered": 67746,
"queue_push_duration_in_millis": 966360
},
"plugins": {
"inputs": [
{
"id": "...",
"events": {
"out": 67748,
"queue_push_duration_in_millis": 966360
},
"name": "udp"
}
],
"filters": [
{
"id": "...",
"events": {
"duration_in_millis": 12652,
"in": 67746,
"out": 67746
},
"matches": 67746,
"patterns_per_field": {
"message": 1
},
"name": "grok"
},
{
"id": "...",
"events": {
"duration_in_millis": 1939,
"in": 67746,
"out": 67746
},
"name": "mutate"
},
{
"id": "...",
"events": {
"duration_in_millis": 0,
"in": 0,
"out": 0
},
"name": "drop"
},
{
"id": "...",
"events": {
"duration_in_millis": 0,
"in": 1,
"out": 1
},
"name": "mutate"
},
{
"id": "...",
"events": {
"duration_in_millis": 2481,
"in": 67746,
"out": 67746
},
"name": "mutate"
},
{
"id": "...",
"events": {
"duration_in_millis": 38864,
"in": 67746,
"out": 67746
},
"name": "kv"
},
{
"id": "...",
"events": {
"duration_in_millis": 0,
"in": 27,
"out": 27
},
"name": "mutate"
},
{
"id": "...",
"events": {
"duration_in_millis": 3565,
"in": 67746,
"out": 67746
},
"name": "mutate"
},
{
"id": "...",
"events": {
"duration_in_millis": 1456,
"in": 67746,
"out": 67746
},
"name": "ruby"
},
{
"id": "...",
"events": {
"duration_in_millis": 2544,
"in": 67718,
"out": 67718
},
"name": "mutate"
},
{
"id": "...",
"events": {
"duration_in_millis": 6244,
"in": 67746,
"out": 67746
},
"name": "mutate"
},
{
"id": "...",
"events": {
"duration_in_millis": 2705,
"in": 67746,
"out": 67746
},
"matches": 67746,
"name": "date"
}
],
"outputs": [
{
"id": "...",
"events": {
"duration_in_millis": 1847441,
"in": 67746,
"out": 67246
},
"name": "elasticsearch"
},
{
"id": "...",
"events": {
"duration_in_millis": 589,
"in": 67246,
"out": 67246
},
"name": "stdout"
}
]
},
"reloads": {
"last_error": null,
"successes": 0,
"last_success_timestamp": null,
"last_failure_timestamp": null,
"failures": 0
},
"queue": {
"type": "memory"
},
"id": "main"
},
"reloads": {
"successes": 0,
"failures": 0
},
"os": {
"cgroup": {
"cpuacct": {
"usage_nanos": 19844737287582,
"control_group": "/"
},
"cpu": {
"cfs_quota_micros": -1,
"control_group": "/",
"stat": {
"number_of_times_throttled": 0,
"time_throttled_nanos": 0,
"number_of_elapsed_periods": 0
},
"cfs_period_micros": 100000
}
}
}
}

What do CPU usage, disk I/O and iowait look like on the Elasticsearch data nodes while indexing? And what does CPU usage on the Logstash node look like when outputting to Elasticsearch via the coordinating node?
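For example, something like the following would capture them (the host name is a placeholder):

# on each data node: CPU, iowait and per-device disk utilisation, sampled every 5 seconds
iostat -x 5

# Elasticsearch-side view of OS, filesystem and bulk thread pool stats for a data node
curl -s 'http://<data-node>:9200/_nodes/stats/os,fs,thread_pool?pretty'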

Hi Christian,
CPU usage on the Logstash node is around 40%.
I will try to collect more stats from the data nodes next Monday for your reference.

BTW, will Elasticsearch throttle the Logstash output while it is busy?

Thanks for your kind help.

Logstash will only output as fast as the slowest output, although it can buffer internally if you have a recent version and use persistent queues.
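For what it is worth, enabling the persistent queue is a couple of lines in logstash.yml (the size below is a placeholder, not a sizing recommendation):

# logstash.yml: switch from the default in-memory queue to the disk-backed queue
queue.type: persisted
# cap on the on-disk queue before back-pressure is applied upstream (placeholder value)
queue.max_bytes: 4gb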

Hi Christian,

I use Zabbix to collect Logstash and Elasticsearch performance metrics via the management APIs, mainly _nodes/stats. The sampling interval is 1 minute.
log-cedu-02 and log-cedu-05 are both data nodes, while log-cedu-01 is the coordinating node on which Logstash runs.
2-hour Logstash performance from 14:00–16:00 on 2017/11/13:


It seems that Logstash may be blocked at times (the pipeline event count drops to 0).

The 2-hour cluster summary is as follows:


Average indexing latency corresponds to _all.total.indexing.index_time_in_millis, converted to the per-minute difference between two consecutive samples.
Total rate corresponds to _all.total.indexing.index_total, converted the same way.
Data node indexing is not continuous; it happens in batches. Every time there is an ingest event, the data node takes time to consume the batch, and input is blocked during this period.
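For reference, the two counters are read roughly like this (the port is an assumption; log-cedu-01 is the coordinating node), and the charted values are the per-minute deltas between consecutive samples:

curl -s 'http://log-cedu-01:9200/_stats/indexing?filter_path=_all.total.indexing'
# with the 1-minute sampling interval:
#   total rate               = index_total(t) - index_total(t-1)
#   average indexing latency = index_time_in_millis(t) - index_time_in_millis(t-1)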
The 12-hour performance is as follows:

[Charts not shown: Refresh; Merge; 2 hour GC/traffic/disk; 12 hour GC/traffic/disk (labelled P5–P7)]

Please help advise where things may be going wrong.

Thanks,
Shaowen
