Infrastructure details:
3 * i3.xlarge instances for ES
1 * t2.medium instance for Logstash
Volume type: io1

Software details:
Elasticsearch Docker image: 6.3.0
Logstash: 5.3
Beats: 5.3
Requests: 350/sec
Purpose: logging
We are migrating from the managed Elasticsearch Service provided by AWS to Elasticsearch running on Kubernetes. To send data to the Elasticsearch cluster provisioned on Kubernetes we use Logstash. Below is the relevant output configuration:
output {
  amazon_es {
    hosts => ["managed-aws-es"]
    region => "ap-southeast-1"
    index => "logstash-%{type}-%{+YYYY.MM.dd}"
    flush_size => 10
  }
  elasticsearch {
    hosts => ["kubernetes-es:9200"]
    index => "logstash-%{type}-%{+YYYY.MM.dd}"
  }
}
When we start sending data to both clusters, ingestion becomes slow.
My understanding is that Logstash cannot keep up with sending data to both clusters, which builds backpressure inside Logstash, and so logs are shipped slowly to both Elasticsearch clusters. In my opinion the backpressure builds up because Elasticsearch is not able to process the incoming requests fast enough.
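If it helps, this is the kind of check I had in mind to confirm the backpressure theory: poll the Logstash node stats API and the Elasticsearch thread pool stats and watch for a growing gap between events in/out and for bulk queueing/rejections. A minimal sketch only; the host names and ports are placeholders for our setup:

import json
import urllib.request

# Placeholders for our setup; adjust to the real Logstash / Elasticsearch endpoints.
LOGSTASH_API = "http://localhost:9600"
ES_API = "http://kubernetes-es:9200"

def get_json(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Logstash pipeline stats: if "in" keeps growing faster than "out",
# the outputs are not keeping up and backpressure builds.
events = get_json(LOGSTASH_API + "/_node/stats/pipeline")["pipeline"]["events"]
print("logstash events: in=%(in)s out=%(out)s duration_ms=%(duration_in_millis)s" % events)

# Elasticsearch thread pool stats: queued/rejected bulk requests indicate
# the cluster cannot process the incoming indexing requests fast enough.
nodes = get_json(ES_API + "/_nodes/stats/thread_pool")["nodes"]
for node in nodes.values():
    for pool_name in ("bulk", "write"):  # the pool was renamed around ES 6.3
        pool = node["thread_pool"].get(pool_name)
        if pool:
            print("%s %s: queue=%s rejected=%s"
                  % (node["name"], pool_name, pool["queue"], pool["rejected"]))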
In the Elasticsearch logs I see the following warnings (more on the highlighter warning below the excerpt):
[2018-08-19T18:03:35,413][WARN ][o.e.d.s.f.s.h.UnifiedHighlighter] The length [1027315] of [http.response.body.keyword] field of [VdBbU2UBRaLJ6sM3Mg7X] doc of [packetbeat-2018.08.19] index has exceeded the allowed maximum of [1000000] set for the next major Elastic version. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!
[2018-08-19T18:03:35,418][WARN ][o.e.d.s.f.s.h.UnifiedHighlighter] The length [1027485] of [response] field of [VdBbU2UBRaLJ6sM3Mg7X] doc of [packetbeat-2018.08.19] index has exceeded the allowed maximum of [1000000] set for the next major Elastic version. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!
[2018-08-19T18:03:40,353][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][805] overhead, spent [411ms] collecting in the last [1.1s]
[2018-08-19T18:03:43,377][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][808] overhead, spent [447ms] collecting in the last [1s]
[2018-08-19T18:03:45,378][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][810] overhead, spent [505ms] collecting in the last [1s]
[2018-08-19T18:03:50,603][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][815] overhead, spent [428ms] collecting in the last [1.2s]
[2018-08-19T18:03:57,720][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][822] overhead, spent [430ms] collecting in the last [1.1s]
[2018-08-19T18:04:11,777][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][836] overhead, spent [560ms] collecting in the last [1s]
[2018-08-19T18:04:16,056][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][840] overhead, spent [389ms] collecting in the last [1.2s]
[2018-08-19T18:04:23,114][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][847] overhead, spent [364ms] collecting in the last [1s]
[2018-08-19T18:04:24,250][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][848] overhead, spent [408ms] collecting in the last [1.1s]
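Regarding the UnifiedHighlighter warnings at the top of the excerpt: the message itself points at the index.highlight.max_analyzed_offset index-level setting, and I believe it could be raised per index with the settings API if needed. I am not sure this warning is related to the slowness at all; the sketch below only shows what I had in mind, with the index and host taken from the warning and the value chosen arbitrarily:

import json
import urllib.request

# Index and host taken from the warning above; the value is only an example.
ES_API = "http://kubernetes-es:9200"
INDEX = "packetbeat-2018.08.19"

body = json.dumps({"index.highlight.max_analyzed_offset": 2000000}).encode("utf-8")
req = urllib.request.Request(
    ES_API + "/" + INDEX + "/_settings",
    data=body,
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print(resp.read().decode("utf-8"))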
I've already gone through the existing threads on this issue but could not find a solution.
This is a puzzle to me; I've spent a lot of time trying to fix it, without luck so far.
Any suggestions would be appreciated.