GC overhead reduces Elasticsearch performance

Infrastructure details:
3 * i3.xlarge instances for ES
1 * t2.medium instance for Logstash
Volume type: io1

Software details:
Elasticsearch Docker image: 6.3.0
Logstash: 5.3
Beats: 5.3

Requests: 350/sec

Purpose: logging

We are migrating from the managed Elasticsearch Service provided by AWS to ES on Kubernetes. To send data to the ES cluster provisioned on Kubernetes we are using Logstash. Below is the output config snippet:

output {
  # existing AWS-managed Elasticsearch cluster
  amazon_es {
    hosts => ["managed-aws-es"]
    region => "ap-southeast-1"
    index => "logstash-%{type}-%{+YYYY.MM.dd}"
    flush_size => 10
  }

  # new Elasticsearch cluster on Kubernetes
  elasticsearch {
    hosts => ["kubernetes-es:9200"]
    index => "logstash-%{type}-%{+YYYY.MM.dd}"
  }
}

When we start sending data to both clusters, shipping becomes slow.
My understanding is that Logstash cannot keep up with sending data to both clusters, which increases backpressure on Logstash, so logs are shipped slowly to both Elasticsearch clusters. In my opinion the backpressure on Logstash builds up because Elasticsearch is not able to process the incoming requests fast enough.
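(A way to check this would be the Logstash monitoring API on its default port 9600; a quick sketch, assuming the defaults:

curl -s localhost:9600/_node/stats?pretty
# the "pipeline" section reports events in/filtered/out and per-plugin duration_in_millis

)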

In the logs, I see the following messages:

[2018-08-19T18:03:35,413][WARN ][o.e.d.s.f.s.h.UnifiedHighlighter] The length [1027315] of [http.response.body.keyword] field of [VdBbU2UBRaLJ6sM3Mg7X] doc of [packetbeat-2018.08.19] index has exceeded the allowed maximum of [1000000] set for the next major Elastic version. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!

[2018-08-19T18:03:35,418][WARN ][o.e.d.s.f.s.h.UnifiedHighlighter] The length [1027485] of [response] field of [VdBbU2UBRaLJ6sM3Mg7X] doc of [packetbeat-2018.08.19] index has exceeded the allowed maximum of [1000000] set for the next major Elastic version. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!

[2018-08-19T18:03:40,353][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][805] overhead, spent [411ms] collecting in the last [1.1s]

[2018-08-19T18:03:43,377][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][808] overhead, spent [447ms] collecting in the last [1s]

[2018-08-19T18:03:45,378][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][810] overhead, spent [505ms] collecting in the last [1s]

[2018-08-19T18:03:50,603][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][815] overhead, spent [428ms] collecting in the last [1.2s]

[2018-08-19T18:03:57,720][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][822] overhead, spent [430ms] collecting in the last [1.1s]

[2018-08-19T18:04:11,777][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][836] overhead, spent [560ms] collecting in the last [1s]

[2018-08-19T18:04:16,056][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][840] overhead, spent [389ms] collecting in the last [1.2s]

[2018-08-19T18:04:23,114][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][847] overhead, spent [364ms] collecting in the last [1s]

[2018-08-19T18:04:24,250][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][848] overhead, spent [408ms] collecting in the last [1.1s]

I've already gone through existing threads on this issue but could not find a solution.

This is a puzzle to me; I've spent a lot of time trying to fix it but have had no luck so far.
Please suggest.

What does your Elasticsearch config look like? How much heap have you allocated to the nodes? What does the cluster stats API output look like?
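If needed, the output can be fetched with something like the following and the JSON pasted here:

GET _cluster/stats?human&pretty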

I'm using the Elasticsearch Docker image.

I can republish the stats later, as I've stopped sending data to the ES cluster for now.

Below are the stats:

{
  "_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "cluster_name": "myesdb",
  "timestamp": 1534704854650,
  "status": "green",
  "indices": {
    "count": 7,
    "shards": {
      "total": 62,
      "primaries": 31,
      "replication": 1,
      "index": {
        "shards": {
          "min": 2,
          "max": 10,
          "avg": 8.857142857142858
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 4.428571428571429
        },
        "replication": {
          "min": 1,
          "max": 1,
          "avg": 1
        }
      }
    },
    "docs": {
      "count": 37489,
      "deleted": 0
    },
    "store": {
      "size": "379.2mb",
      "size_in_bytes": 397666570
    },
    "fielddata": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "total_count": 0,
      "hit_count": 0,
      "miss_count": 0,
      "cache_size": 0,
      "cache_count": 0,
      "evictions": 0
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 319,
      "memory": "7.7mb",
      "memory_in_bytes": 8086897,
      "terms_memory": "6.4mb",
      "terms_memory_in_bytes": 6782898,
      "stored_fields_memory": "152.8kb",
      "stored_fields_memory_in_bytes": 156512,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "829.2kb",
      "norms_memory_in_bytes": 849152,
      "points_memory": "12.2kb",
      "points_memory_in_bytes": 12547,
      "doc_values_memory": "279kb",
      "doc_values_memory_in_bytes": 285788,
      "index_writer_memory": "0b",
      "index_writer_memory_in_bytes": 0,
      "version_map_memory": "0b",
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set": "0b",
      "fixed_bit_set_memory_in_bytes": 0,
      "max_unsafe_auto_id_timestamp": 1534701562015,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 6,
      "data": 3,
      "coordinating_only": 0,
      "master": 3,
      "ingest": 0
    },
    "versions": [
      "6.3.0"
    ],
    "os": {
      "available_processors": 12,
      "allocated_processors": 6,
      "names": [
        {
          "name": "Linux",
          "count": 6
        }
      ],
      "mem": {
        "total": "23.1gb",
        "total_in_bytes": 24878874624,
        "free": "2.6gb",
        "free_in_bytes": 2799800320,
        "used": "20.5gb",
        "used_in_bytes": 22079074304,
        "free_percent": 11,
        "used_percent": 89
      }
    },
    "process": {
      "cpu": {
        "percent": 17
      },
      "open_file_descriptors": {
        "min": 284,
        "max": 388,
        "avg": 334
      }
    },
    "jvm": {
      "max_uptime": "2.1h",
      "max_uptime_in_millis": 7765153,
      "versions": [
        {
          "version": "1.8.0_151",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "25.151-b12",
          "vm_vendor": "Oracle Corporation",
          "count": 6
        }
      ],
      "mem": {
        "heap_used": "819.1mb",
        "heap_used_in_bytes": 858904168,
        "heap_max": "1.4gb",
        "heap_max_in_bytes": 1557135360
      },
      "threads": 180
    },
    "fs": {
      "total": "447.7gb",
      "total_in_bytes": 480820887552,
      "free": "436.9gb",
      "free_in_bytes": 469157634048,
      "available": "416.7gb",
      "available_in_bytes": 447476035584
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 6
      },
      "http_types": {
        "security4": 6
      }
    }
  }
}

Based on those stats I would recommend the following:

  • Run one master/data node per host. It seems you currently run two nodes per host.
  • Assuming nothing else is running on the Elasticsearch hosts, set the heap to 50% of the available RAM.
  • You currently have very small shards. If this is the volume you expect going forward, reduce the number of primary shards per index, for example via an index template like the sketch below, and consider switching from daily to e.g. weekly or monthly indices. See this blog post for further guidance.
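As an illustration only (the template name, order, and shard counts are placeholders, not taken from your setup), a legacy index template along these lines would make new logstash-* indices use a single primary shard:

PUT _template/logstash-single-shard
{
  "index_patterns": ["logstash-*"],
  "order": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

Keep in mind that the Logstash elasticsearch output installs and manages its own template named logstash by default, so you may prefer to adjust that template (or its manage_template option) instead.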

Hi @Christian_Dahlqvist,

Thanks for replying.
The cluster stats I posted are from the ES cluster we provisioned on Kubernetes. You are right about running one master/data node per host; I'll make those changes.

We are migrating from the managed ES service on AWS to ES on k8s. These stats were captured after running the ES cluster for approximately 20 to 30 minutes, when the number of requests was low.

Below is the cluster stats for the managed ES on AWS.

{
  "_nodes": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "cluster_name": "XXXXXXXXXXX:pe-es-logs-prod",
  "timestamp": 1534705702672,
  "status": "green",
  "indices": {
    "count": 84,
    "shards": {
      "total": 832,
      "primaries": 416,
      "replication": 1,
      "index": {
        "shards": {
          "min": 2,
          "max": 10,
          "avg": 9.904761904761905
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 4.9523809523809526
        },
        "replication": {
          "min": 1,
          "max": 1,
          "avg": 1
        }
      }
    },
    "docs": {
      "count": 507124798,
      "deleted": 5
    },
    "store": {
      "size": "1.8tb",
      "size_in_bytes": 1998240198501,
      "throttle_time": "0s",
      "throttle_time_in_millis": 0
    },
    "fielddata": {
      "memory_size": "34.4kb",
      "memory_size_in_bytes": 35272,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "247.9mb",
      "memory_size_in_bytes": 260030288,
      "total_count": 25923042,
      "hit_count": 17825998,
      "miss_count": 8097044,
      "cache_size": 21554,
      "cache_count": 133229,
      "evictions": 111675
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 12412,
      "memory": "2.8gb",
      "memory_in_bytes": 3045784360,
      "terms_memory": "2.3gb",
      "terms_memory_in_bytes": 2503572592,
      "stored_fields_memory": "308.1mb",
      "stored_fields_memory_in_bytes": 323165272,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "47.5mb",
      "norms_memory_in_bytes": 49830144,
      "points_memory": "28.3mb",
      "points_memory_in_bytes": 29681152,
      "doc_values_memory": "133mb",
      "doc_values_memory_in_bytes": 139535200,
      "index_writer_memory": "57mb",
      "index_writer_memory_in_bytes": 59821900,
      "version_map_memory": "16.5kb",
      "version_map_memory_in_bytes": 16922,
      "fixed_bit_set": "0b",
      "fixed_bit_set_memory_in_bytes": 0,
      "max_unsafe_auto_id_timestamp": 1534636811120,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 2,
      "data": 2,
      "coordinating_only": 0,
      "master": 2,
      "ingest": 2
    },
    "versions": [
      "5.3.2"
    ],
    "os": {
      "available_processors": 4,
      "allocated_processors": 4,
      "names": [
        {
          "count": 2
        }
      ],
      "mem": {
        "total": "29.8gb",
        "total_in_bytes": 32061538304,
        "free": "595.6mb",
        "free_in_bytes": 624611328,
        "used": "29.2gb",
        "used_in_bytes": 31436926976,
        "free_percent": 2,
        "used_percent": 98
      }
    },
    "process": {
      "cpu": {
        "percent": 189
      },
      "open_file_descriptors": {
        "min": 1459,
        "max": 1460,
        "avg": 1459
      }
    },
    "jvm": {
      "max_uptime": "86.4d",
      "max_uptime_in_millis": 7470716361,
      "mem": {
        "heap_used": "11.1gb",
        "heap_used_in_bytes": 11922677008,
        "heap_max": "15.9gb",
        "heap_max_in_bytes": 17145004032
      },
      "threads": 283
    },
    "fs": {
      "total": "1.9tb",
      "total_in_bytes": 2164250746880,
      "free": "149.5gb",
      "free_in_bytes": 160558526464,
      "available": "47.1gb",
      "available_in_bytes": 50573815808
    },
    "network_types": {
      "transport_types": {
        "netty4": 2
      },
      "http_types": {
        "filter-jetty": 2
      }
    }
  }
}

We are expecting to handle the same amount of traffic on the ES cluster on k8s.

Why are we getting these warning messages?
How do I increase the heap size?

Is there any specific reason for these errors?

[2018-08-19T18:03:35,418][WARN ][o.e.d.s.f.s.h.UnifiedHighlighter] The length [1027485] of [response] field of [VdBbU2UBRaLJ6sM3Mg7X] doc of [packetbeat-2018.08.19] index has exceeded the allowed maximum of [1000000] set for the next major Elastic version. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!

[2018-08-19T18:03:40,353][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-2] [gc][805] overhead, spent [411ms] collecting in the last [1.1s]

Those 2 nodes have considerably more heap than in your new cluster, which is the problem. You can specify the heap size in the jvm.options config file.
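For example, the heap is controlled by the -Xms/-Xmx flags in jvm.options; the 4g below is just a placeholder, size it to roughly 50% of the RAM available to each node:

# jvm.options -- set both values to the same size
-Xms4g
-Xmx4g

With the official Docker image you can also pass the same flags via the ES_JAVA_OPTS environment variable, e.g. ES_JAVA_OPTS="-Xms4g -Xmx4g", which is often easier to manage in a Kubernetes manifest.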

Sure, I'll try increasing the heap size and run the cluster again.

Hi @Christian_Dahlqvist,

Increasing the heap size worked for us.

Thank you :slight_smile:

You can close the thread!

Hi @Christian_Dahlqvist,

We are working on keeping the ES pods on separate nodes. After increasing the heap size, ES was working fine, but after some time it became slow again.

Below are the _cluster/stats:

{
  "_nodes": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "cluster_name": "myesdb",
  "timestamp": 1534916786894,
  "status": "green",
  "indices": {
    "count": 17,
    "shards": {
      "total": 162,
      "primaries": 81,
      "replication": 1,
      "index": {
        "shards": {
          "min": 2,
          "max": 10,
          "avg": 9.529411764705882
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 4.764705882352941
        },
        "replication": {
          "min": 1,
          "max": 1,
          "avg": 1
        }
      }
    },
    "docs": {
      "count": 5290609,
      "deleted": 1
    },
    "store": {
      "size": "34.8gb",
      "size_in_bytes": 37373921725
    },
    "fielddata": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "total_count": 0,
      "hit_count": 0,
      "miss_count": 0,
      "cache_size": 0,
      "cache_count": 0,
      "evictions": 0
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 1277,
      "memory": "80.9mb",
      "memory_in_bytes": 84835917,
      "terms_memory": "64.8mb",
      "terms_memory_in_bytes": 67965893,
      "stored_fields_memory": "7.1mb",
      "stored_fields_memory_in_bytes": 7508760,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "4.7mb",
      "norms_memory_in_bytes": 4929408,
      "points_memory": "456.6kb",
      "points_memory_in_bytes": 467604,
      "doc_values_memory": "3.7mb",
      "doc_values_memory_in_bytes": 3964252,
      "index_writer_memory": "92.6mb",
      "index_writer_memory_in_bytes": 97164120,
      "version_map_memory": "0b",
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set": "0b",
      "fixed_bit_set_memory_in_bytes": 0,
      "max_unsafe_auto_id_timestamp": 1534914982609,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 5,
      "data": 3,
      "coordinating_only": 0,
      "master": 2,
      "ingest": 0
    },
    "versions": [
      "6.3.0"
    ],
    "os": {
      "available_processors": 20,
      "allocated_processors": 5,
      "names": [
        {
          "name": "Linux",
          "count": 5
        }
      ],
      "mem": {
        "total": "149.7gb",
        "total_in_bytes": 160780206080,
        "free": "3.2gb",
        "free_in_bytes": 3477340160,
        "used": "146.4gb",
        "used_in_bytes": 157302865920,
        "free_percent": 2,
        "used_percent": 98
      }
    },
    "process": {
      "cpu": {
        "percent": 59
      },
      "open_file_descriptors": {
        "min": 262,
        "max": 677,
        "avg": 510
      }
    },
    "jvm": {
      "max_uptime": "1.3d",
      "max_uptime_in_millis": 115213630,
      "versions": [
        {
          "version": "1.8.0_151",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "25.151-b12",
          "vm_vendor": "Oracle Corporation",
          "count": 5
        }
      ],
      "mem": {
        "heap_used": "8.9gb",
        "heap_used_in_bytes": 9654245712,
        "heap_max": "36.8gb",
        "heap_max_in_bytes": 39588069376
      },
      "threads": 155
    },
    "fs": {
      "total": "327.9gb",
      "total_in_bytes": 352119877632,
      "free": "271.7gb",
      "free_in_bytes": 291810332672,
      "available": "256.7gb",
      "available_in_bytes": 275728543744
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 5
      },
      "http_types": {
        "security4": 5
      }
    }
  }
}

What do you mean by "it became slow"? What is slow?

Hi @Christian_Dahlqvist,

We are also trying to figure out the problem on our end. Usually the number of documents sent by Packetbeat to ES is approximately 4000 documents/min, and I'm able to view the data in Kibana with a lag of 2-3 minutes. Our realtime analysis works on this data in Elasticsearch.

The delay, which was previously 2-3 minutes, has grown to 35-40 minutes, and we are also experiencing data loss. We are only receiving ~1200 docs/min.

My first guess was that Logstash could not handle the backpressure, so we upgraded the Logstash server to m4.large, but it didn't help.

Above are the cluster stats. My primary suspect is:

    "mem": {
        "total": "149.7gb",
        "total_in_bytes": 160780206080,
        "free": "3.2gb",
        "free_in_bytes": 3477340160,
        "used": "146.4gb",
        "used_in_bytes": 157302865920,
        "free_percent": 2,
        "used_percent": 98
      }

Memory was 98% used, so I ran POST _flush, but it filled up again.

We ran ES for 1 day and inserted only 5290609 documents:

"docs": {
  "count": 5290609,
  "deleted": 1
},

Could you please advise?

That is indeed very slow. Logstash is generally limited by CPU, so I do not think a t2 instance type is suitable for Logstash, as it has very limited CPU quotas. I would recommend upgrading to a small m4/m5 to see if that makes a difference. I suspect it will.

Thanks @Christian_Dahlqvist, upgrading Logstash and increasing the heap size worked for me. :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.