Error code 429 - circuit_breaking_exception

Hi Elastic team,

I got this error from Logstash logs.

[2019-10-07T07:40:55,341][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"circuit_breaking_exception", "reason"=>"[parent] Data too large, data for [<transport_request>] would be [16414928216/15.2gb], which is larger than the limit of [16320875724/15.1gb], real usage: [16414925088/15.2gb], new bytes reserved: [3128/3kb]", "bytes_wanted"=>16414928216, "bytes_limit"=>16320875724, "durability"=>"TRANSIENT"})

I know there are other posts about this error, but I don't clearly understand the explanations there.

My cluster hardware specification details

  • Master: 3 nodes (4 CPU, 32GB RAM, 16GB heap)
  • Hot-data (and ingest): 12 nodes (8 CPU, 64GB RAM, 32GB heap)
  • Warm-data: 3 nodes (8 CPU, 64GB RAM, 32GB heap)
  • Cold-data: 3 nodes (8 CPU, 64GB RAM, 32GB heap)

My cluster usage

  • Indexing rate: 5,000 - 10,000+ / sec (Only primary shards)
  • Indices: 3500+
  • Primary shards: 4594
  • Active shards: 9192
  • Shards / hot-data node: ~300 (Max is 400+ during ILM)
  • Shards / warm-data node: ~600 - 700 (Max is 800+ during ILM) (Indices are read-only)
  • Shards / cold-data node: ~1100 - 1200 (Max is 1300+ during ILM) (Indices are read-only and frozen)

In the other posts, the suggestion for this error is to increase the heap size.
However, I'm not sure which nodes I should increase the heap size on: master, or hot-data (ingest)?

Thank you
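For anyone trying to narrow this down, one way to see which node type is actually running low on heap is to check per-node heap usage and parent circuit-breaker statistics. A sketch, assuming the cluster listens on `localhost:9200` (adjust the host to one of your nodes):

```shell
# Heap usage per node, with roles (m = master, d = data, i = ingest)
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,heap.max'

# Parent circuit-breaker usage and limit per node
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'
```

The node whose `parent` breaker `estimated_size` sits close to its `limit_size` is the one tripping the 429s.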

@worapojc Very recently I had circuit breaker exceptions that were caused by the heap setting in jvm.options on our ingest nodes. Those only had 4GB RAM. I increased it to 12GB and haven't seen any circuit breaker errors since.

Thank you. It's strange in my case: the ingest nodes already have a 32GB heap; only the master nodes have a 16GB heap.

@worapojc Could it be that you are sending those requests to the master node(s) given that the circuit breaker exception contains a reference to ~16GB of heap?
I think removing the hosts of the master nodes from your logstash configuration and sending requests to your hot nodes might help here.

In case it doesn't feel free to share your jvm.options so I can take a look and see if something can be optimized there.
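For reference, a Logstash `elasticsearch` output restricted to the hot nodes would look something like this (the hostnames are placeholders, not from the thread):

```
output {
  elasticsearch {
    # Point only at the hot-data (ingest) nodes, never the masters
    hosts => ["http://hot-data-01:9200", "http://hot-data-02:9200", "http://hot-data-03:9200"]
  }
}
```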

Thanks, Armin. The logstash output configuration only has the hot-data nodes.

Currently, I have increased the heap of the master nodes to 32GB.
That issue has been fixed, but another one remains.

Some API calls time out:

  • _cat/shards
  • _cat/nodes
  • _cat/indices
  • _cluster/stats

The response is:

{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}

@worapojc Elasticsearch never returns a 504 from its APIs. The issue must be coming from something (HTTP proxy of some sort or Kibana) between the client and ES. You shouldn't see those errors when directly calling the ES REST API.
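To confirm where the latency is, the calls can be timed directly against Elasticsearch, bypassing any proxy or Kibana. A sketch, assuming a node at `localhost:9200`:

```shell
for path in _cat/shards _cat/nodes _cat/indices _cluster/stats; do
  # curl's %{time_total} prints the total request time in seconds
  curl -s -o /dev/null -w "${path}: %{time_total}s\n" "http://localhost:9200/${path}"
done
```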

Thanks @Armin_Braun.

I've tested the APIs directly. The response times are high.

  • _cat/shards : 92.859375s
  • _cat/nodes : 90.799507s
  • _cat/indices : 97.222450s
  • _cluster/stats : 84.900175s

The cluster still works for indexing and searching, but this API response time issue has persisted for a week. Kibana monitoring is malfunctioning.

How to resolve this issue?

@willemdh, this is because ES keeps some data structures permanently on the heap, and their size is closely related to the amount of data you have indexed. We have experimented a lot with this, and the only solution we found is, first, to follow the ES tuning suggestions for indexed data; second, to design the system to scale horizontally so that each data node holds a smaller amount of data, which in turn consumes less heap.
The right heap size differs per use case. Increasing RAM beyond a certain point is not a solution on its own, but the more RAM you give the system to play with, the faster your query responses are.

It's hard to tell what causes this; there are multiple possible causes. I would look into whether one or more of your nodes is abnormally slow for some reason (e.g. they could be swapping, which would likely show up as very long GC times and corresponding warnings in their logs).
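To check for long GC pauses without digging through every node's logs, the per-node JVM stats are one place to look. A sketch, assuming a node at `localhost:9200`:

```shell
# Old-generation GC collection counts and cumulative time per node;
# a large collection_time_in_millis relative to node uptime suggests GC pressure
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -A3 '"old"'
```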

Thanks Armin. I did a rolling restart of all hot-data nodes. The APIs are fine now.