Elastic cluster "Received ban for the parent [channel closed]", Apache Metron "listener timeout after waiting for"

ThreatInter · December 14, 2020, 5:23am

Hello, we using ES cluster with 3 nodes, each datanode. We faced with problem that sometimes we can't write into our cluster using Apache Metron :

java.io.IOException: listener timeout after waiting for [60000] ms at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:660) at org.elasticsearch.client.RestClient.performRequest(RestClient.java:219) at org.elasticsearch.client.RestClient.performRequest(RestClient.java:191) at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:396) at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:382) at org.elasticsearch.client.RestHighLevelClient.bulk(RestHighLevelClient.java:197) at org.apache.metron.elasticsearch.bulk.ElasticsearchBulkDocumentWriter.write(ElasticsearchBulkDocumentWriter.java:89) at org.apache.metron.elasticsearch.writer.ElasticsearchWriter.write(ElasticsearchWriter.java:105) at org.apache.metron.writer.BulkWriterComponent.flush(BulkWriterComponent.java:123) at org.apache.metron.writer.BulkWriterComponent.applyShouldFlush(BulkWriterComponent.java:179) at org.apache.metron.writer.BulkWriterComponent.write(BulkWriterComponent.java:99) at org.apache.metron.writer.bolt.BulkMessageWriterBolt.execute(BulkMessageWriterBolt.java:303) at org.apache.storm.daemon.executor$fn__10219$tuple_action_fn__10221.invoke(executor.clj:745) at org.apache.storm.daemon.executor$mk_task_receiver$fn__10138.invoke(executor.clj:473) at org.apache.storm.disruptor$clojure_handler$reify__4115.onEvent(disruptor.clj:41) at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:509) at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:487) at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:74) at org.apache.storm.daemon.executor$fn__10219$fn__10232$fn__10287.invoke(executor.clj:868) at org.apache.storm.util$async_loop$fn__1221.invoke(util.clj:484) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:748)

At this time at cluster nodes we get this messages:

[2020-12-14T10:00:04,292][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280489] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,463][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280512] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,523][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280620] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,631][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280799] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,689][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280902] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,857][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53281046] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]

And also:

org.elasticsearch.transport.TransportException: failure to send
...
Caused by: org.elasticsearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks

Have someone else faced with it? Help us to understand what's going on, please!

Christian_Dahlqvist · December 14, 2020, 6:41am

Which version of Elasticsearch are you using? What does your configuration look like?

ThreatInter · December 14, 2020, 6:48am

Elasticsearch version 7.9.1. How I can upload file with settings? Seems like I can attach only images to my messages

DavidTurner · December 14, 2020, 7:02am

Does this client enable TCP keepalives, and is your OS configured to send them promptly? If not, that would explain this.

ThreatInter · December 14, 2020, 7:22am

Elastic nodes have this settings:
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_time = 300

Client in case Apache Metron - Elastic is Elastic isn't it?

DavidTurner · December 14, 2020, 7:40am

Those settings sound good for the Elasticsearch nodes but you also need keepalives on the connection from Metron and Elasticsearch. I have no experience with Metron so can't tell you how to do that.

Another possible explanation is that Metron is configured to time out requests after 60 seconds, but the failing request simply needs longer.

ThreatInter · December 14, 2020, 7:42am

Ok, will try. Thank you for your answers.

Oh, we have trobles between ES nodes too. How we can deal with it?

[2020-12-14T12:42:43,885][DEBUG][o.e.a.s.TransportSearchAction]
...
org.elasticsearch.transport.TransportException: failure to send
...
Caused by: org.elasticsearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks

Christian_Dahlqvist · December 14, 2020, 7:54am

How have you secured the cluster? Are you using any third party plugins?

DavidTurner · December 14, 2020, 7:55am

The message you quote is a DEBUG message and can therefore be ignored. It indicates that a search was cancelled because the client disconnected, which is the expected behaviour.

ThreatInter · December 14, 2020, 7:58am

Just time of this DEBUG messages and errors from Metron is the same*. So I thought that they linked some way.

*We see this DEBUG messages and Metron errors about the same time

DavidTurner · December 14, 2020, 8:04am

Yes, they are. They indicate that the client disconnected.

ThreatInter · December 14, 2020, 10:35am

We set Metron's storms settings same as ES nodes:

net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_time = 300

And it doesn't help. We again get errors during writing to ES.
Now we also have this message:

[2020-12-14T15:48:36,265][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][430695] overhead, spent [191ms] collecting in the last [1s]

One of nodes left cluster

[2020-12-14T15:42:33,493][DEBUG][o.e.c.s.MasterService    ] [h1-es02] executing cluster state update for [node-left[{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{-bqqYp4XQbSAio6SjrVMlw}{h1-es01ip}{h1-es01ip}{dimr} reason: followers ch
eck retry count exceeded]]

DavidTurner · December 14, 2020, 11:16am

That may not be enough, the client needs to specifically request keepalives on each connection too. One way to check this is to run sudo netstat -anto and verify that the connections from the client do have a keepalive timer.

Also did you address this:

ThreatInter · December 14, 2020, 12:16pm

looks like connection haven't keepalive timer

system · January 11, 2021, 12:16pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch-hadoop sporadic timeouts Elasticsearch	13	557	July 6, 2017
ElasticSearch : observer: timeout notification from cluster service Elasticsearch	10	11206	July 5, 2017
Cluster health times out Elasticsearch	18	1624	July 6, 2017
Timeout notification from cluster service Elasticsearch	4	3312	July 6, 2017
Hung node, cluster state green Elasticsearch	6	1155	July 6, 2017

Elastic cluster "Received ban for the parent [channel closed]", Apache Metron "listener timeout after waiting for"

Related topics