Elastic cluster "Received ban for the parent [channel closed]", Apache Metron "listener timeout after waiting for"

Hello, we are using an ES cluster with 3 nodes, each a data node. We are facing a problem where sometimes we can't write to our cluster from Apache Metron:

java.io.IOException: listener timeout after waiting for [60000] ms
    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:660)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:219)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:191)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:396)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:382)
    at org.elasticsearch.client.RestHighLevelClient.bulk(RestHighLevelClient.java:197)
    at org.apache.metron.elasticsearch.bulk.ElasticsearchBulkDocumentWriter.write(ElasticsearchBulkDocumentWriter.java:89)
    at org.apache.metron.elasticsearch.writer.ElasticsearchWriter.write(ElasticsearchWriter.java:105)
    at org.apache.metron.writer.BulkWriterComponent.flush(BulkWriterComponent.java:123)
    at org.apache.metron.writer.BulkWriterComponent.applyShouldFlush(BulkWriterComponent.java:179)
    at org.apache.metron.writer.BulkWriterComponent.write(BulkWriterComponent.java:99)
    at org.apache.metron.writer.bolt.BulkMessageWriterBolt.execute(BulkMessageWriterBolt.java:303)
    at org.apache.storm.daemon.executor$fn__10219$tuple_action_fn__10221.invoke(executor.clj:745)
    at org.apache.storm.daemon.executor$mk_task_receiver$fn__10138.invoke(executor.clj:473)
    at org.apache.storm.disruptor$clojure_handler$reify__4115.onEvent(disruptor.clj:41)
    at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:509)
    at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:487)
    at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:74)
    at org.apache.storm.daemon.executor$fn__10219$fn__10232$fn__10287.invoke(executor.clj:868)
    at org.apache.storm.util$async_loop$fn__1221.invoke(util.clj:484)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:748)

At the same time, on the cluster nodes we get these messages:

[2020-12-14T10:00:04,292][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280489] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,463][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280512] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,523][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280620] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,631][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280799] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,689][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53280902] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]
[2020-12-14T10:00:04,857][DEBUG][o.e.t.TaskCancellationService] [h1-es03] Received ban for the parent [MT3BSgtaQBWux8BJDBSsHg:53281046] on the node [Qshtg7-TQIyxeiccpkmlIA], reason: [channel closed]

And also:

org.elasticsearch.transport.TransportException: failure to send
...
Caused by: org.elasticsearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks

Has anyone else faced this? Please help us understand what's going on!

Which version of Elasticsearch are you using? What does your configuration look like?

Elasticsearch version 7.9.1. How can I upload a file with the settings? It seems I can only attach images to my messages.

Does this client enable TCP keepalives, and is your OS configured to send them promptly? If not, that would explain this.


The Elastic nodes have these settings:
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_time = 300

In the Apache Metron -> Elastic case, the client is Elastic, isn't it?

Those settings sound good for the Elasticsearch nodes, but you also need keepalives on the connection from Metron to Elasticsearch. I have no experience with Metron so I can't tell you how to do that.
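
For illustration only, here is a rough sketch of what requesting keepalives looks like with the standard Elasticsearch RestClient builder. The host name, port, and class name below are placeholders, and whether or where Metron exposes this in its own configuration is something you would need to check in Metron itself:

import org.apache.http.HttpHost;
import org.apache.http.impl.nio.reactor.IOReactorConfig;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

public class KeepAliveClientSketch {
    public static RestClient build() {
        // Ask the underlying Apache HTTP async I/O reactor to set SO_KEEPALIVE on its
        // sockets, so the kernel tcp_keepalive_* settings actually apply to the
        // Metron -> Elasticsearch connections.
        RestClientBuilder builder = RestClient.builder(new HttpHost("h1-es01", 9200, "http"))
            .setHttpClientConfigCallback(httpClientBuilder ->
                httpClientBuilder.setDefaultIOReactorConfig(
                    IOReactorConfig.custom()
                        .setSoKeepAlive(true)
                        .build()));
        return builder.build();
    }
}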

Another possible explanation is that Metron is configured to time out requests after 60 seconds, but the failing request simply needs longer.
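
Again only as a sketch, not Metron's actual configuration: if the 60-second client-side timeout is the problem, the RestClient builder lets you raise the relevant timeouts. The values below are arbitrary examples:

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

public class LongerTimeoutSketch {
    public static RestClient build() {
        RestClientBuilder builder = RestClient.builder(new HttpHost("h1-es01", 9200, "http"))
            // Allow individual requests (e.g. large bulk writes) more time before the
            // client gives up waiting for a response.
            .setRequestConfigCallback(requestConfigBuilder ->
                requestConfigBuilder
                    .setConnectTimeout(5000)
                    .setSocketTimeout(120000));
        // Pre-7.0 low-level clients also have builder.setMaxRetryTimeoutMillis(...),
        // which is the timeout those clients report as
        // "listener timeout after waiting for [60000] ms"; it was removed in 7.x.
        return builder.build();
    }
}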


Ok, will try. Thank you for your answers.

Oh, we have troubles between the ES nodes too. How can we deal with it?

[2020-12-14T12:42:43,885][DEBUG][o.e.a.s.TransportSearchAction]
...
org.elasticsearch.transport.TransportException: failure to send
...
Caused by: org.elasticsearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks

How have you secured the cluster? Are you using any third party plugins?

The message you quote is a DEBUG message and can therefore be ignored. It indicates that a search was cancelled because the client disconnected, which is the expected behaviour.


It's just that the timestamps of these DEBUG messages and the errors from Metron are the same*, so I thought they were linked in some way.

*We see these DEBUG messages and the Metron errors at about the same time.

Yes, they are. They indicate that the client disconnected.


We set the same settings on Metron's Storm nodes as on the ES nodes:

net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_time = 300

And it doesn't help; we still get errors while writing to ES.
Now we also see this message:

[2020-12-14T15:48:36,265][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][430695] overhead, spent [191ms] collecting in the last [1s]

One of the nodes left the cluster:

[2020-12-14T15:42:33,493][DEBUG][o.e.c.s.MasterService    ] [h1-es02] executing cluster state update for [node-left[{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{-bqqYp4XQbSAio6SjrVMlw}{h1-es01ip}{h1-es01ip}{dimr} reason: followers check retry count exceeded]]

That may not be enough: the client needs to specifically request keepalives on each connection too. One way to check this is to run sudo netstat -anto and verify that the connections from the client do have a keepalive timer.

Also, did you address this:

looks like the connection doesn't have a keepalive timer
