Master node's Elasticsearch sometimes dies while ingesting data from Filebeat

I have a cluster with 2 nodes: a master node and a data node. Both were fine until recently, but things have become a mess lately: ingesting CSV data from Filebeat is really slow, and the Kibana UI is really slow to open.

This is what is shown in the Logstash logs:

[WARN ] 2020-07-21 21:02:13.680 [[main]>worker1] elasticsearch - Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://127.0.0.1:9200/, :error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[WARN ] 2020-07-21 21:02:15.124 [[main]>worker0] elasticsearch - Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://127.0.0.1:9200/, :error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[ERROR] 2020-07-21 21:02:16.281 [[main]>worker1] elasticsearch - Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[ERROR] 2020-07-21 21:02:16.281 [[main]>worker0] elasticsearch - Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[ERROR] 2020-07-21 21:02:25.082 [[main]>worker1] elasticsearch - Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
[ERROR] 2020-07-21 21:02:29.656 [[main]>worker1] elasticsearch - Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>8}
[ERROR] 2020-07-21 21:02:31.053 [[main]>worker0] elasticsearch - Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
[ERROR] 2020-07-21 21:02:36.502 [[main]>worker0] elasticsearch - Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>8}
[ERROR] 2020-07-21 21:02:39.192 [[main]>worker1] elasticsearch - Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>16}
[WARN ] 2020-07-21 21:02:49.121 [Ruby-0-Thread-5: :1] elasticsearch - Restored connection to ES instance {:url=>"http://127.0.0.1:9200/"}
[ERROR] 2020-07-21 21:02:49.723 [[main]>worker0] elasticsearch - Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>16}
[WARN ] 2020-07-21 21:04:06.500 [[main]>worker1] elasticsearch - Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://127.0.0.1:9200/, :error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[ERROR] 2020-07-21 21:04:06.707 [[main]>worker1] elasticsearch - Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>32}
[ERROR] 2020-07-21 21:04:27.520 [[main]>worker0] elasticsearch - Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>32}
[ERROR] 2020-07-21 21:04:54.871 [[main]>worker1] elasticsearch - Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>64}
[WARN ] 2020-07-21 21:04:55.353 [Ruby-0-Thread-5: :1] elasticsearch - Restored connection to ES instance {:url=>"http://127.0.0.1:9200/"}

Filebeat:

2020-07-21T20:23:50.051+0700    DEBUG   [input] input/input.go:152      Run input
2020-07-21T20:23:50.051+0700    DEBUG   [input] log/input.go:191        Start next scan
2020-07-21T20:23:50.055+0700    DEBUG   [input] log/input.go:421        Check file for harvesting: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv
2020-07-21T20:23:50.056+0700    DEBUG   [input] log/input.go:511        Update existing file for harvesting: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv, offset: 179504
2020-07-21T20:23:50.056+0700    DEBUG   [input] log/input.go:563        Harvester for file is still running: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv
2020-07-21T20:23:50.056+0700    DEBUG   [input] log/input.go:212        input states cleaned up. Before: 1, After: 1, Pending: 0
2020-07-21T20:24:00.057+0700    DEBUG   [input] input/input.go:152      Run input
2020-07-21T20:24:00.057+0700    DEBUG   [input] log/input.go:191        Start next scan
2020-07-21T20:24:00.059+0700    DEBUG   [input] log/input.go:421        Check file for harvesting: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv
2020-07-21T20:24:00.059+0700    DEBUG   [input] log/input.go:511        Update existing file for harvesting: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv, offset: 179504
2020-07-21T20:24:00.059+0700    DEBUG   [input] log/input.go:563        Harvester for file is still running: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv
2020-07-21T20:24:00.059+0700    DEBUG   [input] log/input.go:212        input states cleaned up. Before: 1, After: 1, Pending: 0
2020-07-21T20:24:03.482+0700    DEBUG   [transport]     transport/client.go:205 handle error: read tcp 10.64.233.87:57431->10.64.2.246:5044: i/o timeout
2020-07-21T20:24:03.482+0700    ERROR   [logstash]      logstash/async.go:279   Failed to publish events caused by: read tcp 10.64.233.87:57431->10.64.2.246:5044: i/o timeout
2020-07-21T20:24:03.490+0700    DEBUG   [transport]     transport/client.go:118 closing
2020-07-21T20:24:03.491+0700    ERROR   [logstash]      logstash/async.go:279   Failed to publish events caused by: read tcp 10.64.233.87:57431->10.64.2.246:5044: i/o timeout
2020-07-21T20:24:03.491+0700    DEBUG   [logstash]      logstash/async.go:171   1249 events out of 1249 events sent to logstash host 10.64.2.246:5044. Continue sending
2020-07-21T20:24:03.491+0700    INFO    [publisher]     pipeline/retry.go:173   retryer: send wait signal to consumer
2020-07-21T20:24:03.491+0700    ERROR   [logstash]      logstash/async.go:279   Failed to publish events caused by: read tcp 10.64.233.87:57431->10.64.2.246:5044: i/o timeout
2020-07-21T20:24:03.492+0700    INFO    [publisher]     pipeline/retry.go:175     done
2020-07-21T20:24:03.560+0700    DEBUG   [logstash]      logstash/async.go:171   1512 events out of 1512 events sent to logstash host 10.64.2.246:5044. Continue sending
2020-07-21T20:24:03.560+0700    DEBUG   [logstash]      logstash/async.go:127   close connection
2020-07-21T20:24:03.562+0700    ERROR   [logstash]      logstash/async.go:279   Failed to publish events caused by: client is not connected
2020-07-21T20:24:03.562+0700    DEBUG   [logstash]      logstash/async.go:127   close connection
2020-07-21T20:24:04.942+0700    ERROR   [publisher_pipeline_output]     pipeline/output.go:127  Failed to publish events: client is not connected
2020-07-21T20:24:04.942+0700    INFO    [publisher_pipeline_output]     pipeline/output.go:101  Connecting to backoff(async(tcp://10.64.2.246:5044))
2020-07-21T20:24:04.943+0700    DEBUG   [logstash]      logstash/async.go:119   connect
2020-07-21T20:24:04.954+0700    INFO    [publisher_pipeline_output]     pipeline/output.go:111  Connection to backoff(async(tcp://10.64.2.246:5044)) established
2020-07-21T20:24:04.961+0700    DEBUG   [logstash]      logstash/async.go:171   55 events out of 55 events sent to logstash host 10.64.2.246:5044. Continue sending
2020-07-21T20:24:04.961+0700    INFO    [publisher]     pipeline/retry.go:196   retryer: send unwait-signal to consumer
2020-07-21T20:24:04.961+0700    INFO    [publisher]     pipeline/retry.go:198     done
2020-07-21T20:24:04.981+0700    DEBUG   [logstash]      logstash/async.go:171   1249 events out of 1249 events sent to logstash host 10.64.2.246:5044. Continue sending
2020-07-21T20:24:10.061+0700    DEBUG   [input] input/input.go:152      Run input
2020-07-21T20:24:10.061+0700    DEBUG   [input] log/input.go:191        Start next scan
2020-07-21T20:24:10.063+0700    DEBUG   [input] log/input.go:421        Check file for harvesting: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv
2020-07-21T20:24:10.063+0700    DEBUG   [input] log/input.go:511        Update existing file for harvesting: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv, offset: 179504
2020-07-21T20:24:10.063+0700    DEBUG   [input] log/input.go:563        Harvester for file is still running: C:\Program Files\Filebeat\call-log-june28\Call Record 12 - 18 July 2020-CBN PJ.csv
2020-07-21T20:24:10.063+0700    DEBUG   [input] log/input.go:212        input states cleaned up. Before: 1, After: 1, Pending: 0
2020-07-21T20:24:12.843+0700    INFO    [monitoring]    log/log.go:145  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":24328,"time":{"ms":32}},"total":{"ticks":46171,"time":{"ms":94},"value":46171},"user":{"ticks":21843,"time":{"ms":62}}},"handles":{"open":255},"info":{"ephemeral_id":"5532913f-11e4-4cb6-a995-57ac5380eb3c","uptime":{"ms":14881709}},"memstats":{"gc_next":23980544,"memory_alloc":16322784,"memory_total":1003854240,"rss":3039232},"runtime":{"goroutines":29}},"filebeat":{"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"active":232,"batches":4,"failed":4096,"total":4328},"read":{"errors":1},"write":{"bytes":98868}},"pipeline":{"clients":1,"events":{"active":4117,"retry":5608}}},"registrar":{"states":{"current":2}}}}}

The master node's Elasticsearch sometimes dies, which makes me wonder whether this is a hardware, memory, or software problem. The data node's Elasticsearch has never died whenever I check it after Logstash stops printing to stdout. It does eventually work after a lot of waiting, but I can't keep doing that every time. I need a solution. Any response or suggestion will be appreciated a lot. Thank you.

It's almost always one of those 🙂

Check your Elasticsearch logs, especially the Java garbage collection logs. I've found that clusters that used to work and then start having problems usually have heap problems.
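For example, a minimal sketch of how to check, assuming a default install with Elasticsearch on localhost:9200 and logs under /var/log/elasticsearch (adjust the host and paths for your setup):

# Heap usage per node; a heap.percent that stays high and never drops is a sign of heap pressure
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,heap.max,ram.percent'

# Look for long or frequent garbage collections in the Elasticsearch log
grep -i 'gc' /var/log/elasticsearch/*.log | tail -n 50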


Dear Rugenl @rugenl, thank you so much for your reply. Would you mind telling me more about what you mean? What should I do? Thank you.

Very likely you are out of heap, as that's a major cause of the master freezing and the whole cluster being a mess. How much RAM/heap do these nodes have, and is it really one master-only node and one data-only node? That's an unusual configuration (and no data redundancy at all).


Dear Steve @Steve_Mushero, thank you so much for your response. Yes, it only has 2 nodes: a master node and a data node. It was a single-node cluster, then I decided to add a data node to the running cluster.

The Elasticsearch monitoring page shows this:

Nodes: 2
Disk Available: 84.78% (85.6 GB / 100.9 GB)
JVM Heap: 44.73% (916.1 MB / 2.0 GB)

Kibana is shown with red status:

Instances: 1
Connections: 0
Memory Usage: 4.92% (71.7 MB / 1.4 GB)

Judging by this memory allocation, is everything alright?

So if you had one node and added a 'data' node, then you have two nodes: one master/data and one data-only, which makes more sense. Make sure the master node has more RAM, and if they both have enough, make both of them master-eligible (but you also need a voting-only master if you're on v7, so you have three).
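For reference, a minimal sketch of what that third, voting-only tiebreaker node could look like in elasticsearch.yml, assuming Elasticsearch 7.3 or later (the node name is hypothetical; check the docs for your exact version):

# elasticsearch.yml on a small third node that only breaks ties in master elections
node.name: voting-only-1     # hypothetical name
node.master: true            # master-eligible...
node.voting_only: true       # ...but only votes, never becomes the active master
node.data: false             # holds no data
node.ingest: false           # runs no ingest pipelines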

As @rugenl notes, check the Java garbage collection logs to see how long GC is taking. Something is stalling, which is usually GC, but also: do you have swap enabled on these VMs, and is swapping actually happening?
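A quick way to check both, assuming shell access to the VMs and Elasticsearch on localhost:9200 (a sketch, adjust for your setup):

# Is swap configured and in use right now?
free -m
swapon --show

# Is mlockall enabled in Elasticsearch (bootstrap.memory_lock)?
curl -s 'http://localhost:9200/_nodes?filter_path=**.mlockall&pretty'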

And you mention "Master node's Elasticsearch could die sometimes": what does that mean, and how does it die? What is in its logs, e.g. heap errors, or in the kernel logs, e.g. an OOM kill?

Dear Steve,
here is the memory allocation on the master-node server:

free -m
              total        used        free      shared  buff/cache   available
Mem:           1819        1613          72           2         132          65
Swap:          4095         847        3248

Here is the memory allocation on the data-node server:

free -m
              total        used        free      shared  buff/cache   available
Mem:           7821        1606        2864         408        3350        5509
Swap:          2047           0        2047

Judging from this state, is it fair to say that the cause of this problem is the master node's RAM allocation?

=====================================================================

By "die" I mean that Filebeat reports the Elasticsearch cluster as dead and only reconnects after a lot of retries; the same thing is shown by the Logstash stdout too.

I will read up on Java garbage collection very soon.
Thank you so much for your kind responses.

Your master VM has less than 2 GB of RAM and is swapping? That will certainly be a problem and cause these issues, especially if there is a 2 GB heap (I can't tell if that's per node or total). Is memlock enabled? I assume so, but that's still not much RAM.
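If it turns out memlock is not enabled, a minimal sketch of turning it on for a systemd-based package install (paths and the systemctl edit step assume the default RPM/DEB layout):

# /etc/elasticsearch/elasticsearch.yml
bootstrap.memory_lock: true

# systemd override, e.g. created with: sudo systemctl edit elasticsearch
[Service]
LimitMEMLOCK=infinity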

It's not clear if you are running Logstash on that node, or where Kibana is running, but ideally you'd have two nodes with the same amount of RAM, such as 8 GB, and start from there with a 4 GB heap for ES, and put Kibana somewhere else (though that's not a big issue).
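For example, on an 8 GB node you would give Elasticsearch a fixed 4 GB heap in jvm.options (a sketch assuming a package install; the usual guidance is roughly half of RAM for the heap, with -Xms and -Xmx set equal):

# /etc/elasticsearch/jvm.options
-Xms4g
-Xmx4g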


Logstash and Kibana are running on the same server as the master node.
I will try to add more RAM, and I hope this solves the problem. Thank you for your suggestion. I hope this will work.
