Elasticsearch crashes after a few minutes on VM

I have two identical VMs running ELK 6.6.2; they are independent and not configured as a cluster.
On one, Elasticsearch runs fine; on the other, Elasticsearch crashes a few minutes after starting.
Here is the OS version:

uname -a
Linux XXXXXXXXXX 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Here are the ELK versions:

/usr/share/elasticsearch/bin/elasticsearch --version
Version: 6.6.2, Build: default/rpm/3bd3e59/2019-03-06T15:16:26.864148Z, JVM: 1.8.0_201

/usr/share/kibana/bin/kibana --version
6.6.2

/usr/share/logstash/bin/logstash --version
logstash 6.6.2

In /var/log/messages I can see:

Mar 22 18:48:05 XXXXXXXXXX systemd: Started Elasticsearch.

Mar 22 18:59:28 XXXXXXXXXX logstash: [2022-03-22T18:59:28,054][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://X.X.X.X:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://X.X.X.X:9200/, :error_message=>"Elasticsearch Unreachable: [http://X.X.X.X:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
Mar 22 18:59:28 XXXXXXXXXX logstash: [2022-03-22T18:59:28,053][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://X.X.X.X:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://X.X.X.X:9200/, :error_message=>"Elasticsearch Unreachable: [http://X.X.X.X:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
Mar 22 18:59:28 XXXXXXXXXX logstash: [2022-03-22T18:59:28,054][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://X.X.X.X:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
Mar 22 18:59:28 XXXXXXXXXX logstash: [2022-03-22T18:59:28,055][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://X.X.X.X:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
Mar 22 18:59:30 XXXXXXXXXX logstash: [2022-03-22T18:59:30,060][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
Mar 22 18:59:30 XXXXXXXXXX logstash: [2022-03-22T18:59:30,063][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
Mar 22 18:59:31 XXXXXXXXXX systemd: elasticsearch.service: main process exited, code=killed, status=6/ABRT
Mar 22 18:59:31 XXXXXXXXXX systemd: Unit elasticsearch.service entered failed state.
Mar 22 18:59:31 XXXXXXXXXX kibana: {"type":"log","@timestamp":"2022-03-22T17:59:31Z","tags":["error","elasticsearch","admin"],"pid":7688,"message":"Request error, retrying\nGET http://localhost:9200/_nodes/_local?filter_path=nodes.*.settings.tribe => read ECONNRESET"}
Mar 22 18:59:31 XXXXXXXXXX systemd: elasticsearch.service failed.
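
The same exit can also be confirmed from systemd directly; a minimal check, assuming the stock RPM unit name "elasticsearch":

# Show how systemd recorded the last exit of the service
systemctl status elasticsearch

# Show the unit's journal entries around the crash window
journalctl -u elasticsearch --since "2022-03-22 18:45" --until "2022-03-22 19:05"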

A JVM fatal error file has been generated, hs_err_pid6489.log:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f0510c3bb52, pid=6489, tid=0x00007f050c89a700
#
# JRE version: OpenJDK Runtime Environment (8.0_201-b09) (build 1.8.0_201-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.201-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x692b52]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x00007f0508116800):  VMThread [stack: 0x00007f050c79b000,0x00007f050c89b000] [id=6913]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000040

The full file:
1: https://pastebin.com/F2vrpJFg
2: https://pastebin.com/DFPfUVyU
3: https://pastebin.com/9sJ38cS0
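
The hs_err header notes that core dumps are disabled. If a core file would be useful, here is a sketch of how to enable them for the systemd-managed service (a shell "ulimit -c unlimited" does not apply to the unit; this assumes the stock RPM unit):

sudo systemctl edit elasticsearch
# in the override file that opens, add:
#   [Service]
#   LimitCORE=infinity
sudo systemctl daemon-reload
sudo systemctl restart elasticsearch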

I have compared the Elasticsearch config files between both VMs; they are the same.
Can anyone help me find a solution?

What is the heap size, and what kind of operations are being done on the impacted node?

It is 1 GB on each VM:
-Xms1g
-Xmx1g
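
For reference, these flags come from /etc/elasticsearch/jvm.options on the RPM install. A quick way to double-check what each VM is configured with, and what the running JVM was actually started with (second command only works while the process is up):

grep -E '^-Xm[sx]' /etc/elasticsearch/jvm.options
ps -o args= -C java | tr ' ' '\n' | grep -E '^-Xm[sx]'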

Each VM receives syslog from the same server.
That server acts as a syslog concentrator for the whole network and forwards the syslog to each ELK VM.

6.x is EOL and you really need to upgrade as a matter of urgency; 8.1 is the latest.

Can you check in /var/log/elasticsearch/ for anything?
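
For example (assuming the default RPM log path and the default cluster name, so the main log is elasticsearch.log):

ls -lrt /var/log/elasticsearch/
grep -iE 'fatal|error|out of memory' /var/log/elasticsearch/elasticsearch.log | tail -n 50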

I know; I plan on doing that as soon as this issue is solved.
In elasticsearch.log I can see a lot of entries about "No shard available" or "all shards failed":

[2022-03-22T18:58:11,746][DEBUG][o.e.a.s.TransportSearchAction] [xjczUU_] All shards failed for phase: [query]
[2022-03-22T18:58:11,747][WARN ][r.suppressed             ] [xjczUU_] path: /.kibana/doc/_count, params: {index=.kibana, type=doc}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:293) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:133) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:254) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:101) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.InitialSearchPhase.access$100(InitialSearchPhase.java:48) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$onFailure$1(InitialSearchPhase.java:221) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.InitialSearchPhase.maybeFork(InitialSearchPhase.java:175) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.InitialSearchPhase.access$000(InitialSearchPhase.java:48) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:221) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:53) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:462) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1103) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1215) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1189) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:60) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.HandledTransportAction$ChannelActionListener.onFailure(HandledTransportAction.java:112) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.search.SearchService$2.onFailure(SearchService.java:368) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:362) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:356) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.2.jar:6.6.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2022-03-22T18:58:12,623][WARN ][o.e.m.j.JvmGcMonitorService] [xjczUU_] [gc][479] overhead, spent [2.1s] collecting in the last [2.9s]
[2022-03-22T18:58:13,706][WARN ][o.e.m.j.JvmGcMonitorService] [xjczUU_] [gc][480] overhead, spent [1s] collecting in the last [1s]
[2022-03-22T18:58:13,712][WARN ][r.suppressed             ] [xjczUU_] path: /.kibana/doc/kql-telemetry%3Akql-telemetry, params: {index=.kibana, id=kql-telemetry:kql-telemetry, type=doc}
org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]
	at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:230) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.onFailure(TransportSingleShardAction.java:217) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.access$1200(TransportSingleShardAction.java:143) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleException(TransportSingleShardAction.java:273) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1103) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1215) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1189) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:60) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.HandledTransportAction$ChannelActionListener.onFailure(HandledTransportAction.java:112) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$1.onFailure(TransportSingleShardAction.java:110) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:744) [elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) [elasticsearch-6.6.2.jar:6.6.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
Caused by: org.elasticsearch.transport.RemoteTransportException: [xjczUU_][10.4.56.32:9300][indices:data/read/get[s]]
Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
	at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:1550) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:911) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:169) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:90) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:82) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:89) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:43) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$1.doRun(TransportSingleShardAction.java:115) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.2.jar:6.6.2]
	... 3 more

In gc.log.0.current I can see a lot of "Full GC (Allocation Failure)" entries:

2022-03-22T18:59:26.622+0100: 679,987: [CMS-concurrent-mark-start]
2022-03-22T18:59:26.625+0100: 679,990: [Full GC (Allocation Failure) 2022-03-22T18:59:26.625+0100: 679,990: [CMS2022-03-22T18:59:26.884+0100: 680,250: [CMS-concurrent-mark: 0,262/0,263 secs] [Times: user=0,27 sys=0,00, real=0,26 secs] 
 (concurrent mode failure): 707836K->707751K(707840K), 0,9708608 secs] 1014524K->1014299K(1014528K), [Metaspace: 82726K->82726K(1126400K)], 0,9709615 secs] [Times: user=0,97 sys=0,00, real=0,97 secs] 
2022-03-22T18:59:27.596+0100: 680,961: Total time for which application threads were stopped: 0,9714508 seconds, Stopping threads took: 0,0001083 seconds
2022-03-22T18:59:27.597+0100: 680,962: [Full GC (Allocation Failure) 2022-03-22T18:59:27.597+0100: 680,962: [CMS: 707751K->707751K(707840K), 0,6939964 secs] 1014364K->1014316K(1014528K), [Metaspace: 82726K->82726K(1126400K)], 0,6940848 secs] [Times: user=0,70 sys=0,00, real=0,70 secs] 
2022-03-22T18:59:28.291+0100: 681,656: [Full GC (Allocation Failure) 2022-03-22T18:59:28.291+0100: 681,656: [CMS: 707751K->707654K(707840K), 0,8603461 secs] 1014316K->1014078K(1014528K), [Metaspace: 82726K->82726K(1126400K)], 0,8604358 secs] [Times: user=0,85 sys=0,00, real=0,86 secs] 
2022-03-22T18:59:29.152+0100: 682,517: Total time for which application threads were stopped: 1,5550843 seconds, Stopping threads took: 0,0001108 seconds
2022-03-22T18:59:29.161+0100: 682,526: [Full GC (Allocation Failure) 2022-03-22T18:59:29.161+0100: 682,526: [CMS: 707833K->707690K(707840K), 0,8402205 secs] 1014521K->1014158K(1014528K), [Metaspace: 82734K->82734K(1126400K)], 0,8403201 secs] [Times: user=0,85 sys=0,00, real=0,84 secs] 
2022-03-22T18:59:30.001+0100: 683,366: Total time for which application threads were stopped: 0,8443637 seconds, Stopping threads took: 0,0035616 seconds
2022-03-22T18:59:30.004+0100: 683,369: [Full GC (Allocation Failure) 2022-03-22T18:59:30.004+0100: 683,369: [CMS: 707708K->707693K(707840K), 0,7008103 secs] 1014396K->1014199K(1014528K), [Metaspace: 82743K->82743K(1126400K)], 0,7009064 secs] [Times: user=0,70 sys=0,00, real=0,70 secs] 
2022-03-22T18:59:30.705+0100: 684,070: Total time for which application threads were stopped: 0,7013744 seconds, Stopping threads took: 0,0001182 seconds
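
While the node is still responding, heap usage can also be checked with the nodes stats API (a quick example, assuming it listens on localhost:9200 as above):

curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty&human' | grep -E 'heap_used_percent|heap_max'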

You definitely need to increase the heap; 1 GB is quite low. You can give Elasticsearch somewhere between 25% and 50% of the machine's RAM (anywhere from 4 GB up to a maximum of about 31 GB), depending on how much RAM you have.
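
For example, a sketch only (the exact value depends on the VM's RAM; treat 4g as a placeholder for a machine with 8 GB):

# /etc/elasticsearch/jvm.options
-Xms4g
-Xmx4g

Then restart the service (sudo systemctl restart elasticsearch) and watch whether the Full GC storms stop.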

We'd need to see more logs, please; nothing in there indicates a crash of any kind, or an error.

Thanks for the recommendation.
I will do that on both VMs as soon as this issue is solved.

Here is all of messages.log
1: https://pastebin.com/PdU3rVUp
2: https://pastebin.com/NaPzWQ8U
3: https://pastebin.com/EXfuGifa
4: https://pastebin.com/ATbGp9XB

Elasticsearch.log
1: https://pastebin.com/7zDQprWe
2: https://pastebin.com/U08qvxP0
3: https://pastebin.com/DJq414CT
4: https://pastebin.com/gGE4rEqb
5: https://pastebin.com/hctLABwf

gc.log.0.current
1: https://pastebin.com/9DyyzPSn
2: https://pastebin.com/FnDjqtiy

Let me know if you want some other logs.

Thanks. There's no evidence of a crash in there that I can see.
If you're really running 6.6.2 you need to upgrade; it's well past EOL, as I mentioned.

Can you provide the output from the _cluster/stats?pretty&human API?
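
That is, once Elasticsearch is responding again, something like:

curl -s 'http://localhost:9200/_cluster/stats?pretty&human'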

I agree with you; 6.6.2 will be upgraded, but first I need to understand why the other 6.6.2 with the same config runs smoothly.
As Elasticsearch has crashed, the request to _cluster/stats?pretty&human is refused.
Is there another way to get the information without curl?
