Elasticsearch service frequently fails to start

Elasticsearch had been running well, but for the past week the service has repeatedly failed to start with the error shown below. What is the cause?

● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2019-05-24 16:14:50 PDT; 350ms ago
Docs: http://www.elastic.co
Process: 141045 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet (code=exited, status=1/FAILURE)
Main PID: 141045 (code=exited, status=1/FAILURE)

May 24 16:14:48 namenode systemd[1]: Started Elasticsearch.
May 24 16:14:49 namenode elasticsearch[141045]: OpenJDK 64-Bit Server VM warning: INFO: os...2)
May 24 16:14:49 namenode elasticsearch[141045]: #
May 24 16:14:49 namenode elasticsearch[141045]: # There is insufficient memory for the Jav...e.
May 24 16:14:49 namenode elasticsearch[141045]: # Native memory allocation (mmap) failed t...y.
May 24 16:14:49 namenode elasticsearch[141045]: # An error report file with more informati...s:
May 24 16:14:49 namenode elasticsearch[141045]: # /var/log/elasticsearch/hs_err_pid141045.log
May 24 16:14:50 namenode systemd[1]: elasticsearch.service: main process exited, code=exi...URE
May 24 16:14:50 namenode systemd[1]: Unit elasticsearch.service entered failed state.
May 24 16:14:50 namenode systemd[1]: elasticsearch.service failed.
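
The status output says the JVM could not get the native memory it asked for at startup. One rough first check is to compare the configured heap size against the memory the host currently has available. The sketch below is only an illustration: it assumes the heap is set with `-Xms` in a `jvm.options` file (commonly `/etc/elasticsearch/jvm.options` on package installs) and that `/proc/meminfo` exposes `MemAvailable`. Neither file is shown in this thread, so the function names and sample values are hypothetical. The real details are in the `hs_err_pid141045.log` file named above.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xms / -Xmx.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_heap_bytes(jvm_options_text, flag="-Xms"):
    """Return the heap size in bytes from the last uncommented `flag` line, or None."""
    size = None
    for line in jvm_options_text.splitlines():
        line = line.strip()
        # Commented lines start with '#', so they never match.
        match = re.fullmatch(re.escape(flag) + r"(\d+)([kKmMgG]?)", line)
        if match:
            number, unit = match.groups()
            size = int(number) * _UNITS.get(unit.lower(), 1)
    return size


def parse_mem_available_bytes(meminfo_text):
    """Return MemAvailable from /proc/meminfo text in bytes, or None if absent."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) * 1024 if match else None


def heap_fits(jvm_options_text, meminfo_text):
    """True/False if both values were found, None if either is missing.

    This is a lower bound only: the JVM needs memory beyond the heap.
    """
    heap = parse_heap_bytes(jvm_options_text)
    available = parse_mem_available_bytes(meminfo_text)
    if heap is None or available is None:
        return None
    return heap <= available
```

To use it on the affected node, read the two files and pass their contents to `heap_fits`. A `False` result would suggest that something else on the host (the hostname `namenode` hints at other services) is holding the memory Elasticsearch wants.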

When I looked into the error, this is what I found in the log. How can I solve this problem?

======================================
[2019-05-23T14:31:33,271][INFO ][o.e.m.j.JvmGcMonitorService] [master-1] [gc][25243] overhead, spent [480ms] collecting in the last [1s]
[2019-05-23T14:31:52,275][INFO ][o.e.m.j.JvmGcMonitorService] [master-1] [gc][25262] overhead, spent [437ms] collecting in the last [1s]
[2019-05-23T14:32:14,280][INFO ][o.e.m.j.JvmGcMonitorService] [master-1] [gc][25284] overhead, spent [333ms] collecting in the last [1s]
[2019-05-23T16:28:35,122][ERROR][o.e.x.m.c.n.NodeStatsCollector] [master-1] collector [node_stats] timed out when collecting data
[2019-05-23T16:28:35,122][TRACE][o.e.d.z.MasterFaultDetection] [master-1] [master] failed to ping [{master-2}{6FFU3pjyTk-vxLDPmRG9mQ}{gHg5i8N9RMaqPiQ7lbciCA}{datanode1}{172.15.7.171:9300}{xpack.installed=true}], retry [1] out of [3]
org.elasticsearch.transport.RemoteTransportException: [master-2][172.15.7.171:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-05-23T16:28:35,472][TRACE][o.e.d.z.MasterFaultDetection] [master-1] [master] failed to ping [{master-2}{6FFU3pjyTk-vxLDPmRG9mQ}{gHg5i8N9RMaqPiQ7lbciCA}{datanode1}{172.15.7.171:9300}{xpack.installed=true}], retry [2] out of [3]
org.elasticsearch.transport.RemoteTransportException: [master-2][172.15.7.171:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-05-23T16:28:35,473][TRACE][o.e.d.z.MasterFaultDetection] [master-1] [master] failed to ping [{master-2}{6FFU3pjyTk-vxLDPmRG9mQ}{gHg5i8N9RMaqPiQ7lbciCA}{datanode1}{172.15.7.171:9300}{xpack.installed=true}], retry [3] out of [3]
org.elasticsearch.transport.RemoteTransportException: [master-2][172.15.7.171:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-05-23T16:28:35,474][DEBUG][o.e.d.z.MasterFaultDetection] [master-1] [master] failed to ping [{master-2}{6FFU3pjyTk-vxLDPmRG9mQ}{gHg5i8N9RMaqPiQ7lbciCA}{datanode1}{172.15.7.171:9300}{xpack.installed=true}], tried [3] times, each with maximum [30s] timeout
[2019-05-23T16:28:35,475][DEBUG][o.e.d.z.MasterFaultDetection] [master-1] [master] stopping fault detection against master [{master-2}{6FFU3pjyTk-vxLDPmRG9mQ}{gHg5i8N9RMaqPiQ7lbciCA}{datanode1}{172.15.7.171:9300}{xpack.installed=true}], reason [master failure, failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-05-23T16:28:35,476][INFO ][o.e.d.z.ZenDiscovery ] [master-1] master_left [{master-2}{6FFU3pjyTk-vxLDPmRG9mQ}{gHg5i8N9RMaqPiQ7lbciCA}{datanode1}{172.15.7.171:9300}{xpack.installed=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-05-23T16:28:35,476][WARN ][o.e.d.z.ZenDiscovery ] [master-1] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{data-6}{udjl1VPMTl6DypGF2WaGEw}{oCnRQFHmRm-rvgbCxFd9SQ}{datanode8}{172.15.7.178:9300}{xpack.installed=true}
{master-1}{ExWuwn0FQDy1g-wQiGCxRQ}{HFQg4YVHQGOhmKOcLrGiYA}{namenode}{172.15.7.170:9300}{xpack.installed=true}, local
{data-4}{3kQKVfs5Qp-XMQCgrAFlGg}{CwOnKgbMRq2rBeBLnzA8LQ}{datanode6}{172.15.7.176:9300}{xpack.installed=true}
{master-3}{U7VV3XIMRruqH35zBgU1Sg}{JqI2pIvqReua-truQRlKBw}{datanode2}{172.15.7.172:9300}{xpack.installed=true}
{master-2}{6FFU3pjyTk-vxLDPmRG9mQ}{gHg5i8N9RMaqPiQ7lbciCA}{datanode1}{172.15.7.171:9300}{xpack.installed=true}, master
{data-5}{BxDcy_PJTgSy40AwqKtzUA}{k0LFtSfkRiui5sAdazYYbA}{datanode7}{172.15.7.177:9300}{xpack.installed=true}
{data-1}{sz37AjCFQkyrdzudPmaLRw}{s81JWCWlRJ-b9_PM842aYA}{datanode3}{172.15.7.173:9300}{xpack.installed=true}
{data-2}{xKAxXYUxSyKxAPhuLpFj5A}{i56_rGcsQneio01d2GCPXg}{datanode4}{172.15.7.174:9300}{xpack.installed=true}
{data-3}{7PIdFIZRTZq0Nf7136FgJw}{azI3VgOlQgyneIvcVMqluw}{datanode5}{172.15.7.175:9300}{xpack.installed=true}

[2019-05-23T16:28:35,484][INFO ][o.e.x.w.WatcherService ] [master-1] stopping watch service, reason [no master node]
[2019-05-23T16:28:35,123][WARN ][o.e.t.n.Netty4Transport ] [master-1] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/172.15.7.171:34884}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-05-23T16:28:35,123][WARN ][o.e.t.n.Netty4Transport ] [master-1] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/172.15.7.171:34880}]
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:?]
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[?:?]
at sun.nio.ch.IOUtil.write(IOUtil.java:51) ~[?:?]
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:388) ~[netty-transport-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
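
The `[gc] ... overhead` lines at the top of this log already hint at memory pressure: the node spent roughly a third to a half of each second in garbage collection. A small sketch for pulling those fractions out of a log file follows. The regular expression is an assumption based only on the three lines shown here (whole milliseconds, whole seconds), so it may miss lines that use other units.

```python
import re

# Matches lines such as:
#   [gc][25243] overhead, spent [480ms] collecting in the last [1s]
GC_LINE = re.compile(
    r"\[gc\]\[\d+\] overhead, spent \[(\d+)ms\] collecting in the last \[(\d+)s\]"
)


def gc_overhead_fractions(log_text):
    """Return the fraction of each reported window that was spent in GC."""
    fractions = []
    for match in GC_LINE.finditer(log_text):
        spent_ms = int(match.group(1))
        window_s = int(match.group(2))
        fractions.append(spent_ms / (window_s * 1000))
    return fractions
```

For the three lines above this gives 0.48, 0.437 and 0.333, that is, 33% to 48% of wall-clock time in GC.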

Please format your code, logs or configuration files using </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

It looks to me like an overloaded cluster.

What is the output of:

GET /_cat/health?v
GET /_cat/indices?v
GET /_cat/shards?v
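
If Kibana's console is not at hand, the same three requests can be sent with a few lines of Python. This is a minimal sketch that assumes the node answers on plain HTTP without authentication. The base URL `http://localhost:9200` is an assumption, and `cat_url` and `fetch_cat` are hypothetical helper names.

```python
import urllib.request


def cat_url(base_url, endpoint):
    """Build the URL for a _cat endpoint with column headers enabled (?v)."""
    return f"{base_url.rstrip('/')}/_cat/{endpoint}?v"


def fetch_cat(base_url, endpoint, timeout=10):
    """Return the plain-text body of GET /_cat/<endpoint>?v."""
    with urllib.request.urlopen(cat_url(base_url, endpoint), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_cat("http://localhost:9200", name)` for `health`, `indices` and `shards` returns the same text the console would show.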

Hi David,

Thanks for your reply. Here is what I get:

GET /_cat/health?v:
epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1558840297 20:11:37  es-prod green           9         6   1060 530    0    0        0             0                  -                100.0%
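
For anyone who wants to work with this output in a script, the header and row can be split on whitespace into a dictionary. The sketch below assumes no value contains spaces, which holds for the health output above but is not guaranteed for every `_cat` endpoint. `parse_cat_table` and `shards_per_data_node` are hypothetical helper names.

```python
def parse_cat_table(text):
    """Turn `_cat/...?v` output into a list of dicts keyed by the header row."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def shards_per_data_node(health_row):
    """Average number of shard copies per data node, from one health row."""
    return int(health_row["shards"]) / int(health_row["node.data"])
```

With the row above, `shards_per_data_node` gives 1060 / 6, or about 177 shard copies on each data node.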

The output of the other two commands is too long to post in full, so here are just a few sample rows:

GET /_cat/indices?v:
green  open   ag2_vehicle_stats-2019-04-11    YIcrWW0TRYmESf-j6sTdrg   1   1       4448            0       21mb         10.5mb
green  open   guobiao_daily_stats-2019-05-03  qhDNxubDTZWNfL9hXr8Lyg   1   1      19955            0     94.3mb         47.2mb
green  open   stats1                          ydUqo0miQsO7j3q5nKEAxA   1   1     401719            0      1.7gb        894.9mb
green  open   stats2                          G-jRdEGXQKWjiFWv86TAfA   5   1      14860            0     11.1mb          5.5mb
green  open   stats3                          4rscFwXnTmOTIdlbL85Ytw   1   1       2143            0      468kb          234kb
green  open   stats4                          RHrLpIQEQC2cy--xBHHsOQ   1   1      21995            0    103.5mb

GET /_cat/shards?v:
.monitoring-es-6-2019.05.24           0     p      STARTED  2802517    1.5gb 172.15.7.176 data-4
.monitoring-es-6-2019.05.24           0     r      STARTED  2802517    1.5gb 172.15.7.178 data-6
.monitoring-es-6-2019.05.20           0     p      STARTED  3813089    1.9gb 172.15.7.176 data-4
.monitoring-es-6-2019.05.20           0     r      STARTED  3813089    1.9gb 172.15.7.177 data-5

Could you run:

GET /
GET /_cat/nodes?v
