Cluster down after upgrade

I have three Linux containers as data nodes and three as masters, and the cluster fails with the following error. I can't figure out why.

The whole file system under /s1/elasticsearch/nodes is clean on all containers; I can run a full find /s1 on all of them. It is something else, but I can't figure out what.
The memory I have assigned is 70 GB and the JVM heap is 28 GB. The cluster does not hold that much data either.

All of this happened when I shut the cluster down and upgraded from 7.16.0 to 7.17.1.

I have even done a clean reboot of all of them as well.

Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/s1/elasticsearch/nodes/0/indices/FnWCpye_TG6io5FxzlTx3g/0/index/_1g.fdt")
Caused by: java.io.IOException: Input/output error
Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/s1/elasticsearch/nodes/0/indices/FnWCpye_TG6io5FxzlTx3g/0/index/_1g.fdt")
Caused by: java.io.IOException: Input/output error
Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/s1/elasticsearch/nodes/0/indices/FnWCpye_TG6io5FxzlTx3g/0/index/_1g.fdt")
Caused by: java.io.IOException: Input/output error
java.io.IOException: Input/output error: NIOFSIndexInput(path="/s1/elasticsearch/nodes/0/indices/FnWCpye_TG6io5FxzlTx3g/0/index/_1g_Lucene80_0.dvd")
Caused by: java.io.IOException: Input/output error
[2022-03-29T23:11:39,013][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [elkd01] fatal error in thread [elasticsearch[elkd01][generic][T#4]], exiting
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code

Is there more to the log?

How did you go about doing the upgrade?

I shut down all nodes, upgraded the packages, and restarted all of them.
A simple upgrade.
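For reference, the documented full-cluster-restart procedure disables shard allocation and flushes before shutting the nodes down, and re-enables allocation (by setting the value back to null) once all nodes rejoin. Roughly, in the same Dev Tools syntax used further below:

PUT /_cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.enable" : "primaries"
  }
}

POST /_flush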

There is not much information in the log.

I had all the indices on the data nodes, and all three data nodes were having this issue. What I did was assign the data role to one master as well, which let me bring up Kibana (as the Kibana index started on it).
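A minimal sketch of the elasticsearch.yml change on that master node (assuming the 7.x node.roles syntax; the exact role list depends on your setup):

# elasticsearch.yml on the master that should also hold data
node.roles: [ master, data ]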

Then I executed the following (I was not able to use curl, because none of the indices were there):

PUT /_cluster/settings?flat_settings=true
{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : "10mb"
  }
}

I did this because some other thread suggested it might be causing the problem. (I am not sure, as these containers have enough memory and CPU.)
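(To undo the transient rate limit later, the same API accepts null to clear it:)

PUT /_cluster/settings
{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : null
  }
}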

Then I tried to restart the data nodes, but they kept dying, so I removed a few of the largest indices (I have a backup of them and will restore them at a later time).
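(For example, with the delete index API; the index name below is only a placeholder:)

DELETE /my-large-index-2022.03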

But here is more of the log:

[2022-03-30T01:39:25,516][INFO ][o.e.n.Node               ] [elkd01] initialized
[2022-03-30T01:39:25,517][INFO ][o.e.n.Node               ] [elkd01] starting ...
[2022-03-30T01:39:25,543][INFO ][o.e.x.s.c.f.PersistentCache] [elkd01] persistent cache index loaded
[2022-03-30T01:39:25,544][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [elkd01] deprecation component started
[2022-03-30T01:39:25,630][INFO ][o.e.t.TransportService   ] [elkd01] publish_address {10.59.10.77:9300}, bound_addresses {10.59.10.77:9300}
[2022-03-30T01:39:26,377][INFO ][o.e.b.BootstrapChecks    ] [elkd01] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2022-03-30T01:39:26,380][INFO ][o.e.c.c.Coordinator      ] [elkd01] cluster UUID [fVO-T1osSbeXzNmw4ig00w]
[2022-03-30T01:39:26,877][INFO ][o.e.c.s.ClusterApplierService] [elkd01] master node changed {previous [], current [{elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}]}, added {{elkm03}{02E2DU5BQh6KnKeZQPKFNg}{jqGHTpnKR_u0L3xJNuEU6g}{10.59.10.37}{10.59.10.37:9300}{dhimt}, {elk01}{z7OFjxfdQnS6rrLrUrXC9A}{UYxeE7GIQci1r_NwI4b7yw}{10.59.10.80}{10.59.10.80:9300}, {elkm01}{7PDgW5xYSPmT5jIohBYz-A}{qHbMq_dOR1-NOgNI_N9-hw}{10.59.10.34}{10.59.10.34:9300}{dhimt}, {elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}, {elk02}{eTqjGV6ZQh-RP-p6-zvMBw}{V4oZ1ZjhSCyuQ5dq3rFIfA}{10.59.10.81}{10.59.10.81:9300}}, term: 38, version: 22198, reason: ApplyCommitRequest{term=38, version=22198, sourceNode={elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}{xpack.installed=true, transform.node=true}}
[2022-03-30T01:39:27,044][INFO ][o.e.c.s.ClusterSettings  ] [elkd01] updating [xpack.monitoring.collection.enabled] from [false] to [true]
[2022-03-30T01:39:27,044][INFO ][o.e.i.r.RecoverySettings ] [elkd01] using rate limit [10mb] with [default=10mb, read=0b, write=0b, max=0b]
[2022-03-30T01:39:27,194][INFO ][o.e.x.s.a.TokenService   ] [elkd01] refresh keys
[2022-03-30T01:39:27,361][INFO ][o.e.x.s.a.TokenService   ] [elkd01] refreshed keys
[2022-03-30T01:39:27,434][INFO ][o.e.l.LicenseService     ] [elkd01] license [55905ffe-33d4-4a71-be22-74c517477ae1] mode [basic] - valid
[2022-03-30T01:39:27,435][INFO ][o.e.x.s.a.Realms         ] [elkd01] license mode is [basic], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]
[2022-03-30T01:39:27,436][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [elkd01] Active license is now [BASIC]; Security is enabled
[2022-03-30T01:39:27,445][INFO ][o.e.h.AbstractHttpServerTransport] [elkd01] publish_address {10.59.10.77:9200}, bound_addresses {10.59.10.77:9200}
[2022-03-30T01:39:27,445][INFO ][o.e.n.Node               ] [elkd01] started
[2022-03-30T01:39:58,075][INFO ][o.e.c.s.ClusterApplierService] [elkd01] added {{elkd03}{nuDxSZZmRmGORTAN5Z-6oA}{tnFDjsDDQP-N062ISceGtQ}{10.59.10.79}{10.59.10.79:9300}{dh}}, term: 38, version: 22216, reason: ApplyCommitRequest{term=38, version=22216, sourceNode={elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}{xpack.installed=true, transform.node=true}}
[2022-03-30T01:39:58,742][INFO ][o.e.c.s.ClusterApplierService] [elkd01] added {{elkd02}{PprttoFnS0yWaM_vu9EvyA}{L29PkS2uTBqcsgglHpWyDA}{10.59.10.78}{10.59.10.78:9300}{dh}}, term: 38, version: 22217, reason: ApplyCommitRequest{term=38, version=22217, sourceNode={elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}{xpack.installed=true, transform.node=true}}
[2022-03-30T02:02:07,060][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [elkd01] fatal error in thread [elasticsearch[elkd01][generic][T#3]], exiting
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
        at org.apache.lucene.store.DataInput.readBytes(DataInput.java:88) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.nextChunkRequest(RecoverySourceHandler.java:1383) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.nextChunkRequest(RecoverySourceHandler.java:1344) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.getNextRequest(MultiChunkTransfer.java:168) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:131) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.access$000(MultiChunkTransfer.java:48) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:72) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:73) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:83) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:125) ~[elasticsearch-7.17.1.jar:7.17.1]

and it is repeated many times over: "java.lang.InternalError........."

Finally, it stayed up for a long time and then suddenly died with the same error again, so I have stopped allocation:

cluster.routing.allocation.enable: "none"
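For example, applied as a transient cluster setting via the API:

PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}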

Are you running Docker?

No, but something similar: Proxmox LXC containers.
