Cluster down after upgrade

I have three Linux containers as data nodes and three as masters, and the cluster fails with the following error. I can't figure out why.

The whole file system under /s1/elasticsearch/nodes is clean on all containers; I can run a full find /s1 on all of them. It is something else, but I can't figure out what.
The memory I have assigned is 70 GB and the JVM heap is 28 GB. The cluster does not hold that much data either.

All of this happened when I shut the cluster down and upgraded from 7.16.0 to 7.17.1.

I have even done a clean reboot of all of them as well.

Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/s1/elasticsearch/nodes/0/indices/FnWCpye_TG6io5FxzlTx3g/0/index/_1g.fdt")
Caused by: java.io.IOException: Input/output error
Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/s1/elasticsearch/nodes/0/indices/FnWCpye_TG6io5FxzlTx3g/0/index/_1g.fdt")
Caused by: java.io.IOException: Input/output error
Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/s1/elasticsearch/nodes/0/indices/FnWCpye_TG6io5FxzlTx3g/0/index/_1g.fdt")
Caused by: java.io.IOException: Input/output error
java.io.IOException: Input/output error: NIOFSIndexInput(path="/s1/elasticsearch/nodes/0/indices/FnWCpye_TG6io5FxzlTx3g/0/index/_1g_Lucene80_0.dvd")
Caused by: java.io.IOException: Input/output error
[2022-03-29T23:11:39,013][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [elkd01] fatal error in thread [elasticsearch[elkd01][generic][T#4]], exiting
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code

Is there more to the log?

How did you go about doing the upgrade?

I shut down all nodes, upgraded the packages, and restarted all of them.
A simple upgrade.
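For reference, the documented full-cluster-restart procedure disables shard allocation and flushes before shutting the nodes down, and re-enables allocation (by setting the value back to null) once all nodes rejoin. Roughly, in the same Dev Tools syntax used further below:

PUT /_cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.enable" : "primaries"
  }
}

POST /_flush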

There is not much information in the log.

I had all the indices on the data nodes, and all three data nodes were having this issue. What I did was assign the data role to one master as well, which let me bring up Kibana (as the Kibana index started on it).
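A minimal sketch of the elasticsearch.yml change on that master node (assuming the 7.x node.roles syntax; the exact role list depends on your setup):

# elasticsearch.yml on the master that should also hold data
node.roles: [ master, data ]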

Then I executed the following (I was not able to use curl, because none of the indices were there):

PUT /_cluster/settings?flat_settings=true
{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : "10mb"
  }
}

I did this because some other thread suggested it might be causing the problem. (I am not sure, as these containers have enough memory and CPU.)
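(To undo the transient rate limit later, the same API accepts null to clear it:)

PUT /_cluster/settings
{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : null
  }
}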

Then I tried to restart the data nodes, but they kept dying, so I removed a few of the largest indices (I have a backup of them and will restore them at a later time).
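(For example, with the delete index API; the index name below is only a placeholder:)

DELETE /my-large-index-2022.03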

But here is more of the log:

[2022-03-30T01:39:25,516][INFO ][o.e.n.Node               ] [elkd01] initialized
[2022-03-30T01:39:25,517][INFO ][o.e.n.Node               ] [elkd01] starting ...
[2022-03-30T01:39:25,543][INFO ][o.e.x.s.c.f.PersistentCache] [elkd01] persistent cache index loaded
[2022-03-30T01:39:25,544][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [elkd01] deprecation component started
[2022-03-30T01:39:25,630][INFO ][o.e.t.TransportService   ] [elkd01] publish_address {10.59.10.77:9300}, bound_addresses {10.59.10.77:9300}
[2022-03-30T01:39:26,377][INFO ][o.e.b.BootstrapChecks    ] [elkd01] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2022-03-30T01:39:26,380][INFO ][o.e.c.c.Coordinator      ] [elkd01] cluster UUID [fVO-T1osSbeXzNmw4ig00w]
[2022-03-30T01:39:26,877][INFO ][o.e.c.s.ClusterApplierService] [elkd01] master node changed {previous [], current [{elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}]}, added {{elkm03}{02E2DU5BQh6KnKeZQPKFNg}{jqGHTpnKR_u0L3xJNuEU6g}{10.59.10.37}{10.59.10.37:9300}{dhimt}, {elk01}{z7OFjxfdQnS6rrLrUrXC9A}{UYxeE7GIQci1r_NwI4b7yw}{10.59.10.80}{10.59.10.80:9300}, {elkm01}{7PDgW5xYSPmT5jIohBYz-A}{qHbMq_dOR1-NOgNI_N9-hw}{10.59.10.34}{10.59.10.34:9300}{dhimt}, {elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}, {elk02}{eTqjGV6ZQh-RP-p6-zvMBw}{V4oZ1ZjhSCyuQ5dq3rFIfA}{10.59.10.81}{10.59.10.81:9300}}, term: 38, version: 22198, reason: ApplyCommitRequest{term=38, version=22198, sourceNode={elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}{xpack.installed=true, transform.node=true}}
[2022-03-30T01:39:27,044][INFO ][o.e.c.s.ClusterSettings  ] [elkd01] updating [xpack.monitoring.collection.enabled] from [false] to [true]
[2022-03-30T01:39:27,044][INFO ][o.e.i.r.RecoverySettings ] [elkd01] using rate limit [10mb] with [default=10mb, read=0b, write=0b, max=0b]
[2022-03-30T01:39:27,194][INFO ][o.e.x.s.a.TokenService   ] [elkd01] refresh keys
[2022-03-30T01:39:27,361][INFO ][o.e.x.s.a.TokenService   ] [elkd01] refreshed keys
[2022-03-30T01:39:27,434][INFO ][o.e.l.LicenseService     ] [elkd01] license [55905ffe-33d4-4a71-be22-74c517477ae1] mode [basic] - valid
[2022-03-30T01:39:27,435][INFO ][o.e.x.s.a.Realms         ] [elkd01] license mode is [basic], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]
[2022-03-30T01:39:27,436][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [elkd01] Active license is now [BASIC]; Security is enabled
[2022-03-30T01:39:27,445][INFO ][o.e.h.AbstractHttpServerTransport] [elkd01] publish_address {10.59.10.77:9200}, bound_addresses {10.59.10.77:9200}
[2022-03-30T01:39:27,445][INFO ][o.e.n.Node               ] [elkd01] started
[2022-03-30T01:39:58,075][INFO ][o.e.c.s.ClusterApplierService] [elkd01] added {{elkd03}{nuDxSZZmRmGORTAN5Z-6oA}{tnFDjsDDQP-N062ISceGtQ}{10.59.10.79}{10.59.10.79:9300}{dh}}, term: 38, version: 22216, reason: ApplyCommitRequest{term=38, version=22216, sourceNode={elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}{xpack.installed=true, transform.node=true}}
[2022-03-30T01:39:58,742][INFO ][o.e.c.s.ClusterApplierService] [elkd01] added {{elkd02}{PprttoFnS0yWaM_vu9EvyA}{L29PkS2uTBqcsgglHpWyDA}{10.59.10.78}{10.59.10.78:9300}{dh}}, term: 38, version: 22217, reason: ApplyCommitRequest{term=38, version=22217, sourceNode={elkm02}{OL3BNCw6Sx2lrGNGitti8g}{IGHXEiU8SkOgablMWy5hjA}{10.59.10.35}{10.59.10.35:9300}{dhimt}{xpack.installed=true, transform.node=true}}
[2022-03-30T02:02:07,060][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [elkd01] fatal error in thread [elasticsearch[elkd01][generic][T#3]], exiting
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
        at org.apache.lucene.store.DataInput.readBytes(DataInput.java:88) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.nextChunkRequest(RecoverySourceHandler.java:1383) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.nextChunkRequest(RecoverySourceHandler.java:1344) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.getNextRequest(MultiChunkTransfer.java:168) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:131) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.access$000(MultiChunkTransfer.java:48) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:72) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:73) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:83) ~[elasticsearch-7.17.1.jar:7.17.1]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:125) ~[elasticsearch-7.17.1.jar:7.17.1]

and it is repeated many times over: "java.lang.InternalError........."

Finally, it stayed up for a long time and then suddenly died with the same error again, so I have stopped allocation:

cluster.routing.allocation.enable: "none"
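For example, applied as a transient cluster setting via the API:

PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}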

Are you running Docker?

No, but something similar: Proxmox LXC containers.
