Hi,
we are experimenting with setting up an ELK cluster on Azure Container Instances. We successfully set up the cluster and started using it to ingest log entries generated by our distributed system, with an Azure Event Hub as the source and a custom application that reads the events and indexes them into the ELK cluster.
Everything seems to work fine, but we are experiencing some annoying problems in which the .kibana_8.4.1_001 and .kibana_task_manager_8.4.1_001 indices somehow get corrupted.
We are observing the following sequence of warnings in the ES nodes' logs (the full log is at the bottom of this post):
[2022-12-27T20:16:51,061][WARN ][o.e.i.e.Engine ] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] failed to rollback writer on close
java.nio.file.NoSuchFileException: /bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn.cfs
[2022-12-27T20:16:51,065][WARN ][o.e.i.e.Engine ] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] failed engine [refresh failed source[api]]
java.io.IOException: read past EOF: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 2331: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm")
[2022-12-27T20:16:51,155][WARN ][o.e.i.c.IndicesClusterStateService] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] marking and sending shard failed due to [shard failure, reason [refresh failed source[api]]]
java.io.IOException: read past EOF: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 2331: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm")
[2022-12-28T04:54:50,996][WARN ][o.e.t.ThreadPool ] [fs-sdlc-elasticsearch-003] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@632f54b3] on thread pool [same]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
[2022-12-28T04:54:51,903][WARN ][o.e.i.e.Engine ] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] failed to rollback writer on close
java.nio.file.NoSuchFileException: /bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_klnn.kdd
[2022-12-28T04:54:51,904][WARN ][o.e.i.e.Engine ] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] failed engine [refresh failed source[schedule]]
java.io.IOException: read past EOF: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_klpp_3.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 2714: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_klpp_3.fnm")
[2022-12-28T04:54:52,006][WARN ][o.e.i.IndexService ] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001] failed to run task refresh - suppressing re-occurring exceptions unless the exception changes
org.elasticsearch.index.engine.RefreshFailedEngineException: Refresh failed
[2022-12-28T04:54:52,006][WARN ][o.e.i.c.IndicesClusterStateService] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] marking and sending shard failed due to [shard failure, reason [refresh failed source[schedule]]]
java.io.IOException: read past EOF: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_klpp_3.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 2714: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_klpp_3.fnm")
[2022-12-28T04:55:13,064][WARN ][o.e.i.r.PeerRecoveryTargetService] [fs-sdlc-elasticsearch-003] error while listing local files, recovering as if there are none
org.apache.lucene.index.CorruptIndexException: failed engine (reason: [refresh failed source[schedule]]) (resource=preexisting_corruption)
[2022-12-28T13:37:13,189][WARN ][o.e.t.ThreadPool ] [fs-sdlc-elasticsearch-003] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@632f54b3] on thread pool [same]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
[2022-12-28T13:37:13,188][WARN ][o.e.i.e.Engine ] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] failed to rollback writer on close
java.nio.file.NoSuchFileException: /bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ko6v.cfs
[2022-12-28T13:37:13,191][WARN ][o.e.i.e.Engine ] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] failed engine [refresh failed source[schedule]]
java.io.IOException: read past EOF: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ko6v_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 2714: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ko6v_1.fnm")
[2022-12-28T13:37:13,335][WARN ][o.e.i.c.IndicesClusterStateService] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001][0] marking and sending shard failed due to [shard failure, reason [refresh failed source[schedule]]]
java.io.IOException: read past EOF: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ko6v_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 2714: NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ko6v_1.fnm")
[2022-12-28T13:37:13,335][WARN ][o.e.i.IndexService ] [fs-sdlc-elasticsearch-003] [.kibana_task_manager_8.4.1_001] failed to run task refresh - suppressing re-occurring exceptions unless the exception changes
org.elasticsearch.index.engine.RefreshFailedEngineException: Refresh failed
We tried manually force-merging and refreshing the .kibana_task_manager_8.4.1_001 index in the hope it would fix the corruption, but this seems to have no effect on the problem.
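For reference, the requests we ran were along these lines (from Kibana Dev Tools; the exact parameters may have differed slightly):

# approximate reconstruction of the calls we issued
POST /.kibana_task_manager_8.4.1_001/_forcemerge
POST /.kibana_task_manager_8.4.1_001/_refresh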
We also set up the store-smb plugin because, although the Docker images we are using (Bitnami) are Linux based, we do not know what host runs them (Windows Server, we suppose?). These images also use OpenJDK, and because of current limitations of Azure Container Instances we have to use Azure File Shares (mounted via SMB) to store data and configuration.
However, the plugin seems to have no effect on the issue. Is this expected, given that the Windows + OpenJDK + SMB combination apparently only affects write performance because the cache manager is bypassed?
The curious thing is that only these two indices seem to be affected, while the others work just fine.
We suspect these indices are frequently auto-merged, especially .kibana_task_manager_8.4.1_001: the 39 documents it contains are updated frequently, which results in a lot of deleted documents and an index size that grows constantly. Looking at the index stats, we also observed that running two Kibana instances at the same time causes several indexing failures on the .kibana_task_manager_8.4.1_001 index; with a single Kibana instance the exceptions still occur, but without indexing failures. We suppose this is due to concurrency and is by design?
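For context, this is roughly how we have been watching the deleted-document count, store size and indexing failures (standard _cat and _stats APIs; expand_wildcards=all is needed because these are hidden indices):

# approximate requests used to observe the symptoms
GET _cat/indices/.kibana*?v&h=index,docs.count,docs.deleted,store.size&expand_wildcards=all
GET /.kibana_task_manager_8.4.1_001/_stats/docs,store,indexing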
We have also observed the shards that make up these two indices becoming unassigned, at which point we are forced to delete and recreate the indices to sort out the problem.
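The next time a shard goes unassigned we can capture the output of the allocation explain API before deleting anything, with a request along these lines (primary shard 0 assumed):

GET _cluster/allocation/explain
{
  "index": ".kibana_task_manager_8.4.1_001",
  "shard": 0,
  "primary": true
}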
We are unsure whether this problem is related to the deployment we chose, since it seems to be limited to these two indices.
Could you please suggest how we can better diagnose it? We would like to keep using Azure Container Instances rather than move to different hosting.
Many thanks,
Sebastiano
Below you can find the cluster layout and the configuration files for one of the ES nodes and one of the Kibana instances.
Cluster configuration:
- 4 Elasticsearch nodes based on the Bitnami Elasticsearch 8.4.1 Docker image, with roles: data, ingest, remote_cluster_client, master
- 2 Kibana instances based on the Bitnami Kibana 8.4.1 Docker image
- 1 Azure Storage Account with an Azure File Share per ES node and Kibana instance to store the data
- 1 Azure Storage Account with an Azure File Share per ES node and Kibana instance to store the configuration
Below you can find the elasticsearch.yml configuration file for one of the ES nodes:
node:
  name: elasticsearch-004
  roles: [ data, ingest, remote_cluster_client, master ]
cluster:
  name: elasticsearch-cluster
discovery:
  type: multi-node
  seed_hosts:
    - elasticsearch-001.xyz.co.uk
    - elasticsearch-002.xyz.co.uk
    - elasticsearch-003.xyz.co.uk
    - elasticsearch-004.xyz.co.uk
network:
  host: 0.0.0.0
http:
  port: 9200
transport:
  port: 9300
path:
  data: /bitnami/elasticsearch/data
node.store.allow_mmap: false
index.store.type: smb_nio_fs
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers: X-Requested-With, X-Auth-Token, Content-Type, Content-Length, Authorization, Access-Control-Allow-Headers, Accept, x-elastic-client-meta
ingest.geoip.downloader.enabled: false
xpack.ml.enabled: false
xpack.security.enabled: true
xpack.security.authc.api_key.enabled: true
xpack.monitoring.collection.enabled: true
# Enable encryption and mutual authentication between cluster nodes
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  client_authentication: required
  keystore.path: certs/elastic-certificates.p12
  truststore.path: certs/elastic-certificates.p12
Below you can find the kibana.yml configuration file for one of the Kibana instances:
path:
  data: /bitnami/kibana/data
pid:
  file: /opt/bitnami/kibana/tmp/kibana.pid
server:
  host: 0.0.0.0
  port: 5601
  name: kibana-001
elasticsearch:
  username: "kibana_system"
  password: "xxx"
  sniffInterval: 1000
  sniffOnStart: true
  sniffOnConnectionFault: true
  hosts:
    - http://elasticsearch-001.xyz.co.uk:9200
    - http://elasticsearch-002.xyz.co.uk:9200
    - http://elasticsearch-003.xyz.co.uk:9200
    - http://elasticsearch-004.xyz.co.uk:9200
xpack.reporting.roles.enabled: false
xpack.security.encryptionKey: "xxx"
xpack.reporting.encryptionKey: "xxx"
xpack.encryptedSavedObjects.encryptionKey: "xxx"
xpack.reporting.kibanaServer.hostname: localhost
xpack.screenshotting.browser.chromium.disableSandbox: true