ELK stack is not stable on Azure

I am running the ELK stack on Azure using the official template.

Intermittently, I get this error in Kibana:

    {"statusCode":500,"error":"Internal Server Error","message":"An internal server error occurred"}

The Elasticsearch server also stops responding at the same time.

When I restart Kibana and Elasticsearch from the Azure portal, everything works fine again. Why is this happening?

The Kibana log file (/var/log/kibana.log) is empty.
The Elasticsearch logs (/var/log/elasticsearch/my-cluster.log) show:

    [2019-02-22T14:57:18,796][DEBUG][o.e.a.s.TransportSearchAction] [data-0] [158940] Failed to execute fetch phase
    org.elasticsearch.transport.RemoteTransportException: [data-0][10.0.40.6:9300][indices:data/read/search[phase/fetch/id]]
    Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [158940]
        at org.elasticsearch.search.SearchService.getExecutor(SearchService.java:520) ~[elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.search.SearchService.runAsync(SearchService.java:374) ~[elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:563) ~[elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:405) ~[elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:586) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:577) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.action.search.SearchTransportService.sendExecuteFetch(SearchTransportService.java:184) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.action.search.SearchTransportService.sendExecuteFetch(SearchTransportService.java:174) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.action.search.FetchSearchPhase.executeFetch(FetchSearchPhase.java:162) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:144) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.action.search.FetchSearchPhase.access$000(FetchSearchPhase.java:44) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:86) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.6.0.jar:6.6.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.0.jar:6.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
    [2019-02-22T14:57:27,354][WARN ][o.e.x.w.e.ExecutionService] [data-0] failed to execute watch [FApLZ524RrSH_PWwO6mYMg_xpack_license_expiration]
    [2019-02-22T14:57:44,655][WARN ][o.e.c.InternalClusterInfoService] [data-0] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout

It looks like the trial license has expired. It's not clear whether this is causing or contributing to the other errors and behaviour you're seeing, though. I would recommend addressing it as soon as possible and then monitoring the cluster again, to start.

@forloop
It's not even 3 days yet. The trial period will end on March 15, 2019.

This message appears to indicate that it may have. What does the Get License API return?
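
For reference, on 6.x the license can be retrieved with a request like this (shown in Kibana Dev Tools form; adapt it to curl against your own endpoint as needed):

    GET /_xpack/license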

This implies that the search context timed out between the query and fetch phases of a search request. Are you using the Scroll API?
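
For context, a scroll search keeps a search context alive on the server between requests. A minimal sketch is below; my-index, the size, and the 1m keep-alive are placeholders:

    POST /my-index/_search?scroll=1m
    {
      "size": 100,
      "query": { "match_all": {} }
    }

    POST /_search/scroll
    {
      "scroll": "1m",
      "scroll_id": "<scroll_id from the previous response>"
    }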

Do you have monitoring configured on the cluster? If so, can you post some details about what that shows?
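
If collection isn't enabled yet, on 6.x it can be switched on with a dynamic cluster setting; this is a sketch assuming you are happy to self-monitor into the same cluster:

    PUT /_cluster/settings
    {
      "persistent": {
        "xpack.monitoring.collection.enabled": true
      }
    }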

More generally, what spec of cluster have you deployed (VM SKUs, topology) and what is your use case? How much data are you indexing/searching?

Get License API:

    {
      "license" : {
        "status" : "active",
        "uid" : "xxxxx",
        "type" : "trial",
        "issue_date" : "2019-02-13T22:38:00.303Z",
        "issue_date_in_millis" : 1550097480303,
        "expiry_date" : "2019-03-15T22:38:00.303Z",
        "expiry_date_in_millis" : 1552689480303,
        "max_nodes" : 1000,
        "issued_to" : "elk-dev",
        "issuer" : "elasticsearch",
        "start_date_in_millis" : -1
      }
    }

I don't know; what do you mean by the Scroll API?

Monitoring:

Elasticsearch: Elasticsearch cluster status is yellow. Allocate missing replica shards.
Node: 1
Indices: 64
Memory: 721.8 MB / 957.7 MB
Total shards: 186
Unassigned Shards: 42
Documents: 31,280,901
Data: 14.6 GB

Kibana: Health is green

I am monitoring a Kubernetes cluster. Currently, I have Metricbeat, Filebeat, and APM Server installed in it.

You should look at fixing these unassigned shards. It's likely because replica shards are configured for the indices but there is only one node, and Elasticsearch will not assign a replica shard to the same node that holds the primary shard.
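
If everything has to stay on this single node for now, one way to clear the unassigned replicas is to drop the replica count to zero; the sketch below applies it to all indices (adjust the index pattern as needed), and the allocation explain API will tell you exactly why a given shard is unassigned:

    PUT /_all/_settings
    {
      "index": {
        "number_of_replicas": 0
      }
    }

    GET /_cluster/allocation/explain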

  • What VM SKU are you running Elasticsearch on?

  • What typical load is the cluster under - what does GET /_nodes/stats return? (See the example request after this list.)

  • Do you have monitoring enabled on the cluster, and have those monitoring metrics indexed somewhere, perhaps on the cluster itself? If so, can you share a screenshot?

  • What disk topology has the node been deployed with?
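
For the node stats question above, a trimmed request focusing on the most relevant sections looks like this (the metric filters are standard; trim further as you see fit):

    GET /_nodes/stats/jvm,os,fs,indices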

Your single node only has a 1GB heap with 64 indices, 186 shards, and 31.2 million documents spread across them. Seeing some monitoring metrics for the cluster would definitely help here, but at first glance, this seems very underpowered just for the number of indices and shards that you have.

The amount of heap used is around 75% of the heap allocated, meaning the node is likely experiencing memory pressure and frequent garbage collection. I would recommend using a more powerful VM SKU with more RAM so that more heap can be allocated to Elasticsearch. In addition, if the single VM running the process goes down, the cluster will be inoperable; you may want to consider running a cluster with a minimum of three master-eligible nodes to provide high availability and avoid split-brain scenarios. With a sufficient number of replica shards for each index's primary shards, if a node were to go down, the cluster would continue to operate.
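
As a rough sketch of the heap change (the 4g figure is an assumption and presumes a VM with at least 8 GB of RAM; the usual guidance is to give Elasticsearch no more than half the machine's RAM and to keep -Xms equal to -Xmx):

    # /etc/elasticsearch/jvm.options (default location for package installs)
    -Xms4g
    -Xmx4g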
