Elasticsearch crashing every 7 days or so

I'm running SugarCRM, which uses Elasticsearch, and I'm dealing with Elasticsearch crashing every week on my server. I did some digging and found advice online about setting the heap size and various other memory settings for a server running CentOS, but those changes didn't fix the problem.

I've now taken a heap dump (.hprof) and run it through Eclipse Memory Analyzer, but I have no experience in this area. Based on my limited knowledge of the subject, it looks like the system is crashing because of XPackLicenseState. The heap dump's "Leak Suspects" report says the following:

`One instance of "org.elasticsearch.license.XPackLicenseState" loaded by "java.net.FactoryURLClassLoader @ 0xcaac6678" occupies 841,632,904 (82.80%) bytes.  The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>"` 

If I query for the license information I get the following:

{
  "license" : {
    "status" : "active",
    "uid" : "9f86080d-xxxx-xxxx-xxxx-xxxxxxxxxxxx", <--- redacted
    "type" : "basic",
    "issue_date" : "2018-12-28T21:29:07.850Z",
    "issue_date_in_millis" : 1546032547850,
    "max_nodes" : 1000,
    "issued_to" : "sugarcrm",
    "issuer" : "elasticsearch",
    "start_date_in_millis" : -1
  }
}
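
For reference, that output comes from the license endpoint. A minimal sketch of the query, assuming the default localhost:9200 and no authentication:

```python
# Fetch the license info shown above (Elasticsearch 6.x _xpack/license endpoint).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9200/_xpack/license") as resp:
    print(json.dumps(json.load(resp), indent=2))
```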

I am running Elasticsearch 6.4.2 on CentOS 7.
Elasticsearch was installed using RPM
Running as a single node right now
Server is assigned 12 GB of RAM (it is a VM)

When it crashes, the following is written at the end of the log file:

[2019-03-03T20:03:47,789][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][masterService#updateTask][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3181) ~[?:1.8.0_181]
        at java.util.concurrent.CopyOnWriteArrayList.add(CopyOnWriteArrayList.java:440) ~[?:1.8.0_181]
        at org.elasticsearch.license.XPackLicenseState.addListener(XPackLicenseState.java:307) ~[?:?]
        at org.elasticsearch.xpack.security.authz.accesscontrol.OptOutQueryCache.<init>(OptOutQueryCache.java:46) ~[?:?]
        at org.elasticsearch.xpack.security.Security.lambda$onIndexModule$15(Security.java:703) ~[?:?]
        at org.elasticsearch.xpack.security.Security$$Lambda$2704/418795023.apply(Unknown Source) ~[?:?]
        at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:378) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:475) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.indices.IndicesService.verifyIndexMetadata(IndicesService.java:547) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.cluster.metadata.MetaDataUpdateSettingsService$1.execute(MetaDataUpdateSettingsService.java:199) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.4.2.jar:6.4.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Right before the crash, I see a large number of lines like this repeating:

[2019-03-03T20:01:04,696][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][506867] overhead, spent [19s] collecting in the last [19s]
[2019-03-03T20:01:16,613][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][old][506868][12652] duration [11.9s], collections [1]/[11.9s], total [11.9s]/[12.8h], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [15.6mb]->[15.7mb]/[16.6mb]}{[old] [3.8gb]->[3.8gb]/[3.8gb]}
[2019-03-03T20:01:16,614][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][506868] overhead, spent [11.9s] collecting in the last [11.9s]
[2019-03-03T20:01:28,536][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][old][506869][12653] duration [11.9s], collections [1]/[11.9s], total [11.9s]/[12.8h], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [15.7mb]->[15.8mb]/[16.6mb]}{[old] [3.8gb]->[3.8gb]/[3.8gb]}
[2019-03-03T20:01:28,536][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][506869] overhead, spent [11.9s] collecting in the last [11.9s]
[2019-03-03T20:01:59,555][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][old][506870][12655] duration [31s], collections [2]/[31s], total [31s]/[12.8h], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [15.8mb]->[16.4mb]/[16.6mb]}{[old] [3.8gb]->[3.8gb]/[3.8gb]}
[2019-03-03T20:01:59,555][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][506870] overhead, spent [31s] collecting in the last [31s]
[2019-03-03T20:03:14,971][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][old][506871][12657] duration [30.6s], collections [2]/[30.6s], total [30.6s]/[12.8h], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [16.4mb]->[16.1mb]/[16.6mb]}{[old] [3.8gb]->[3.8gb]/[3.8gb]}

Any idea why things are crashing? Is it a license issue or something else?

Clearly you are running out of memory. Can you record your heap usage patterns under normal operation? You can use the nodes stats API (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html) to obtain heap usage statistics for all nodes. You mentioned 12 GB of RAM on the server; how much heap is assigned to the Elasticsearch process? Also, can you share the output of the _cluster/health endpoint (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html)?
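
Something along these lines could log heap usage over time. This is just a rough sketch, assuming Elasticsearch is reachable on localhost:9200 with no authentication:

```python
# Poll the nodes stats API once a minute and print heap usage per node.
# Assumes http://localhost:9200 and no security; adjust as needed.
import json
import time
import urllib.request

URL = "http://localhost:9200/_nodes/stats/jvm"

while True:
    with urllib.request.urlopen(URL) as resp:
        stats = json.load(resp)
    for node in stats["nodes"].values():
        heap = node["jvm"]["mem"]
        print(time.strftime("%Y-%m-%dT%H:%M:%S"),
              node["name"],
              "heap_used_percent=%s" % heap["heap_used_percent"],
              "heap_used_gb=%.2f" % (heap["heap_used_in_bytes"] / 2**30))
    time.sleep(60)
```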

Hi mjunaidmuzammil,

The heap size is set to 4 GB: `-Xms4g` and `-Xmx4g`.
I do not have any other nodes running, just a single node at this time.

Output of cluster health:

{
  "cluster_name" : "sugarcrm",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 440,
  "active_shards" : 440,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 440,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}

Memory usage is as follows:

Link to node stats: GIST

Great idea, thank you for doing so. This looks very much like the problem that #36308 fixes. Can you upgrade to 6.5.3 or later to pick up this fix?

It appears that you are over-allocating shards. A good rule of thumb is to allocate around 20-25 shards per GB of heap.

https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

Under this rule you can allocate around 100 shards with your 4 GB heap. You need to increase your heap, but don't go beyond 50% of the total RAM. You may have to increase your RAM to accommodate all the shards, or try decreasing the shard count. See the sketch below for how the numbers work out.
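
To put concrete numbers on the rule of thumb, here is a rough sketch (again assuming localhost:9200 and no authentication) that pulls the live shard count and heap size and compares them against the 20-25 shards per GB guideline:

```python
# Compare the cluster's shard count against the ~20-25 shards/GB-of-heap guideline.
import json
import urllib.request

BASE = "http://localhost:9200"  # adjust host/port, add auth if needed

def get(path):
    with urllib.request.urlopen(BASE + path) as resp:
        return json.load(resp)

health = get("/_cluster/health")
jvm = get("/_nodes/stats/jvm")

heap_gb = sum(n["jvm"]["mem"]["heap_max_in_bytes"]
              for n in jvm["nodes"].values()) / 2**30
shards = health["active_shards"] + health["unassigned_shards"]

print("shards (active + unassigned): %d" % shards)
print("heap: %.1f GB -> guideline budget: %d-%d shards"
      % (heap_gb, 20 * heap_gb, 25 * heap_gb))
```

With the cluster health output above (440 active shards plus 440 unassigned replicas) and a 4 GB heap, that works out to 880 shards against a budget of roughly 80-100.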

@mjunaidmuzammil I don't think it's a case of too many shards here - there's no way that could explain 800MB of heap (and growing) being retained by an XPackLicenseState. It's a bug in versions 6.4.1 through 6.5.2.
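
For anyone wondering how a license object can end up retaining hundreds of megabytes: the stack trace above shows OptOutQueryCache registering itself as a listener on the shared XPackLicenseState each time an index service is built, and in the affected versions those listeners apparently accumulate for the lifetime of the node. A purely illustrative Python analogue of that pattern (not the actual Elasticsearch code):

```python
# Illustrative analogue of an ever-growing listener list on a shared object.
class LicenseState:                       # stand-in for XPackLicenseState
    def __init__(self):
        self.listeners = []               # stand-in for the CopyOnWriteArrayList

    def add_listener(self, listener):
        self.listeners.append(listener)   # added, never removed

class QueryCache:                         # stand-in for OptOutQueryCache
    def __init__(self, license_state):
        license_state.add_listener(self)  # one more listener per construction

state = LicenseState()
for _ in range(1000):                     # index creations/verifications over time
    QueryCache(state)
print(len(state.listeners))               # grows without bound -> eventual OOM
```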


@DavidTurner Thank you for spotting this. I think I will be downgrading to 6.2.4, which is the officially supported configuration for SugarCRM at this time, so that I don't have any issues down the line when opening tickets. It looks like I should be fine, since this bug was introduced in 6.4.1.

Elasticsearch does not support downgrading, so you would need to reindex all data from the current instance to the new one.
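
If the versions involved support reindex-from-remote in that direction, one way to pull data across is something like the sketch below. The host and index names are made up, and the old host would also need to be listed under reindex.remote.whitelist in the new cluster's elasticsearch.yml:

```python
# Hypothetical reindex-from-remote call issued against the NEW cluster,
# pulling one index from the old instance. Names and hosts are examples only.
import json
import urllib.request

body = json.dumps({
    "source": {
        "remote": {"host": "http://old-es-host:9200"},
        "index": "example_index"
    },
    "dest": {"index": "example_index"}
}).encode()

req = urllib.request.Request(
    "http://localhost:9200/_reindex",  # the new cluster
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```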


@Christian_Dahlqvist That is fine; my dataset is pretty small, and I am in a position where I can delete all the indexes and start from scratch.
