Elasticsearch stops working after some time

My Elasticsearch server works fine for a few hours or a day and then suddenly stops working. It is a single node installed on a VPS alongside the application server, with only 1 index (30,000 documents) configured with 1 shard and 1 replica.
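
For reference, the index and cluster status can be checked with the standard APIs (the node listens on the default localhost:9200):

    curl 'localhost:9200/_cat/indices?v'
    curl 'localhost:9200/_cluster/health?pretty'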

My Configuration:

  • VPS: 1 CPU Core, 2GB RAM
  • Ubuntu 20.10
  • ElasticSearch Version: 7.8.0
  • Heap Size: -Xms1g -Xmx1g
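
For reference, the heap settings above are the ones in /etc/elasticsearch/jvm.options (assuming the default path of the Debian/Ubuntu package):

    # /etc/elasticsearch/jvm.options
    -Xms1g
    -Xmx1g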

When I checked the logs, it seems the Elasticsearch server stops after the health check.

    [2021-03-29T01:30:00,007][INFO ][o.e.x.m.MlDailyMaintenanceService] [node-1] triggering scheduled [ML] maintenance tasks
    [2021-03-29T01:30:00,032][INFO ][o.e.x.s.SnapshotRetentionTask] [node-1] starting SLM retention snapshot cleanup task
    [2021-03-29T01:30:00,084][INFO ][o.e.x.s.SnapshotRetentionTask] [node-1] there are no repositories to fetch, SLM retention snapshot cleanup task complete
    [2021-03-29T01:30:00,232][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [node-1] Deleting expired data
    [2021-03-29T01:30:00,611][INFO ][o.e.x.m.j.r.UnusedStatsRemover] [node-1] Successfully deleted [0] unused stats documents
    [2021-03-29T01:30:00,621][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [node-1] Completed deletion of expired ML data
    [2021-03-29T01:30:00,622][INFO ][o.e.x.m.MlDailyMaintenanceService] [node-1] Successfully completed [ML] maintenance task: triggerDeleteExpiredDataTask
    [2021-03-29T02:38:45,814][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][60425] overhead, spent [423ms] collecting in the last [1s]
    [2021-03-29T14:02:17,728][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [12258ms] which is above the warn threshold of [5s]
    [2021-03-29T14:07:46,549][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [5140ms] which is above the warn threshold of [5s]
    [2021-03-29T14:09:17,396][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][101248] overhead, spent [553ms] collecting in the last [1.7s]   

Another log:

    [2021-03-31T02:01:56,154][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[reports][0]]]).
    [2021-03-31T06:08:19,126][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [6940ms] which is above the warn threshold of [5s]
    [2021-03-31T06:54:45,818][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][young][16670][19] duration [2.7s], collections [1]/[26.8s], total [2.7s]/[4.4s], memory [694.9mb]->[90.9mb]/[1gb], all_pools {[young] [604mb]->[0b]/[0b]}{[old] [89.9mb]->[89.>
    [2021-03-31T07:03:44,953][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [7148ms] which is above the warn threshold of [5s]
    [2021-03-31T07:11:03,918][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][young][16975][20] duration [11.3s], collections [1]/[12s], total [11.3s]/[15.7s], memory [130.9mb]->[90.8mb]/[1gb], all_pools {[young] [40mb]->[0b]/[0b]}{[old] [89.9mb]->[89.>
    [2021-03-31T07:11:06,610][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][16975] overhead, spent [11.3s] collecting in the last [12s]
    [2021-03-31T07:28:04,708][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [5557ms] which is above the warn threshold of [5s]
    [2021-03-31T07:30:30,545][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [5035ms] which is above the warn threshold of [5s]
    [2021-03-31T07:35:07,502][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [5732ms] which is above the warn threshold of [5s]
    [2021-03-31T07:35:12,985][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][young][17163][21] duration [4.9s], collections [1]/[3.4s], total [4.9s]/[20.6s], memory [126.8mb]->[130.8mb]/[1gb], all_pools {[young] [36mb]->[0b]/[0b]}{[old] [89.9mb]->[89.>
    [2021-03-31T07:35:16,582][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][17163] overhead, spent [4.9s] collecting in the last [3.4s]
    [2021-03-31T07:44:37,323][WARN ][o.e.h.AbstractHttpServerTransport] [node-1] handling request [null][POST][/reports/_count][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:37814}] took [9836ms] which is above the warn thresho>
    [2021-03-31T07:51:15,633][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [7832ms] which is above the warn threshold of [5s]
    [2021-03-31T08:00:57,701][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [7313ms] which is above the warn threshold of [5s]
    [2021-03-31T08:05:13,225][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [5179ms] which is above the warn threshold of [5s]
    [2021-03-31T08:07:50,096][WARN ][o.e.m.f.FsHealthService  ] [node-1] health check of [/var/lib/elasticsearch/nodes/0] took [6490ms] which is above the warn threshold of [5s]
    [2021-03-31T08:19:56,215][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][young][17648][23] duration [1.1s], collections [1]/[1.4s], total [1.1s]/[21.9s], memory [131mb]->[91mb]/[1gb], all_pools {[young] [40mb]->[0b]/[0b]}{[old] [89.9mb]->[89.9mb]/>
    [2021-03-31T08:19:56,957][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][17648] overhead, spent [1.1s] collecting in the last [1.4s]
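
To watch heap usage and GC counts while this is happening, the JVM section of the nodes stats API can be polled, e.g.:

    curl 'localhost:9200/_nodes/stats/jvm?pretty'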

I am not sure what the reason could be. Please suggest how I can solve this issue.

There's nothing in those logs that shows Elasticsearch shutting down. If there is part of the log that shows that, please post it.
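
If it is running under systemd, the service status and journal usually show how the process ended, e.g.:

    systemctl status elasticsearch.service
    journalctl -u elasticsearch.service --since "2 days ago"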

Maybe this will help:

    elasticsearch.service - Elasticsearch
         Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
         Active: failed (Result: signal) since Tue 2021-04-06 19:15:19 UTC; 1 day 10h ago
           Docs: https://www.elastic.co
       Main PID: 727 (code=killed, signal=KILL)
    Apr 06 19:00:57 products systemd[1]: Starting Elasticsearch...
    Apr 06 19:02:56 products systemd[1]: Started Elasticsearch.
    Apr 06 19:15:19 products systemd[1]: elasticsearch.service: Main process exited, code=killed, status=9/KILL
    Apr 06 19:15:19 products systemd[1]: elasticsearch.service: Failed with result 'signal'.
    Apr 06 19:15:19 products systemd[1]: elasticsearch.service: Unit process 1519 (controller) remains running after unit stopped.

Looks to me like something (someone, some process, some scanner) is sending a

    kill -9

i.e. sending a kill command to the Elasticsearch process.

For another user, a security scanner was killing unrecognized processes... could it be something like that?
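
One way to catch whatever is sending the signal, assuming auditd is available, is to audit kill(2) syscalls carrying SIGKILL and then search the audit log:

    auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k kill9
    ausearch -k kill9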

Could it be because of insufficient memory? Maybe Linux or the JVM itself is killing the Elasticsearch process.

My configuration is:

  • VPS: 1 CPU Core, 2GB RAM
  • Ubuntu 20.10
  • ElasticSearch Version: 7.8.0
  • Heap Size: -Xms1g -Xmx1g

Everything is running on the same VPS machine, web server, MySQL, elastic, etc.

Elasticsearch worked fine on my development server.
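
To see what is actually using the 2 GB on a box like this, standard tools are enough:

    free -h
    ps aux --sort=-%mem | head -n 10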

Yes, that could be it. That is a very small server to be running all of that on... less than my phone 🙂

The reason I asked is that the log says kill -9 (status=9/KILL), which is different from a process just dying on its own.

So I'm not sure... there could be some other process taking more resources, causing Elasticsearch to die. I am not sure of the exit code when Elasticsearch runs out of memory.

It could be the OOM killer, I guess?
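
If it is the OOM killer, the kernel log will show it; checking is something like:

    journalctl -k | grep -i "out of memory"
    dmesg -T | grep -i "killed process"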

Finally, I got this issue resolved by increasing the server memory from 2 GB to 4 GB. Due to insufficient memory on the VPS, the kernel itself was killing the JVM process, which ultimately shut down the Elasticsearch server.

I also decreased the heap size from 1 GB to 750 MB.
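
i.e. in jvm.options (same file as above):

    -Xms750m
    -Xmx750m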

The kernel log confirmed it:

    Apr 10 10:55:43 products kernel: Out of memory: Killed process 728 (java)

Thank you, Stephen and David, for the help. I really appreciate it.

You should definitely avoid that if this is meant for production. For test purposes, I guess it's fine.

But glad you solved the problem.

To echo @dadoonet: in production, only Elasticsearch should run on a single VM... no other applications...