Elasticsearch slows down over time

Hi there,

Our Elasticsearch cluster has started running very slowly. After every restart I can use Kibana for a while, but it gradually slows down and eventually requests time out.

I found the following in the logs:

Sep 20 08:58:02 localhost dockerd[3952]: [2019-09-19T22:58:02,333][WARN ][o.e.m.j.JvmGcMonitorService] [TVag-Jr] [gc][old][1637][55] duration [10.9s], collections [1]/[11.6s], total [10.9s]/[18s], memory [3.7gb]->[3.4gb]/[3.9gb], all_pools {[young] [355.8mb]->[54.9mb]/[532.5mb]}{[survivor] [55.3mb]->[0b]/[66.5mb]}{[old] [3.3gb]->[3.3gb]/[3.3gb]}

So I tried to increase the heap size, but it didn't help.
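
For what it's worth, the node's heap usage can be watched while things slow down with something like the following (same endpoint as the cluster health call below); a heap_used_percent stuck in the high 90s would match the long old-generation GC pauses in the log:

curl -XGET 'localhost:8010/_nodes/stats/jvm?pretty'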

Here is some information about our environment:

- There are three containers running Elasticsearch 5.5.1, Kibana 5.5.1 and Logstash 5.3.0.
- RabbitMQ 3.6.0 runs on the VM itself.
- The VM has 8 CPUs and 20GB of dedicated memory.
- The Elasticsearch heap size is set to 10GB.

Cluster status:

[root@eap-elk01 dragan]# curl -XGET 'localhost:8010/_cluster/health?pretty'
{
  "cluster_name" : "docker-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 7233,
  "active_shards" : 7233,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 7233,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}

As you can see there are a lot of shards, and I assume that is the main cause of the issue.
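
In case it helps, the index and shard lists can be pulled with the cat APIs, e.g.:

curl -s 'localhost:8010/_cat/indices?v'
curl -s 'localhost:8010/_cat/shards' | wc -l

(the second command just counts the shard lines rather than printing thousands of them)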

Any help would be much appreciated!

Cheers

As you can see from your error message you are indeed running out of heap, which is crippling your cluster. Increasing the heap is the right thing to do, but you do not have much room on that VM to do so given everything else running on it. Elasticsearch also needs off-heap memory in addition to the heap in order to work well, and the recommendation is to assign no more than 50% of the available RAM to the Elasticsearch heap. Since you are running other services on the VM, the RAM actually available to Elasticsearch is well below 20GB, which means a 10GB heap is too large.
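
One thing worth double-checking: the GC log line you posted reports a total heap of only [3.9gb], so the 10GB setting may not actually be reaching the container. Something along these lines will show what the running node really has (using the same 8010 port as your health check):

curl 'localhost:8010/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max,ram.max'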

You are correct that the very large number of shards is likely contributing to this, and you need to reduce it dramatically. Please read this blog post for some practical guidance.

Thanks for your reply @Christian_Dahlqvist

There is plenty of available memory on my VM:

[root@eap-elk01 dragan]# free -mh
              total        used        free      shared  buff/cache   available
Mem:            17G        8.1G        1.3G        1.7M        8.0G        8.9G
Swap:          2.0G         20M        2.0G

Do you think that allocating more memory to the heap would help? I could allocate more RAM to the VM...

Thanks

You do seem to need more heap than you had when you received the error message you posted above. Increasing from 4GB to 6GB might be enough to give you room to work with the node. A 6GB heap means you should budget around 12GB for Elasticsearch, due to its off-heap memory and page cache requirements. You should also avoid having swap enabled on Elasticsearch nodes.
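
On the host that usually means something like the following sketch (assuming you can turn swap off for the whole VM; the alternative is bootstrap.memory_lock: true in elasticsearch.yml, which in Docker also needs the memlock ulimit raised):

# Disable swap entirely (also remove or comment the swap entry in /etc/fstab to make it permanent)
sudo swapoff -a

# Or at least minimise swapping
sudo sysctl -w vm.swappiness=1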

It looks like the heap size is not the problem. I added some more memory to the VM, 24 GB in total.

              total        used        free      shared  buff/cache   available
Mem:            23G         11G        2.3G        2.4M        9.2G         11G
Swap:          2.0G         81M        1.9G

I also changed heap size:

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms6g
-Xmx6g

But still no good.

Kibana now just returns {"code":"ECONNRESET"}.

Any idea where to go from here?

Thanks

As Christian_Dahlqvist pointed out in a previous reply, you have far too many shards in your 1-node cluster.

The official recommendation is to have fewer than 20 shards per GB of Java heap, and since your data node now has a 6 GB heap you should aim for fewer than 6 * 20 = 120 shards. Your cluster currently has 7233 primary shards (plus the same number of replica shards that can never be assigned on a single node), so roughly 7100 shards too many. Because the cluster state has to keep track of every shard, and each query may open a search thread against each shard, you end up with a very slow cluster that also takes much longer to recover after an outage. You really only have two options to resolve this situation:

  1. Add a lot more data nodes to the cluster, roughly 7233/120 ≈ 60 data nodes in total.
  2. Delete most of the shards, ideally 7000+ to get below 120 on your single data node.

I suspect #2 is the way to go.
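
A quick way to keep an eye on the shard count per node while you work through this (adjust the port to whichever one your container exposes, 8010 or 9200):

curl 'localhost:9200/_cat/allocation?v'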

You have probably over-sharded your indices, i.e. given each index too many shards. Ideally a shard should store 20-40 GB of data, so an index of that size only needs one primary shard. If you have even smaller indices you should merge them by reindexing several small indices into one larger index. But to get started, since the sheer number of shards is slowing down cluster operations, you should first delete the indices you don't need or can re-create later. You need to lower that shard count dramatically.
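
To make that concrete, here is a rough sketch of both approaches; the index names and patterns below are made-up examples, so substitute your own:

# Delete indices you no longer need (wildcards work unless
# action.destructive_requires_name has been enabled)
curl -X DELETE 'localhost:9200/logstash-2017.*'

# Merge small indices into one larger index with a single primary shard
curl -X PUT 'localhost:9200/logstash-2019-q3' -H 'Content-Type: application/json' -d'
{
    "settings" : { "number_of_shards" : 1, "number_of_replicas" : 0 }
}'
curl -X POST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d'
{
    "source" : { "index" : ["logstash-2019.07.01", "logstash-2019.07.02"] },
    "dest" : { "index" : "logstash-2019-q3" }
}'

Once the reindex has been verified, the small source indices can be deleted.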

And as long as you only have one data node you might as well remove all the unassigned replica shards, because they can't be assigned to the same node as the primary shards:

curl -X PUT "http://localhost:9200/_all/_settings" -H "Content-Type: application/json" -d'
{
    "index" : {
        "number_of_replicas": 0
    }
}'
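
Once the replicas are gone, unassigned_shards in the health output should drop to 0 and the cluster should turn green; the same health call you used earlier will confirm it:

curl -XGET 'localhost:9200/_cluster/health?pretty'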

Hope this helps you on the way.


I added more memory and Elasticsearch is stable now, but I will definitely start working on the optimisation.

Thanks @Bernt_Rostad and @Christian_Dahlqvist. Your help is much appreciated.

Cheers,
Dragan

