Please help - ES 2.1.1 cluster randomly crashing

Hi,

I have a two-node Elasticsearch cluster, both nodes running ES 2.1.1, but I have had random crashes where the master node fails and then, some time later, the second node fails as well.

Both nodes are clean ES 2.1.1 installations on Amazon Linux, installed from the RPM, with the cloud-aws, kopf and head plugins. The elasticsearch.yml file has only the minimal configuration needed to enable cluster replication within AWS.

If I restart the Elasticsearch process, the system runs fine for 4+ days and then crashes again.

Honestly, I've looked through the various ES logs and I cannot see any obvious exceptions.

I'm hoping someone here can help me determine why this is happening.

Hi,

How much memory have you given Elasticsearch in /etc/sysconfig/elasticsearch in ES_HEAP_SIZE?

@msimos I have not configured that setting at all, so it is the default. I
have not checked yet, but I will soon and let you know.

If you know what the unconfigured default is, then that is what it will be set to, since I haven't changed it.

Thanks.

@msimos both the nodes crashed again.
I set ES_HEAP_SIZE to 1g, as the nodes have 2 GB of RAM, and restarted both of them.
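For anyone following along, the change itself is just one line in /etc/sysconfig/elasticsearch on the RPM install (path per the RPM layout; adjust if yours differs):

```
# /etc/sysconfig/elasticsearch
# Give Elasticsearch a fixed 1g heap (~50% of the 2 GB RAM on these nodes).
ES_HEAP_SIZE=1g
```

After editing, the node needs a restart (e.g. `sudo service elasticsearch restart` on the RPM/sysvinit layout).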

I would be thankful for any advice on how to determine why these nodes are crashing so frequently.

Please anyone who might be able to help, this keeps happening.

Impossible to know. We have absolutely no information about what you are doing.

  • number of indices
  • number of shards
  • volume
  • queries (using aggs, sorting?)

Is it a demo? Or a real production server?

1g seems pretty low to me.

@dadoonet I'm aware that there is little information here, but rather than dump every log and config into here, I thought I would ask if there is a good place to start.

I'm simply asking for some assistance on why such a simple setup should fail so quietly.

What I am doing is running a freshly installed cluster with 3 plugins to enable replication within the AWS environment. Replication was working, but the cluster crashes at random.

To answer your questions:
- 131 indices
- 1302 shards
- volume? It's about 1,201,000 documents taking up around 1 GB on disk
- queries? I have local Kibana 4.3.1 installations on the ES 2.1.1 nodes, which are the only clients using the cluster.

I'm putting this together as pre-production at the moment, but it's been extremely unstable since I started using 2.x some months ago.

This is all using fresh installations, no upgrades.

1g was chosen per the recommendation in the documentation that ES_HEAP_SIZE should be 50% of available RAM, as I mentioned above.

Too many shards for sure.

It's like running 1000 databases on a single machine with 1 GB of RAM.

Ok, I've been judging by memory and CPU usage, which has been pretty
nominal (CPU ~20%, memory around 60-70%).

Still, I've decided to once again rebuild the cluster from scratch. I understand that there may have been load on the systems, but I don't see
why they should just crash like that; I could understand them running slowly or throwing errors.

Would you have a recommendation for node sizing within AWS? What do you
use? An instance type would be useful.

I don't know why it's crashing. I assume you had some errors in logs or GC warnings...

Reduce the number of shards (1 shard is often enough) and maybe reduce the number of indices. But I don't know your use case, so I don't know whether that's doable or not.

If you don't want to suffer from noisy neighbors, x-large instances are interesting.

Also, if you don't want to manage this yourself, I'd recommend looking at Found (Elasticsearch as a service by Elastic).

Given the amount of data you have, the instance type you are using may very well be sufficient, as long as you drastically reduce the number of shards. Each shard is, as David pointed out, a separate Lucene index and carries some overhead in terms of file descriptors and memory.

I would recommend reducing both the number of indices as well as the number of shards per index. Given your heap size and data volume, aiming to have tens rather than thousands of shards in the cluster might be a good target.
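If you want to see where all those shards are coming from, the _cat APIs give a quick overview (assuming the default localhost:9200 binding; adjust host/port as needed):

```
# one line per index, with primary/replica counts and sizes
curl 'localhost:9200/_cat/indices?v'

# one line per shard, including which node it is allocated to
curl 'localhost:9200/_cat/shards?v'
```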

@dadoonet @Christian_Dahlqvist
This is a fairly out-of-the-box installation, so I have not been manually administering the shards; they may simply have grown out of control for some reason.

I'm going to investigate how to limit shards, but I would appreciate if you had any advice on config settings related to this.

Thanks.

EDIT: I found the index.number_of_shards setting; I will be rebuilding with this set to 1 in the elasticsearch.yml file. Found the details here.
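As a sketch (assuming I've read the docs right), the node-level defaults I'm planning to add to elasticsearch.yml look like this; they only affect indices created after the change:

```
# elasticsearch.yml -- defaults applied to newly created indices
index.number_of_shards: 1
# keep one replica so the second node still holds a full copy of each index
index.number_of_replicas: 1
```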

(For anyone else wondering about this, the default is 5 shards per index, so 130 daily Logstash indices replicated to a single mirror node gives 130 × 5 × 2 = 1300 shards, and the remaining index is .kibana with 1 primary and 1 replica, which accounts for the 1302 shards across 131 indices.)

EDIT2: I'm going to reduce the shard count primarily by setting index.number_of_shards to 1, but in addition I'm going to investigate using fewer indices, as recommended by @dadoonet and @Christian_Dahlqvist.
This second part I think I need to do on the Logstash side, by setting the index option in my output config; according to this docs page, it defaults to "logstash-%{+YYYY.MM.dd}", creating a new index each day as standard.
I think setting this to a single fixed index such as "logstash-someindexname" would let me leave index.number_of_shards at the unconfigured default of 5.
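For reference, the relevant part of the Logstash output would then look roughly like this (Logstash 2.x syntax; the hosts value is just a placeholder for my actual nodes):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]       # placeholder; point this at the real nodes
    index => "logstash-someindexname" # one fixed index instead of one per day
  }
}
```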

It is quite easy to change from daily to e.g. monthly indices in Logstash just by removing the day part of the date pattern (i.e. "logstash-%{+YYYY.MM}"). You can also modify the template Logstash has uploaded to Elasticsearch and specify there that each new index should use 1 shard and 1 replica, or change the default values in elasticsearch.yml.
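As a rough example, a small extra template along these lines would make any new logstash-* index default to 1 shard and 1 replica (the template name and host here are placeholders; alternatively, edit the full template Logstash has already uploaded):

```
curl -XPUT 'localhost:9200/_template/logstash_shards' -d '{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```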

@dadoonet and @Christian_Dahlqvist you've helped me a lot. Thanks!

I've rebuilt my cluster, and it's now running much faster (although I can't speak for stability yet).
I did this by setting the index in my elasticsearch output to index => "logstash-%{+YYYY.MM}", so Logstash creates one index per month rather than a new index for each day.

My updated status:
- 3 indices
- 22 shards
- 490,000 documents (currently)
- heap usage between 20% and 50%

So far it looks good, but I've discovered a strange issue of duplicate events.
I don't want to hijack this thread, so I created another here to explain the issue.

Maybe you might be able to have a look in there if you're interested.

Thanks.

22 shards is still a lot IMO on a single machine with so little memory.

I can, for example, inject 1m docs into a single index with 5 shards.
Maybe you could imagine having one index (1 shard, 0 replicas as you have a single node) per year instead of per month?

Well, I think it's more like 11 shards per node at this point, split across 3 indices (one of those indices is the .kibana index, so it's more like 10 shards per node across 2 indices).

So it's about 10 shards per machine, if I'm thinking about this correctly.

It's running much faster at this point, and I think I'll want to experiment with reducing the shard count as well, but I'm making one change at a time to test.

Two threads with useful information about number of shards in an ES cluster


Thanks @anhlqn, this is helpful reading.

I've taken note of some of this info.