Elasticsearch stops ingesting data when node is down

Hi there,

I have an ES cluster which stops ingesting data every time a node is down.
As it is a cluster used only for debugging, it retains logs with no replication for most of its shards. There are 13 data nodes (i3.3xlarge) and 3 dedicated master nodes (m4.large).

Also, the ingestion rate seems to decrease every time shards get reallocated, as well as when a certain node faces high CPU usage.

Any idea what may be happening?

Thanks!

Summary
[root@ip-10-0-0-212 elasticsearch]# curl localhost:9200
{
  "name" : "tfg-es-logs-cluster-node-x",
  "cluster_name" : "my-cluster",
  "cluster_uuid" : "my-uid",
  "version" : {
    "number" : "5.6.5",
    "build_hash" : "6a37571",
    "build_date" : "2017-12-04T07:50:10.466Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

Cluster health:
[ec2-user@ip-10-x-x-x ~]$ curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "my-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 16,
  "number_of_data_nodes" : 13,
  "active_primary_shards" : 8135,
  "active_shards" : 8358,
  "relocating_shards" : 1,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

OS:
Amazon Linux
[ec2-user@ip-10-0-0-212 ~]$ uname -a
Linux ip-10-0-0-212 4.9.77-31.58.amzn1.x86_64 #1 SMP Thu Jan 18 22:15:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Elasticsearch.yml
cluster.name: my-cluster
node.name: my-cluster-my-node
path.conf: "/etc/elasticsearch"
path.data: "/mnt/es_disk"
path.logs: "/var/log/elasticsearch"
network.host: 0.0.0.0
discovery.type: ec2
discovery.ec2.groups: sg-my-sg
discovery.ec2.any_group: false
cloud.node.auto_attributes: true
bootstrap.memory_lock: true
script.inline: true
script.stored: true
xpack.security.enabled: false
xpack.monitoring.enabled: false
node.data: true
node.master: false
node.ingest: true
node.attr.index_type: log

If you do not have a replica configured for an index, the index will go red if you lose a shard, which means you cannot index into it. If you did have a replica, Elasticsearch would promote the replica to primary when you lose one, and you could continue indexing because at least one copy of each shard would still be available.
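As a concrete sketch: in Elasticsearch 5.x, `number_of_replicas` is a dynamic index setting, so a replica can be added to an existing index without reindexing. The index name below is just an example placeholder.

```shell
# Add one replica to an existing index (index name is an example).
# number_of_replicas is dynamic, so this takes effect immediately;
# the cluster will then start allocating the replica copies.
curl -XPUT 'localhost:9200/logs-2018.02.01/_settings?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "index" : {
    "number_of_replicas" : 1
  }
}'
```

Note that allocating replicas for 8000+ shards will temporarily increase disk and network usage while the copies are built.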

Thanks @Christian_Dahlqvist .

Let's imagine node X is down for a few minutes. Node X contains primary shards px1, px2 and px3. We also have node Y with primary shards py1, py2 and py3. If Node X is down, should this affect ingestion of py1, py2 and py3? (considering both scenarios where px_i and py_i belong to the same and different indices)

If they belong to the same index, then indexing should stop, as documents being indexed may be routed to primary shards that are not available.

If they belong to different indices, then happy days.
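To see which indices are affected when a node drops out, the cat APIs show red indices and unassigned shards directly. These are standard 5.x endpoints; the `grep` is just a convenience filter.

```shell
# List indices that have gone red (at least one primary unavailable):
curl 'localhost:9200/_cat/indices?v&health=red'

# Show unassigned shards and why they are unassigned:
curl 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' \
  | grep UNASSIGNED
```

Indexing into a red index fails, while green/yellow indices on the surviving nodes continue to accept writes.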

I recommend turning on compression on the indices and replicating the data. Data loss is not fun, even if the data is just logs.
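One way to apply both suggestions: `index.codec` is a static setting, so it cannot be changed on an open index, but an index template applies it (plus a replica) to every newly created daily log index. The template name and index pattern below are examples; the 5.x template API uses the `template` field for the pattern.

```shell
# Template for new log indices: best_compression trades some CPU at
# index/merge time for smaller segments on disk, and one replica
# protects against a single node failure. Name and pattern are examples.
curl -XPUT 'localhost:9200/_template/logs_defaults?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "template" : "logs-*",
  "settings" : {
    "index.codec" : "best_compression",
    "index.number_of_replicas" : 1
  }
}'
```

Existing indices keep their current codec; to compress them you would need to reindex, or force-merge after setting the codec on a newly created index.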

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.