Replica generation after Elasticsearch restart


#1

Hi,

I am using Elasticsearch 2.3.4 and have run into the following issue with rolling restarts.

I needed to change the configuration and install a new plugin. I stopped my indexing processes and followed the steps described in the reference (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/rolling-upgrades.html): disabled shard allocation, made the changes, and started the node again. After doing this on every node, I re-enabled shard allocation.
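For reference, the disable/enable step above can be sketched in Python as follows. This is only a minimal sketch using the standard library; the base URL is an assumption, and the helper names (`allocation_settings`, `cluster_settings_request`, `send`) are made up for illustration. The setting name `cluster.routing.allocation.enable` and the `_cluster/settings` endpoint are the ones used on the rolling-upgrades page linked above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: replace with your own node address


def allocation_settings(enable):
    """Build a transient cluster-settings body, e.g. enable="none" or "all"."""
    return {"transient": {"cluster.routing.allocation.enable": enable}}


def cluster_settings_request(body, base_url=ES_URL):
    """Prepare (but do not send) a PUT /_cluster/settings request."""
    return urllib.request.Request(
        base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Before restarting a node one would call `send(cluster_settings_request(allocation_settings("none")))`, and after the node has rejoined, the same call with `"all"`.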

Now in Marvel I can see that all the replicas have disappeared, and the cluster is busy regenerating them. How can I avoid this step? Can't Elasticsearch detect the existing replicas on the nodes? (I have a few TB of data in my cluster, so this step takes hours and also affects the performance of the system.)

I checked the data folder on the nodes, and it seems that the old replica data still exists in the ES data folder, but ES doesn't use it. I now have a node with only 15% free space and a lot of "orphan" shards, and because of this ES doesn't allocate any replicas to it. Can ES free up this space, or can it somehow detect the existing shards?

Regards,
Peter


(Mark Walkom) #2

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/indices-synced-flush.html should apply here.
You can also issue a synced flush API call yourself before restarting the node, just to make sure.
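A minimal sketch of such a call in Python, assuming a node reachable at the given base URL. The `_flush/synced` endpoint is the one documented on the page linked above; the helper names are made up for illustration.

```python
import json
import urllib.request


def synced_flush_request(base_url, index=None):
    """Prepare a POST to /_flush/synced, optionally limited to one index."""
    path = "/_flush/synced" if index is None else "/" + index + "/_flush/synced"
    return urllib.request.Request(base_url + path, data=b"", method="POST")


def failed_shards(response):
    """Return the top-level failed-shard count from a synced-flush response."""
    return response["_shards"]["failed"]


def synced_flush(base_url, index=None):
    """Send the synced flush and return the decoded JSON response."""
    with urllib.request.urlopen(synced_flush_request(base_url, index)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If `failed_shards(...)` is non-zero, some shard copies did not get a sync marker, for example because indexing was still in progress, and the flush can be retried before the restart.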


#3

Thanks Mark!

I checked the synced flush, but I don't think it is the cause here.
I store time-based data and use only a few indices per day. After the restart it seems that all my replicas are missing and the cluster starts to regenerate them. I also checked /my_index/_stats/commit?level=shard for the indices, and all of the shards have a sync_id.
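The sync_id check can be scripted roughly like this. It is a sketch that assumes the stats response has the shape `indices -> <index> -> shards -> <shard number> -> [copies]`, with the marker stored under `commit.user_data.sync_id`; the function name is made up, and the layout should be verified against a real response from your own cluster.

```python
def copies_without_sync_id(stats):
    """List (index, shard, is_primary) for every shard copy lacking a sync_id.

    `stats` is the parsed JSON of GET /<index>/_stats/commit?level=shard.
    """
    missing = []
    for index_name, index_stats in stats.get("indices", {}).items():
        for shard_id, copies in index_stats.get("shards", {}).items():
            for copy in copies:
                user_data = copy.get("commit", {}).get("user_data", {})
                if "sync_id" not in user_data:
                    is_primary = copy.get("routing", {}).get("primary")
                    missing.append((index_name, shard_id, is_primary))
    return missing
```

An empty result means every copy reported by the stats call carries a sync_id, which matches what I saw.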

(I also found the replicas on the hosts (data file system), but ES couldn't detect them.)

It took a weekend to restore the status to green (after replica generation ES started to balance the shards).

Additional information: I restarted my nodes one after the other, so my cluster was up and running all the time (but my indexing engine was disabled).

What am I missing here? Any other idea?


#4

I could narrow down the problem.

In my cluster I have one node with a smaller disk. Before the restart it had only 12% free space (so the node was between the low and high disk-usage watermarks). When I restart the cluster, I can see that all the replicas on that node are "ignored".
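To make the watermark arithmetic explicit, here is a small sketch. It assumes the documented default thresholds of 85% (low) and 90% (high) disk usage; my cluster's actual values may differ, and the function name is made up for illustration.

```python
LOW_WATERMARK_PCT = 85.0   # assumed default for cluster.routing.allocation.disk.watermark.low
HIGH_WATERMARK_PCT = 90.0  # assumed default for cluster.routing.allocation.disk.watermark.high


def watermark_state(free_pct, low=LOW_WATERMARK_PCT, high=HIGH_WATERMARK_PCT):
    """Classify a node by disk usage relative to the two watermarks."""
    used_pct = 100.0 - free_pct
    if used_pct > high:
        return "above_high"            # ES tries to move shards off the node
    if used_pct > low:
        return "between_low_and_high"  # no new shards are allocated to the node
    return "below_low"                 # node accepts new shards


print(watermark_state(12))  # 12% free means 88% used
```

With 12% free the node is at 88% usage, which under these assumed defaults falls between the two watermarks.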

Here is a screenshot of my node in Marvel:

After the restart the number of segments dropped. If I check the shards on that node, they are all primaries, with no replicas. The sum of "data" and "free disk space" is only 160GB, although the data sits on a dedicated 500GB disk with nothing else on it; the rest is the data of the "ignored" replicas. (The replica of the .security index is also missing on that node, and because of the watermark, ES can't allocate it.) ES does seem to detect the ignored data eventually and frees up the space over time, but it takes a while.

Is it possible that ES treats these replica shards as "new shards" and, because of the low watermark, refuses to allocate them (even though the data is already on the disk)?

