Several hours for elasticsearch upgrade

I have a single-node ELK server based on CentOS that I set up last year. If I remember correctly, I've upgraded it from 2.x -> 5.1 -> 5.3 -> 5.4.1 -> 5.4.2. Today it holds about 700GB of data (size of the data folder).

Every time I start the elasticsearch service after an upgrade, it takes a long time for kibana to start showing data. Recently I figured out that I can check the status with

curl -s -XGET "http://localhost:9200/_cat/indices?v"|grep kibana

and as long as this index is red, kibana does not load. I also realized that elasticsearch moves the 500+ indices on this server from red to yellow status over a few hours.

Is it normal that even a 5.4.1 to 5.4.2 upgrade takes several hours before elasticsearch is fully available? Is it possible to upgrade with minimal downtime?


Hey,

have you checked the rolling upgrade documentation? The synced flush step in particular is something you might want to look at.
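On 5.x a synced flush is a single call against the _flush/synced endpoint; assuming elasticsearch is listening on localhost:9200 as in your curl example, something like this should do it:

curl -s -XPOST "http://localhost:9200/_flush/synced?pretty"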

--Alex

Would a synced flush help even with a single-node cluster?

i.e. stop the logstash service (which is sending data to elasticsearch), issue a synced flush (I guess this may take a long time), upgrade elasticsearch, and start the elasticsearch service - then kibana should be available in a few minutes instead of hours.
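Roughly this sequence, I think (assuming CentOS 7 with systemd and the default service names - adjust if your setup differs):

sudo systemctl stop logstash                                  # stop indexing
curl -s -XPOST "http://localhost:9200/_flush/synced?pretty"   # write sync markers
sudo yum update elasticsearch                                 # upgrade the package
sudo systemctl restart elasticsearch                          # pick up the new version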

I'll try this next time.

Yes, it would. Also, it is not an expensive operation as you assumed - it basically just adds a sync marker to each shard.
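If you want to verify it, the shard-level stats show a sync_id entry under commit.user_data for shards that have been sync-flushed; something along these lines (the index name here is just an example):

curl -s -XGET "http://localhost:9200/logstash-2017.06.20/_stats?level=shards&filter_path=**.commit&pretty"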

v5.4.3 was released and I got a chance to check this by upgrading my single-node cluster from v5.4.2.

Summary: still a downtime of more than 1 hr.

Details:

  • Stopped the logstash service
  • Started a synced flush, which took about 12 mins. Kibana was still available at this point
  • Did a yum update - this stopped kibana but did not stop/restart elasticsearch
  • Restarted the elasticsearch service, which took nearly 5 mins to start responding to HTTP requests
  • Started the kibana service; kibana loaded with a red status
  • Periodically checked the number of red indices, and whether the kibana index was red, using the cat/indices API (see the example commands after this list). The number of red indices dropped continuously, but it took about an hour for all of them to recover
  • The kibana index was one of the last, or the last one, to recover
  • Kibana loaded fine only after this
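For reference, the check looked roughly like this (quoting from memory, so the exact flags may have differed, and assuming the default .kibana index name):

curl -s -XGET "http://localhost:9200/_cat/indices?health=red" | wc -l   # count of red indices
curl -s -XGET "http://localhost:9200/_cat/indices/.kibana?v"            # status of the kibana index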

Some details about my cluster:

  • 436 indices
  • 2826 shards (the number of shards and replicas is not consistent in all indices)
  • size_in_bytes : 799218702569 (744GB)

That is, in my opinion, a lot of shards for a single node, especially since the average shard size is only around 250MB. I would recommend using the shrink index API to reduce the number of primary shards to 1 for indices with less than a few GB of data. It may also be beneficial to try to reduce the number of indices in order to increase the average shard size. Reducing the number of indices and shards the node needs to manage should speed things up.
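As a rough sketch, the 5.x shrink flow looks like this - the index names here are just examples, the source index has to be made read-only first, and on a single node the requirement that all of its shards live on one node is already met:

curl -s -XPUT "http://localhost:9200/logstash-2017.06.20/_settings" -H 'Content-Type: application/json' -d '{"index.blocks.write": true}'
curl -s -XPOST "http://localhost:9200/logstash-2017.06.20/_shrink/logstash-2017.06.20-shrunk" -H 'Content-Type: application/json' -d '{"settings": {"index.number_of_shards": 1, "index.number_of_replicas": 0}}'

Once the shrunken index has recovered and you are happy with it, you can delete the original.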


Thanks! I'll try the shrink index API. Any recommendations for average shard size?

Also, 2826 shards is as reported by the GET /_stats API. Out of these, only 1505 are active/pri according to the /_cat/health API. I guess the others are replicas, which will never get assigned in my single-node cluster.
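For completeness, these are the kinds of calls I used to get those numbers (assuming the defaults again - the shard totals show up in the _shards header of _stats and in the shards/pri columns of _cat/health):

curl -s -XGET "http://localhost:9200/_stats?filter_path=_shards&pretty"   # total/successful shard counts
curl -s -XGET "http://localhost:9200/_cat/health?v"                       # active shards and pri columns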

Anyway, I tried another experiment: I simply restarted the elasticsearch service (no upgrade), and kibana was down for about 27 mins while the number of red indices dropped to 0 in 30 mins. So even though an upgrade takes longer, a plain restart alone also takes a long time.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.