Elastic 2.3.4. Node Startup Quiet Failure

deppi · January 25, 2018, 5:30pm

We are running a 5-node cluster hosted in Google Cloud (Ubuntu 16.04 LTS). We noticed that disk usage on one of the nodes was above 90%, so we shut the node down with:
sudo service elasticsearch stop
and then stopped the instance in the GCP console.

After increasing the node's disk size, we tried starting Elasticsearch again with:
sudo service elasticsearch start
This command seems to fail silently, and the SSH session terminates after freezing momentarily. Nothing shows in the node's elasticsearch logs, and nothing shows up in the current cluster's master elasticsearch logs either. The only hint we can find of something going wrong is in the node's syslog:

    Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Cleanup of Temporary Directories.
    Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Starting Elasticsearch...
    Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Elasticsearch.
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.597729] kernel tried to execute NX-protected page - exploit attempt? (uid: 113)
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.605545] BUG: unable to handle kernel paging request at 00007f896d5467c0
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.612621] IP: 0x7f896d5467c0
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.615779] PGD 80000003050ee067
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.615780] P4D 80000003050ee067
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.619199] PUD 30508d067
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.622626] PMD 305162067
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.625438] PTE 80000003df15b867
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.628245]
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.633174] Oops: 0011 [#1] SMP PTI
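
A minimal sketch for pulling only the kernel fault lines out of a syslog stream like the one above. The function name `kernel_faults` is hypothetical, and the pattern list is an assumption based solely on the messages quoted in this post; other kernels may word these messages differently.

```shell
# Filter a syslog stream down to kernel fault lines.
# The markers (BUG:, Oops:, NX-protected page) are taken from the
# excerpt above and are not an exhaustive list.
kernel_faults() {
    grep -E 'kernel: .*(BUG:|Oops:|NX-protected page)'
}

# Usage on the affected node (Ubuntu's default syslog path):
#   sudo cat /var/log/syslog | kernel_faults
```

A hit timestamped right after "Started Elasticsearch." suggests the crash is in the kernel rather than in Elasticsearch itself, which would also explain the empty Elasticsearch logs.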

Cluster health with the remaining 4 nodes is green, and we can't figure out why this is happening.

Any ideas would be very helpful.

Here is our config located in /etc/default/elasticsearch:

https://gist.github.com/deppi/58826c38ea8414d301eb034e9a29cd54

elasticsearch

################################
# Elasticsearch
################################

# Elasticsearch home directory
#ES_HOME=/usr/share/elasticsearch

# Elasticsearch configuration directory
#CONF_DIR=/etc/elasticsearch

This file has been truncated.

Also here is our /etc/elasticsearch/elasticsearch.yml

https://gist.github.com/deppi/17b1f28e649ee528b0fe2ca93a2ff19c

elasticsearch.yml

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please see the documentation for further information on configuration options:

This file has been truncated.

The only thing I can think of that might be causing this is the setting
discovery.zen.minimum_master_nodes: 2
when perhaps it should be
discovery.zen.minimum_master_nodes: 3
But we are not certain this is the issue and don't want to risk further breaking our Elasticsearch cluster.
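
For reference, the usual quorum rule for this setting is (master-eligible nodes / 2) + 1, rounded down. The sketch below only does that arithmetic; the function name `quorum` is hypothetical, and it assumes all five nodes are master-eligible, which the truncated config above does not confirm.

```shell
# Majority quorum for zen discovery: floor(master_eligible / 2) + 1.
# Assumption: every node counted here is master-eligible.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Usage:
#   quorum 5   # prints 3
```

Under that assumption a 5-node cluster would want a value of 3. A value of 2 leaves the cluster open to split brain, but on its own it would not be expected to stop a node process from starting.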

dadoonet · January 25, 2018, 5:42pm

Did you upgrade the Kernel?

See

deppi · January 25, 2018, 6:02pm

Thank you for the reply!

/usr/share/elasticsearch/bin/elasticsearch --version gives us:
Version: 2.3.4, Build: e455fd0/2016-06-30T11:24:31Z, JVM: 1.8.0_151

java -version gives us:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

We haven't done any OS-level upgrades; I double-checked our /var/log/apt/history.log file.

But your question made me check the OS we are using. The node that cannot join is running Ubuntu 16.04 LTS. Of the 4 operational nodes in the cluster, 2 are running Ubuntu 16.04 and 2 are running Debian Jessie. Could this be part of the issue?

deppi · January 25, 2018, 6:52pm

Update:
Comparing the kernels of the two Ubuntu 16.04 nodes that are running successfully against the kernel of the node that is not:
Successful:
4.13.0-1006-gcp #9-Ubuntu SMP Mon Jan 8 21:13:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Unsuccessful:
4.13.0-1007-gcp #10-Ubuntu SMP Fri Jan 12 13:56:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Looking into reverting the kernel upgrade. It must have been done automatically by Google.
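
A rough sketch of how this comparison could be automated across nodes. It reads "host kernel-release" pairs and prints the hosts whose release differs from the most common one. The function name `kernel_outliers` and the host names in the usage comment are hypothetical; only nodes on the same distribution should be compared, since the Debian nodes will report different kernels anyway.

```shell
# Read "host kernel-release" pairs on stdin and print every host whose
# release differs from the most common release in the input.
kernel_outliers() {
    awk '{ host[NR] = $1; rel[NR] = $2; count[$2]++ }
         END {
             # Find the release reported by the most hosts.
             for (r in count) if (count[r] > best) { best = count[r]; common = r }
             # Print the hosts that deviate from it.
             for (i = 1; i <= NR; i++) if (rel[i] != common) print host[i], rel[i]
         }'
}

# Usage (host names are placeholders):
#   for h in es-1 es-2 es-3; do echo "$h $(ssh "$h" uname -r)"; done | kernel_outliers
```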

deppi · January 25, 2018, 7:18pm

For anyone else running GCP instances on Ubuntu 16.04 LTS: I downgraded the kernel from:
uname -a
Linux elasticsearch-1-vm 4.13.0-1007-gcp #10-Ubuntu SMP Fri Jan 12 13:56:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
To:
uname -a
Linux elasticsearch-1-vm 4.13.0-1006-gcp #9-Ubuntu SMP Mon Jan 8 21:13:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

To fix the issue with the GCP instances, I ran:

sudo apt remove 4.13.0-1007-gcp
sudo apt install 4.13.0-1006-gcp
exit

Then restart the instance in the Google Cloud console, SSH back in, and run:
sudo service elasticsearch start
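
Before starting Elasticsearch it may be worth confirming that the instance really booted into the older kernel. A small sketch follows; the function name `running_kernel_is` is hypothetical, and the `curl` line in the comment assumes the default HTTP port on localhost with no authentication in front of it.

```shell
# Succeed only if the running kernel release matches the expected one.
running_kernel_is() {
    [ "$(uname -r)" = "$1" ]
}

# Usage on the repaired node:
#   running_kernel_is 4.13.0-1006-gcp && sudo service elasticsearch start
#   curl -s 'localhost:9200/_cat/nodes?v'   # the node should appear in this list
```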

system · February 22, 2018, 7:19pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
