Elastic 2.3.4. Node Startup Quiet Failure


(Geoff) #1

We are running a 5-node cluster hosted in Google Cloud (Ubuntu 16.04 LTS). We noticed that disk usage on one of the nodes was above 90%, so we shut the node down with:
sudo service elasticsearch stop
and then stopped the instance in the GCP console.

After increasing the node's disk size, we tried starting Elasticsearch again with:
sudo service elasticsearch start
This command seems to fail silently, and the SSH session freezes momentarily and then terminates. Nothing appears in the node's Elasticsearch logs or in the logs of the cluster's current master. The only hint we can find of something going wrong is in the node's syslog:

    Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Cleanup of Temporary Directories.
    Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Starting Elasticsearch...
    Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Elasticsearch.
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.597729] kernel tried to execute NX-protected page - exploit attempt? (uid: 113)
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.605545] BUG: unable to handle kernel paging request at 00007f896d5467c0
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.612621] IP: 0x7f896d5467c0
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.615779] PGD 80000003050ee067
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.615780] P4D 80000003050ee067
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.619199] PUD 30508d067
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.622626] PMD 305162067
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.625438] PTE 80000003df15b867
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.628245]
    Jan 25 15:48:30 elasticsearch-1-vm kernel: [  919.633174] Oops: 0011 [#1] SMP PTI
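
Kernel fault lines like these are easy to miss among the systemd messages. A minimal sketch for pulling them out of a syslog, assuming the log lives at /var/log/syslog as on a default Ubuntu install (the function name kernel_faults is made up for illustration):

```shell
# Reads syslog lines on stdin and prints only kernel messages that
# mention a BUG, an Oops, or an NX-protected page violation.
kernel_faults() {
  awk '/kernel: / && (/BUG:/ || /Oops:/ || /NX-protected/)'
}

# Assumed default syslog location; skipped if the file is not readable.
if [ -r /var/log/syslog ]; then
  kernel_faults < /var/log/syslog
fi
```

Run against the excerpt above, this would keep the NX-protected, BUG: and Oops: lines and drop the systemd and page-table lines.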

With the remaining 4 nodes the cluster health is green, and we can't figure out why the fifth node fails to start.

Any ideas would be very helpful.


Here is our config located in /etc/default/elasticsearch:

Also here is our /etc/elasticsearch/elasticsearch.yml

The only thing I can think of that might be causing this issue is
discovery.zen.minimum_master_nodes: 2
when maybe it should be configured as
discovery.zen.minimum_master_nodes: 3
But we are not sure this is the issue and don't want to risk breaking our Elasticsearch cluster any further.
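
For reference, the usual guidance for this setting is a quorum of the master-eligible nodes: (master_eligible_nodes / 2) + 1, with the division rounded down. A minimal sketch of that arithmetic, assuming all 5 nodes in the cluster are master-eligible (the function name quorum is made up for illustration):

```shell
# Prints the quorum for a given number of master-eligible nodes:
# floor(N / 2) + 1, using the shell's integer division.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Assumption: all 5 nodes are master-eligible.
quorum 5
```

Under that assumption the value would be 3. As far as I know, a value that is too low weakens split-brain protection rather than stopping a node process from starting, so it would not by itself explain a silent startup failure.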


(David Pilato) #2

Did you upgrade the Kernel?

See


(Geoff) #3

Thank you for the reply!

/usr/share/elasticsearch/bin/elasticsearch --version gives us:
Version: 2.3.4, Build: e455fd0/2016-06-30T11:24:31Z, JVM: 1.8.0_151

java -version gives us:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

We haven't done any OS-level upgrades; I double-checked our /var/log/apt/history.log file.

But your question made me check which OS we are using. The node that cannot join is running Ubuntu 16.04 LTS. Of the 4 operational nodes in the cluster, 2 are running Ubuntu 16.04 and 2 are running Debian Jessie. Could this be part of the issue?


(Geoff) #4

Update:
Comparing the kernels of two nodes that are successfully running using Ubuntu 16.04 versus the kernel of the node that is not successfully running:
Successful:
4.13.0-1006-gcp #9-Ubuntu SMP Mon Jan 8 21:13:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Unsuccessful:
4.13.0-1007-gcp #10-Ubuntu SMP Fri Jan 12 13:56:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Looking into reverting the kernel upgrade. It must have been done automatically by Google.
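
A comparison like this can be scripted so it is quick to repeat on each node. A minimal sketch, assuming 4.13.0-1006-gcp is the release to treat as known good (the function name check_kernel and the variable known_good are made up for illustration):

```shell
# Compares a running kernel release ($1) against an expected one ($2)
# and prints whether they match.
check_kernel() {
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "differs: running $1, expected $2"
  fi
}

# Assumption: this is the release seen on the working Ubuntu nodes.
known_good="4.13.0-1006-gcp"

# uname -r prints only the kernel release of the current machine.
check_kernel "$(uname -r)" "$known_good"
```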


(Geoff) #5

For anyone else running GCP instances on Ubuntu 16.04 LTS: I downgraded the kernel from:
uname -a
Linux elasticsearch-1-vm 4.13.0-1007-gcp #10-Ubuntu SMP Fri Jan 12 13:56:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
To:
uname -a
Linux elasticsearch-1-vm 4.13.0-1006-gcp #9-Ubuntu SMP Mon Jan 8 21:13:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

To fix the issue with the GCP instances, I ran:

sudo apt remove 4.13.0-1007-gcp
sudo apt install 4.13.0-1006-gcp
exit

Then restart the instance in the Google Cloud console, SSH back in, and run:
sudo service elasticsearch start
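
Before removing or installing kernels it can help to see which kernel image packages are actually installed, since the exact package names may differ from the bare version strings used above. A minimal sketch, assuming the images follow the usual linux-image-<release> naming (the function name installed_kernel_images is made up for illustration):

```shell
# Reads `dpkg -l` output on stdin and prints the names of kernel image
# packages whose status is "ii" (installed).
installed_kernel_images() {
  awk '$1 == "ii" && $2 ~ /^linux-image-/ { print $2 }'
}

# Only runs on systems that have dpkg (Debian, Ubuntu).
if command -v dpkg >/dev/null 2>&1; then
  dpkg -l | installed_kernel_images
fi
```

After the reboot, uname -r should report the older release before Elasticsearch is started again.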

