I suspect my Elasticsearch is not set up correctly to handle the massive amount of logs I have.
Logs from this morning:
[2016-07-22 04:01:50,401][WARN ][cluster.action.shard ] [Margali Szardos] [filebeat-2016.07.22][1] received shard failed for target shard [[filebeat-2016.07.22][1], node[8Axw9bfLQ1ejwYGk67tnMg], [P], v[4], s[INITIALIZING], a[id=M8E1OD30SFOCSflBjG4wgw], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-07-22T02:54:28.460Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: OutOfMemoryError[Java heap space]; ]]], indexUUID [44N8S3trRMiecFHILJYO_w], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: OutOfMemoryError[Java heap space]; ]
[filebeat-2016.07.22][[filebeat-2016.07.22][1]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: OutOfMemoryError[Java heap space];
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [filebeat-2016.07.22][[filebeat-2016.07.22][1]] EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: OutOfMemoryError[Java heap space];
Caused by: java.lang.OutOfMemoryError: Java heap space
When I try to start Elasticsearch:
[2016-07-22 14:36:58,579][INFO ][node ] [Aldebron] version[2.3.4], pid[40853], build[e455fd0/2016-06-30T11:24:31Z]
[2016-07-22 14:36:58,579][INFO ][node ] [Aldebron] initializing ...
[2016-07-22 14:36:59,138][INFO ][plugins ] [Aldebron] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2016-07-22 14:36:59,164][INFO ][env ] [Aldebron] using [1] data paths, mounts [[/data (/dev/sdb1)]], net usable_space [1.2tb], net total_space [1.6tb], spins? [possibly], types [xfs]
[2016-07-22 14:36:59,164][INFO ][env ] [Aldebron] heap size [30.8gb], compressed ordinary object pointers [true]
[2016-07-22 14:36:59,164][WARN ][env ] [Aldebron] max file descriptors [65535] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-07-22 14:37:01,826][INFO ][node ] [Aldebron] initialized
[2016-07-22 14:37:01,826][INFO ][node ] [Aldebron] starting ...
[2016-07-22 14:37:02,613][INFO ][transport ] [Aldebron] publish_address {127.0.0.1:9301}, bound_addresses {127.0.0.1:9301}
[2016-07-22 14:37:02,629][INFO ][discovery ] [Aldebron] elasticsearch/akJzSybBRlObSKslNr3dVQ
[2016-07-22 14:37:32,632][WARN ][discovery ] [Aldebron] waited for 30s and no initial state was set by the discovery
[2016-07-22 14:37:32,679][INFO ][http ] [Aldebron] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2016-07-22 14:37:32,679][INFO ][node ] [Aldebron] started
[2016-07-22 14:37:34,473][DEBUG][action.admin.indices.create] [Aldebron] no known master node, scheduling a retry
[2016-07-22 14:37:50,729][INFO ][discovery.zen ] [Aldebron] failed to send join request to master [{Margali Szardos}{8Axw9bfLQ1ejwYGk67tnMg}{127.0.0.1}{127.0.0.1:9300}], reason [RemoteTransportException[[Margali Szardos][127.0.0.1:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[Aldebron][127.0.0.1:9301] connect_timeout[30s]]; ]
[2016-07-22 14:38:34,476][DEBUG][action.admin.indices.create] [Aldebron] timed out while retrying [indices:admin/create] after failure (timeout [1m])
[2016-07-22 14:38:34,483][WARN ][rest.suppressed ] path: /_bulk, params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]
How are you starting Elasticsearch? The /etc/default/elasticsearch settings look OK, but they do not appear to be used, since your htop output is showing -Xms256m -Xmx1g.
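For reference, on the 2.x Debian/RPM packages the heap is set through ES_HEAP_SIZE in /etc/default/elasticsearch (or /etc/sysconfig/elasticsearch), and it is only picked up when Elasticsearch is started as the service. Something along these lines (the 4g below is only an illustration; size it to roughly half your RAM):

```
# /etc/default/elasticsearch (Debian) or /etc/sysconfig/elasticsearch (RPM)
# Only applied when Elasticsearch is launched via the init script / service.
# Rule of thumb: ~50% of RAM, but below ~31g so compressed oops stay enabled.
ES_HEAP_SIZE=4g
```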
My Elasticsearch sorted itself out last night. It's back up and running, but I lost all my data.
Elasticsearch does not work like that; it is either configured correctly or it is not. I suspect that you are starting Elasticsearch manually using bin/elasticsearch ..., rather than as the service. Starting it as the service will use /etc/default/elasticsearch and /etc/elasticsearch/elasticsearch.yml. These will use default locations (unless otherwise configured): Directory layout | Elasticsearch Guide | Elastic
If you started it manually, the defaults relative to where you ran it would have been used instead. You probably did not lose all your data; rather, it is at a different location (and possibly under a different cluster name, e.g. the default, elasticsearch).
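A quick way to check is to look for the nodes directory on disk; the paths below are the usual 2.x defaults, so adjust them if you set path.data to something else:

```
# Started as the service (Debian/RPM package): data lives under /var/lib/elasticsearch,
# inside a folder named after the cluster (default cluster name is "elasticsearch")
ls /var/lib/elasticsearch/elasticsearch/nodes/0/indices

# Started manually from a tarball: data lives under the extraction directory
ls ./data/elasticsearch/nodes/0/indices
```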
My Elasticsearch was working fine for a week before this happened. Then I lost my data; I tried restarting it and a few other things, but nothing worked. The next day it was working again, but all my data was lost.
I am running elasticsearch as a service and only use the service command:
```
service elasticsearch status
elasticsearch is running
```
I can still see that my data is in the location I configured. But I can't access it within Elasticsearch.
Do you know how to prevent Elasticsearch from doing this again? What do I need to set up to handle masses of logs?
Do your current cluster name and node name match those from before you restarted ES?
How many ES nodes are you running?
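A quick way to check both is against the running node (assuming it is listening on localhost:9200):

```
# Shows the node name ("name") and the cluster_name of this node
curl -XGET "http://localhost:9200/"

# Lists every node currently joined to the cluster
curl -XGET "http://localhost:9200/_cat/nodes?v"
```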
Your logs are not really that big, so they should not be the cause of the issue. I think it is more likely something in your ES config. Two ES nodes can easily handle terabytes of logs.
Why would you have a different node name after a restart? Did you change node.name: from Zodiak to Aldebron in the config file before restarting?
Have you renamed the data folder to match the new node name and updated path.data? An easier way would be to change node.name: back to Zodiak and restart ES.
How many shards and nodes do I need for my setup? And are there any other settings I need to set in the config file?
```
$ curl -XGET "http://localhost:9200/_cat/shards?v"
index               shard prirep state      docs     store   ip        node
filebeat-2016.07.24 3     p      STARTED    84504    13.3mb  127.0.0.1 Zodiak
filebeat-2016.07.24 3     r      UNASSIGNED
filebeat-2016.07.24 4     p      STARTED    83849    13.1mb  127.0.0.1 Zodiak
filebeat-2016.07.24 4     r      UNASSIGNED
filebeat-2016.07.24 2     p      STARTED    84337    13.2mb  127.0.0.1 Zodiak
filebeat-2016.07.24 2     r      UNASSIGNED
filebeat-2016.07.24 1     p      STARTED    84036    13.1mb  127.0.0.1 Zodiak
filebeat-2016.07.24 1     r      UNASSIGNED
filebeat-2016.07.24 0     p      STARTED    84295    13.1mb  127.0.0.1 Zodiak
filebeat-2016.07.24 0     r      UNASSIGNED
filebeat-2016.07.23 3     p      STARTED    107      143.7kb 127.0.0.1 Zodiak
filebeat-2016.07.23 3     r      UNASSIGNED
filebeat-2016.07.23 4     p      STARTED    99       132.8kb 127.0.0.1 Zodiak
filebeat-2016.07.23 4     r      UNASSIGNED
filebeat-2016.07.23 2     p      STARTED    85       121.9kb 127.0.0.1 Zodiak
filebeat-2016.07.23 2     r      UNASSIGNED
filebeat-2016.07.23 1     p      STARTED    90       132.4kb 127.0.0.1 Zodiak
filebeat-2016.07.23 1     r      UNASSIGNED
filebeat-2016.07.23 0     p      STARTED    108      135.6kb 127.0.0.1 Zodiak
filebeat-2016.07.23 0     r      UNASSIGNED
filebeat-2016.07.26 3     p      STARTED    12484844 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 3     r      UNASSIGNED
filebeat-2016.07.26 4     p      STARTED    12483196 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 4     r      UNASSIGNED
filebeat-2016.07.26 2     p      STARTED    12484632 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 2     r      UNASSIGNED
filebeat-2016.07.26 1     p      STARTED    12484105 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 1     r      UNASSIGNED
filebeat-2016.07.26 0     p      STARTED    12476659 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 0     r      UNASSIGNED
filebeat-2016.07.25 3     p      STARTED    29726030 19.1gb  127.0.0.1 Zodiak
filebeat-2016.07.25 3     r      UNASSIGNED
filebeat-2016.07.25 4     p      STARTED    29733489 19.1gb  127.0.0.1 Zodiak
filebeat-2016.07.25 4     r      UNASSIGNED
filebeat-2016.07.25 2     p      STARTED    29725998 19.1gb  127.0.0.1 Zodiak
filebeat-2016.07.25 2     r      UNASSIGNED
filebeat-2016.07.25 1     p      STARTED    29728050 19.1gb  127.0.0.1 Zodiak
filebeat-2016.07.25 1     r      UNASSIGNED
filebeat-2016.07.25 0     p      STARTED    29724972 19gb    127.0.0.1 Zodiak
filebeat-2016.07.25 0     r      UNASSIGNED
.kibana             0     p      STARTED    31       54kb    127.0.0.1 Zodiak
.kibana             0     r      UNASSIGNED
filebeat-2016.07.28 3     p      STARTED    2041727  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 3     r      UNASSIGNED
filebeat-2016.07.28 4     p      STARTED    2040642  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 4     r      UNASSIGNED
filebeat-2016.07.28 2     p      STARTED    2038883  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 2     r      UNASSIGNED
filebeat-2016.07.28 1     p      STARTED    2038639  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 1     r      UNASSIGNED
filebeat-2016.07.28 0     p      STARTED    2035531  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 0     r      UNASSIGNED
filebeat-2016.07.27 3     p      STARTED    6865087  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 3     r      UNASSIGNED
filebeat-2016.07.27 4     p      STARTED    6865035  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 4     r      UNASSIGNED
filebeat-2016.07.27 2     p      STARTED    6869406  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 2     r      UNASSIGNED
filebeat-2016.07.27 1     p      STARTED    6865712  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 1     r      UNASSIGNED
filebeat-2016.07.27 0     p      STARTED    6865905  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 0     r      UNASSIGNED
filebeat-2016.07.22 3     p      STARTED    3353     2.4mb   127.0.0.1 Zodiak
filebeat-2016.07.22 3     r      UNASSIGNED
filebeat-2016.07.22 4     p      STARTED    3362     2.5mb   127.0.0.1 Zodiak
filebeat-2016.07.22 4     r      UNASSIGNED
filebeat-2016.07.22 2     p      STARTED    3446     2.5mb   127.0.0.1 Zodiak
filebeat-2016.07.22 2     r      UNASSIGNED
filebeat-2016.07.22 1     p      STARTED    3443     2.5mb   127.0.0.1 Zodiak
filebeat-2016.07.22 1     r      UNASSIGNED
filebeat-2016.07.22 0     p      STARTED    3393     2.5mb   127.0.0.1 Zodiak
filebeat-2016.07.22 0     r      UNASSIGNED
```
Based on this output, your cluster is working fine. The unassigned shards (all r, i.e. replicas) exist because you are running only one ES node. Add another ES node to your cluster and those shards will be assigned to the second node.
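If you want to stay on a single node for now, another option is to drop the replica count so the cluster stops reporting unassigned shards; this only removes the replica copies, the primaries are untouched (the wildcard below assumes all your indices match filebeat-*):

```
# Single-node setup: set replicas to 0 on the existing filebeat indices
curl -XPUT "http://localhost:9200/filebeat-*/_settings" -d '
{
  "index": { "number_of_replicas": 0 }
}'
```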
It appears to me that you are trying to add a second node to your cluster, but network.publish_host: is set to 127.0.0.1 by default, so the nodes cannot communicate with each other. A sample elasticsearch.yml for a 3-node cluster could look like this (the cluster name, node names, and addresses below are placeholders):
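```
# elasticsearch.yml -- illustrative only; cluster name, node names, and IPs are placeholders
cluster.name: my-cluster                  # must be identical on all three nodes
node.name: node-1                         # unique per node (node-2, node-3 on the others)

# Bind/publish on the host's real address so the other nodes can reach it
network.host: 192.168.1.101

# Unicast discovery: list the nodes of the cluster
discovery.zen.ping.unicast.hosts: ["192.168.1.101", "192.168.1.102", "192.168.1.103"]

# Avoid split brain: (number of master-eligible nodes / 2) + 1
discovery.zen.minimum_master_nodes: 2
```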
This isn't true, just FYI. Node names are uncoupled from the data directory; the data directory always uses node numbers (0, 1, 2, etc.). Perhaps you're thinking of the cluster name, which must match the cluster name in the data directory?
Default ES installations randomly pick node names from a long list of Marvel comic characters, which is why you're seeing them change after a restart.
I just quickly skimmed the thread and that caught my eye. I'll re-read the whole thing and see if I can offer some help.