I suspect my Elasticsearch is not set up correctly to handle the massive amount of logs I have.
Logs from this morning:
[2016-07-22 04:01:50,401][WARN ][cluster.action.shard ] [Margali Szardos] [filebeat-2016.07.22][1] received shard failed for target shard [[filebeat-2016.07.22][1], node[8Axw9bfLQ1ejwYGk67tnMg], [P], v[4], s[INITIALIZING], a[id=M8E1OD30SFOCSflBjG4wgw], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-07-22T02:54:28.460Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: OutOfMemoryError[Java heap space]; ]]], indexUUID [44N8S3trRMiecFHILJYO_w], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: OutOfMemoryError[Java heap space]; ]
[filebeat-2016.07.22][[filebeat-2016.07.22][1]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: OutOfMemoryError[Java heap space];
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [filebeat-2016.07.22][[filebeat-2016.07.22][1]] EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: OutOfMemoryError[Java heap space];
Caused by: java.lang.OutOfMemoryError: Java heap space
When I try to start Elasticsearch:
[2016-07-22 14:36:58,579][INFO ][node ] [Aldebron] version[2.3.4], pid[40853], build[e455fd0/2016-06-30T11:24:31Z]
[2016-07-22 14:36:58,579][INFO ][node ] [Aldebron] initializing ...
[2016-07-22 14:36:59,138][INFO ][plugins ] [Aldebron] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2016-07-22 14:36:59,164][INFO ][env ] [Aldebron] using [1] data paths, mounts [[/data (/dev/sdb1)]], net usable_space [1.2tb], net total_space [1.6tb], spins? [possibly], types [xfs]
[2016-07-22 14:36:59,164][INFO ][env ] [Aldebron] heap size [30.8gb], compressed ordinary object pointers [true]
[2016-07-22 14:36:59,164][WARN ][env ] [Aldebron] max file descriptors [65535] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-07-22 14:37:01,826][INFO ][node ] [Aldebron] initialized
[2016-07-22 14:37:01,826][INFO ][node ] [Aldebron] starting ...
[2016-07-22 14:37:02,613][INFO ][transport ] [Aldebron] publish_address {127.0.0.1:9301}, bound_addresses {127.0.0.1:9301}
[2016-07-22 14:37:02,629][INFO ][discovery ] [Aldebron] elasticsearch/akJzSybBRlObSKslNr3dVQ
[2016-07-22 14:37:32,632][WARN ][discovery ] [Aldebron] waited for 30s and no initial state was set by the discovery
[2016-07-22 14:37:32,679][INFO ][http ] [Aldebron] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2016-07-22 14:37:32,679][INFO ][node ] [Aldebron] started
[2016-07-22 14:37:34,473][DEBUG][action.admin.indices.create] [Aldebron] no known master node, scheduling a retry
[2016-07-22 14:37:50,729][INFO ][discovery.zen ] [Aldebron] failed to send join request to master [{Margali Szardos}{8Axw9bfLQ1ejwYGk67tnMg}{127.0.0.1}{127.0.0.1:9300}], reason [RemoteTransportException[[Margali Szardos][127.0.0.1:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[Aldebron][127.0.0.1:9301] connect_timeout[30s]]; ]
[2016-07-22 14:38:34,476][DEBUG][action.admin.indices.create] [Aldebron] timed out while retrying [indices:admin/create] after failure (timeout [1m])
[2016-07-22 14:38:34,483][WARN ][rest.suppressed ] path: /_bulk, params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]
How are you starting Elasticsearch? The /etc/default/elasticsearch settings look OK, but they do not appear to be used, since your htop output is showing -Xms256m -Xmx1g.
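For reference, on the 2.x Debian/RPM packages the heap is set through ES_HEAP_SIZE in /etc/default/elasticsearch (or /etc/sysconfig/elasticsearch), and it is only picked up when Elasticsearch is started as the service. Something along these lines (the 4g below is only an illustration; size it to roughly half your RAM):

```
# /etc/default/elasticsearch (Debian) or /etc/sysconfig/elasticsearch (RPM)
# Only applied when Elasticsearch is launched via the init script / service.
# Rule of thumb: ~50% of RAM, but below ~31g so compressed oops stay enabled.
ES_HEAP_SIZE=4g
```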
My Elasticsearch sorted itself out last night. It's back up and running, but I lost all my data.
Elasticsearch does not work like that; it is either configured correctly or it is not. I suspect that you are starting Elasticsearch manually using bin/elasticsearch ..., rather than as the service. Starting it as the service will use /etc/default/elasticsearch and /etc/elasticsearch/elasticsearch.yml. These will use default locations (unless otherwise configured): Directory layout | Elasticsearch Guide | Elastic
If you started it manually, the defaults relative to where you ran it would have been used instead. You probably did not lose all your data; rather, it is at a different location (and possibly under a different cluster name, e.g. the default, elasticsearch).
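A quick way to check is to look for the nodes directory on disk; the paths below are the usual 2.x defaults, so adjust them if you set path.data to something else:

```
# Started as the service (Debian/RPM package): data lives under /var/lib/elasticsearch,
# inside a folder named after the cluster (default cluster name is "elasticsearch")
ls /var/lib/elasticsearch/elasticsearch/nodes/0/indices

# Started manually from a tarball: data lives under the extraction directory
ls ./data/elasticsearch/nodes/0/indices
```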
My Elasticsearch was working fine for a week before this happened. Then I lost my data; I tried restarting it and a few other things, but nothing worked. The next day it was working again, but all my data was lost.
I am running elasticsearch as a service and only use the service command:
```
service elasticsearch status
elasticsearch is running
```
I can still see that my data is in the location I configured. But I can't access it within Elasticsearch.
Do you know how to prevent Elasticsearch from doing this again? What do I need to set up to handle masses of logs?
Do your current cluster name and node name match those from before you restarted ES?
How many ES nodes are you running?
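A quick way to check both is against the running node (assuming it is listening on localhost:9200):

```
# Shows the node name ("name") and the cluster_name of this node
curl -XGET "http://localhost:9200/"

# Lists every node currently joined to the cluster
curl -XGET "http://localhost:9200/_cat/nodes?v"
```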
Your logs are not really that big, so they should not be the cause of the issue. I think it is more likely something in your ES config. Two ES nodes can easily handle terabytes of logs.
Why would you have a different node name after a restart? Did you change node.name: from Zodiak to Aldebron in the config file before restarting?
Have you renamed the data folder to match the new node name and updated path.data? An easier way would be to change node.name: back to Zodiak and restart ES.
How many shards and nodes do I need for my setup? And are there any other settings I need to set in the config file?
```
$ curl -XGET "http://localhost:9200/_cat/shards?v"
index               shard prirep state      docs     store   ip        node
filebeat-2016.07.24 3     p      STARTED    84504    13.3mb  127.0.0.1 Zodiak
filebeat-2016.07.24 3     r      UNASSIGNED
filebeat-2016.07.24 4     p      STARTED    83849    13.1mb  127.0.0.1 Zodiak
filebeat-2016.07.24 4     r      UNASSIGNED
filebeat-2016.07.24 2     p      STARTED    84337    13.2mb  127.0.0.1 Zodiak
filebeat-2016.07.24 2     r      UNASSIGNED
filebeat-2016.07.24 1     p      STARTED    84036    13.1mb  127.0.0.1 Zodiak
filebeat-2016.07.24 1     r      UNASSIGNED
filebeat-2016.07.24 0     p      STARTED    84295    13.1mb  127.0.0.1 Zodiak
filebeat-2016.07.24 0     r      UNASSIGNED
filebeat-2016.07.23 3     p      STARTED    107      143.7kb 127.0.0.1 Zodiak
filebeat-2016.07.23 3     r      UNASSIGNED
filebeat-2016.07.23 4     p      STARTED    99       132.8kb 127.0.0.1 Zodiak
filebeat-2016.07.23 4     r      UNASSIGNED
filebeat-2016.07.23 2     p      STARTED    85       121.9kb 127.0.0.1 Zodiak
filebeat-2016.07.23 2     r      UNASSIGNED
filebeat-2016.07.23 1     p      STARTED    90       132.4kb 127.0.0.1 Zodiak
filebeat-2016.07.23 1     r      UNASSIGNED
filebeat-2016.07.23 0     p      STARTED    108      135.6kb 127.0.0.1 Zodiak
filebeat-2016.07.23 0     r      UNASSIGNED
filebeat-2016.07.26 3     p      STARTED    12484844 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 3     r      UNASSIGNED
filebeat-2016.07.26 4     p      STARTED    12483196 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 4     r      UNASSIGNED
filebeat-2016.07.26 2     p      STARTED    12484632 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 2     r      UNASSIGNED
filebeat-2016.07.26 1     p      STARTED    12484105 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 1     r      UNASSIGNED
filebeat-2016.07.26 0     p      STARTED    12476659 7.5gb   127.0.0.1 Zodiak
filebeat-2016.07.26 0     r      UNASSIGNED
filebeat-2016.07.25 3     p      STARTED    29726030 19.1gb  127.0.0.1 Zodiak
filebeat-2016.07.25 3     r      UNASSIGNED
filebeat-2016.07.25 4     p      STARTED    29733489 19.1gb  127.0.0.1 Zodiak
filebeat-2016.07.25 4     r      UNASSIGNED
filebeat-2016.07.25 2     p      STARTED    29725998 19.1gb  127.0.0.1 Zodiak
filebeat-2016.07.25 2     r      UNASSIGNED
filebeat-2016.07.25 1     p      STARTED    29728050 19.1gb  127.0.0.1 Zodiak
filebeat-2016.07.25 1     r      UNASSIGNED
filebeat-2016.07.25 0     p      STARTED    29724972 19gb    127.0.0.1 Zodiak
filebeat-2016.07.25 0     r      UNASSIGNED
.kibana             0     p      STARTED    31       54kb    127.0.0.1 Zodiak
.kibana             0     r      UNASSIGNED
filebeat-2016.07.28 3     p      STARTED    2041727  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 3     r      UNASSIGNED
filebeat-2016.07.28 4     p      STARTED    2040642  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 4     r      UNASSIGNED
filebeat-2016.07.28 2     p      STARTED    2038883  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 2     r      UNASSIGNED
filebeat-2016.07.28 1     p      STARTED    2038639  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 1     r      UNASSIGNED
filebeat-2016.07.28 0     p      STARTED    2035531  1.1gb   127.0.0.1 Zodiak
filebeat-2016.07.28 0     r      UNASSIGNED
filebeat-2016.07.27 3     p      STARTED    6865087  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 3     r      UNASSIGNED
filebeat-2016.07.27 4     p      STARTED    6865035  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 4     r      UNASSIGNED
filebeat-2016.07.27 2     p      STARTED    6869406  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 2     r      UNASSIGNED
filebeat-2016.07.27 1     p      STARTED    6865712  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 1     r      UNASSIGNED
filebeat-2016.07.27 0     p      STARTED    6865905  3.8gb   127.0.0.1 Zodiak
filebeat-2016.07.27 0     r      UNASSIGNED
filebeat-2016.07.22 3     p      STARTED    3353     2.4mb   127.0.0.1 Zodiak
filebeat-2016.07.22 3     r      UNASSIGNED
filebeat-2016.07.22 4     p      STARTED    3362     2.5mb   127.0.0.1 Zodiak
filebeat-2016.07.22 4     r      UNASSIGNED
filebeat-2016.07.22 2     p      STARTED    3446     2.5mb   127.0.0.1 Zodiak
filebeat-2016.07.22 2     r      UNASSIGNED
filebeat-2016.07.22 1     p      STARTED    3443     2.5mb   127.0.0.1 Zodiak
filebeat-2016.07.22 1     r      UNASSIGNED
filebeat-2016.07.22 0     p      STARTED    3393     2.5mb   127.0.0.1 Zodiak
filebeat-2016.07.22 0     r      UNASSIGNED
```
Based on this output, your cluster is working fine. The unassigned shards (all r, i.e. replicas) exist because you are running only one ES node. Add another ES node to your cluster and those shards will be assigned to the second node.
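If you want to stay on a single node for now, another option is to drop the replica count so the cluster stops reporting unassigned shards; this only removes the replica copies, the primaries are untouched (the wildcard below assumes all your indices match filebeat-*):

```
# Single-node setup: set replicas to 0 on the existing filebeat indices
curl -XPUT "http://localhost:9200/filebeat-*/_settings" -d '
{
  "index": { "number_of_replicas": 0 }
}'
```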
It appears to me that you are trying to add a second node to your cluster, but network.publish_host: is set to 127.0.0.1 by default, so the nodes cannot communicate with each other. A sample elasticsearch.yml for a 3-node cluster could look like this (the cluster name, node names, and addresses below are placeholders):
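```
# elasticsearch.yml -- illustrative only; cluster name, node names, and IPs are placeholders
cluster.name: my-cluster                  # must be identical on all three nodes
node.name: node-1                         # unique per node (node-2, node-3 on the others)

# Bind/publish on the host's real address so the other nodes can reach it
network.host: 192.168.1.101

# Unicast discovery: list the nodes of the cluster
discovery.zen.ping.unicast.hosts: ["192.168.1.101", "192.168.1.102", "192.168.1.103"]

# Avoid split brain: (number of master-eligible nodes / 2) + 1
discovery.zen.minimum_master_nodes: 2
```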
This isn't true, just FYI. Node names are uncoupled from the data directory; the data directory always uses node numbers (0, 1, 2, etc.). Perhaps you're thinking of the cluster name, which must match the cluster name in the data directory?
Default ES installations randomly pick node names from a long list of Marvel comic characters, which is why you're seeing them change after a restart.
I just quickly skimmed the thread and that caught my eye. I'll re-read the whole thing and see if I can offer some help.