Elasticsearch Cluster D/C Issues: the other nodes see one node, but it doesn't see them


(Michel Laporte) #1

Hi,

I have a bit of an issue.

I have ES running as a cluster consisting of 4 nodes (2 data and 3 master-eligible; one node is both).

I am using Graylog as the data collector and ES to store the information.

This is my current setup:

Data Nodes (Node Names):
Gray-003
Gray-006

Master Eligible (Node Names):
Gray-003 (Data and Master)
Gray-004 - (Master only no data)
Gray-005 - (Master only no data)

My problem is that all the other ES nodes can see Gray-006, but every few hours Gray-006 stops seeing the other ES servers. Weird, right?


(This picture shows KOPF from Gray-003. It can see Gray-006 and its stats.)

However, KOPF on Gray-006 says it's offline (192.168.32.73).

After a few minutes or so it connects back up and KOPF identifies itself again.

Now, all the other ES servers are at a data centre; the data-only node (Gray-006) is at our office and is connected over a leased line. The line has no issues whatsoever, as we run all the office file servers etc. from the DC without any problems.

When I cycle the deflector to create an index (I have 4 primary shards and 2 replicas), the shards get spread evenly over the 2 data nodes. However, I also have an automatic rotation every 24 hours, and when that runs, all the primary shards get allocated to the Gray-003 node and all the replicas reside on Gray-006.
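One way to check where the shard copies actually land after each rotation is to tally the output of the `_cat/shards` API per node. Below is a minimal Python sketch, assuming the default column order of that API (`index shard prirep state docs store ip node`). The function name and the sample rows (including the index name `graylog_1`) are made up for illustration and are not taken from the cluster described here.

```python
from collections import Counter, defaultdict


def count_shard_copies(cat_shards_text):
    """Tally started primary ('p') and replica ('r') shard copies per node.

    Expects the plain-text body of GET /_cat/shards with its default columns:
    index shard prirep state docs store ip node
    """
    per_node = defaultdict(Counter)
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Skip blank lines, the optional ?v header row, and rows that are not
        # STARTED (UNASSIGNED rows carry no node column).
        if len(fields) < 8 or fields[3] != "STARTED":
            continue
        prirep = fields[2]
        node = " ".join(fields[7:])  # node names may contain spaces
        per_node[node][prirep] += 1
    return {node: dict(counts) for node, counts in per_node.items()}


# Made-up sample rows, for illustration only.
SAMPLE = """\
graylog_1 0 p STARTED 10 1kb 192.168.16.130 ess-lon-gray-003_master
graylog_1 0 r STARTED 10 1kb 192.168.32.73 ess-ukh-gray-006_DATA
graylog_1 1 p STARTED 12 1kb 192.168.16.130 ess-lon-gray-003_master
graylog_1 1 r STARTED 12 1kb 192.168.32.73 ess-ukh-gray-006_DATA
graylog_1 1 r UNASSIGNED
"""

print(count_shard_copies(SAMPLE))
```

Running this against the real `_cat/shards` output right after a manual cycle and again after the automatic rotation would show whether the primaries really all end up on one node.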

I have tried upping the zen ping timeout on Gray-006, but it still disconnects from the cluster every few hours, and when it comes back up I can't search the ES data from Graylog, as it says it's out of sync.

In the ES logs on Gray-006, I keep getting:

[2016-01-18 10:00:30,131][DEBUG][action.admin.cluster.node.stats] [ess-ukh-gray-006_DATA] failed to execute on node [_ahdlW2dTjiMT2KHhGMprA]
org.elasticsearch.transport.NodeDisconnectedException: [gray004][inet[/192.168.16.131:9350]][cluster:monitor/nodes/stats[n]] disconnected

Any help is greatly appreciated,

Thank you,

P.S

I also keep getting transport disconnect errors when Gray-006 gets removed from the cluster:

[2016-01-18 11:14:47,583][INFO ][discovery.zen ] [ess-ukh-gray-006_DATA] master_left [[ess-lon-gray-004s][EBo7CIrURj-Su2uV3vozKw][ess-lon-gray-004][inet[/192.168.16.131:9300]]{data=false, master=true}], reason [transport disconnected]
[2016-01-18 11:14:47,585][WARN ][discovery.zen ] [ess-ukh-gray-006_DATA] master left (reason = transport disconnected), current nodes: {[ess-ukh-gray-006_DATA][HJBoxbBKRiibKVnJe3S2qA][ess-ukh-gray-006][inet[/192.168.32.73:9300]]{master=false},[gray-006_Data][frkMUcP9Sma-VsJv2mPvQQ][ess-ukh-gray-006][inet[/192.168.32.73:9350]]{client=true, data=false, master=false},[gray004][_ahdlW2dTjiMT2KHhGMprA][ess-lon-gray-004][inet[/192.168.16.131:9350]]{client=true, data=false, master=false},[gray-003][EwImG6QwS4Clqmsaw3snXA][ess-lon-gray-003][inet[/192.168.16.130:9350]]{client=true, data=false, master=false},[ess-lon-gray-005][lKqyqfY6SzeYg3ejXsDwcA][ess-lon-gray-005][inet[/192.168.16.132:9300]]{data=false, master=true},[ess-lon-gray-003_master][KV90ufrQQc-g1aPV2RoyGA][ess-lon-gray-003][inet[/192.168.16.130:9300]]{master=true},}
[2016-01-18 11:14:47,585][INFO ][cluster.service ] [ess-ukh-gray-006_DATA] removed {[ess-lon-gray-004s][EBo7CIrURj-Su2uV3vozKw][ess-lon-gray-004][inet[/192.168.16.131:9300]]{data=false, master=true},}, reason: zen-disco-master_failed ([ess-lon-gray-004s][EBo7CIrURj-Su2uV3vozKw][ess-lon-gray-004][inet[/192.168.16.131:9300]]{data=false, master=true})
[2016-01-18 11:14:50,887][DEBUG][action.admin.cluster.state] [ess-ukh-gray-006_DATA] no known master node, scheduling a retry
[2016-01-18 11:14:50,902][DEBUG][action.admin.cluster.state] [ess-ukh-gray-006_DATA] no known master node, scheduling a retry
[2016-01-18 11:14:50,903][DEBUG][action.admin.indices.get ] [ess-ukh-gray-006_DATA] no known master node, scheduling a retry
[2016-01-18 11:14:50,907][DEBUG][action.admin.cluster.health] [ess-ukh-gray-006_DATA] no known master node, scheduling a retry
[2016-01-18 11:15:02,623][INFO ][cluster.service ] [ess-ukh-gray-006_DATA] detected_master [ess-lon-gray-004s][
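To measure how often and for how long the node is without a master, the `master_left` and `detected_master` lines in a log like the one above can be paired up. This is a rough Python sketch, assuming the timestamp format shown in these log lines; the function names are hypothetical, and the two sample lines are abbreviated copies of the lines above.

```python
import re
from datetime import datetime

# Matches the leading "[2016-01-18 11:14:47,583]" timestamp of a log line.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def parse_timestamp(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def masterless_windows(lines):
    """Pair each master_left event with the next detected_master event.

    Returns a list of (left_at, rejoined_at, seconds_without_master) tuples.
    """
    windows = []
    left_at = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue
        if " master_left " in line and left_at is None:
            left_at = ts
        elif " detected_master " in line and left_at is not None:
            windows.append((left_at, ts, (ts - left_at).total_seconds()))
            left_at = None
    return windows


# Abbreviated copies of two log lines from this post.
SAMPLE_LINES = [
    "[2016-01-18 11:14:47,583][INFO ][discovery.zen ] [ess-ukh-gray-006_DATA] master_left [[ess-lon-gray-004s]",
    "[2016-01-18 11:15:02,623][INFO ][cluster.service ] [ess-ukh-gray-006_DATA] detected_master [ess-lon-gray-004s][",
]

for left_at, rejoined_at, seconds in masterless_windows(SAMPLE_LINES):
    print(left_at, rejoined_at, seconds)
```

For the excerpt above, the gap between losing the master and detecting it again is about 15 seconds. Running the same pairing over a full day of logs would show whether the drops follow a pattern, such as a fixed interval.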

Thanks


(Mark Walkom) #2

That'd be why though.

Why is it setup like this?


(Michel Laporte) #3

Hi Warkolm,

I had a feeling it was. The reason is that we want the data to be stored at 2 different remote locations.

The data centre storage is on a SAN and the office storage is on a NAS. (We only have 1 storage network at each site.)

Another reason was that the syslog messages from the office get stored on the office machine (Gray-006) and all the others at the DC get stored on the Gray-003 data node.

Is there another way we could do this, in terms of having the data striped across 2 locations? Purely so that if, let's say, the SAN or the NAS has a failure, the data would still be backed up.

Thanks
P.S The link is a dedicated leased 1GB line between the two sites. (About 2 miles away).


(Mark Walkom) #4

https://www.elastic.co/blog/scaling_elasticsearch_across_data_centers_with_kafka has some better ideas.


(Michel Laporte) #5

Thank you so much Mark.

Having read the document, which option would you say would be best for our setup?

So far I'm liking the look of "Independent Elasticsearch Clusters and a Shared Kafka Cluster".

However, the DC and the office are currently one and the same cluster.

Would I have to split it into 2 independent clusters?

Thanks


(Jason Tedor) #6

Have you considered taking periodic snapshots and storing them outside of your datacenter (your office, S3, Azure Cloud Storage, etc.)? Then you can co-locate the nodes in your datacenter if the only reason for not is disaster recovery.
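As a rough illustration of the snapshot approach, the Python sketch below builds the two REST calls involved: registering a shared-filesystem (`fs`) repository and creating a snapshot in it. The endpoint address, the repository name, the location path, and the helper names are all assumptions for illustration; the `location` must be a path that the cluster's nodes can reach, for example a mount of the office NAS.

```python
import json
import urllib.request
from datetime import date

ES_URL = "http://localhost:9200"  # assumption: any node's HTTP endpoint


def register_fs_repo_request(repo, location, es_url=ES_URL):
    """Build PUT /_snapshot/<repo>, registering a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return urllib.request.Request(
        f"{es_url}/_snapshot/{repo}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def daily_snapshot_name(day):
    """Hypothetical naming scheme; snapshot names must be lowercase."""
    return f"snapshot-{day:%Y.%m.%d}"


def create_snapshot_request(repo, snapshot, es_url=ES_URL):
    """Build PUT /_snapshot/<repo>/<snapshot>, blocking until the snapshot is done."""
    return urllib.request.Request(
        f"{es_url}/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
        method="PUT",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A scheduled job could then call `send(register_fs_repo_request("office_backup", "/mnt/es_backups"))` once, followed by `send(create_snapshot_request("office_backup", daily_snapshot_name(date.today())))` on each run. Both the repository name and the path here are placeholders.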


(Michel Laporte) #7

I think that might be the best bet.

Getting it to work as 1 cluster across 2 locations is going to be a pain. With separate clusters there's a workaround using ZooKeeper, by the looks of it.

I guess I'll create 2 data nodes at the DC and leave everything there. Logs from the office will be sent over the leased line to the data nodes. Then I will do a bi-weekly snapshot for disaster recovery.

I guess that should sort it :slight_smile: .. right?


(Jason Tedor) #8

It doesn't hurt to take them more frequently; it comes down to your risk tolerance, and your operations procedures. After you take an initial snapshot, later snapshots piggyback on the earlier snapshots by only copying newly created segments (so snapshots are incremental). Be sure to practice restoring too, to ensure that you have everything in place correctly.
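A restore drill can be sketched the same way. The Python snippet below builds a call to the snapshot `_restore` endpoint and uses the `rename_pattern` and `rename_replacement` options so that the restored copies get a prefix and do not collide with the live indices. The endpoint address, the default prefix, and the function names are assumptions for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: any node's HTTP endpoint


def restore_request(repo, snapshot, indices, prefix="restored_", es_url=ES_URL):
    """Build POST /_snapshot/<repo>/<snapshot>/_restore.

    `indices` is a comma-separated index list or pattern. Every restored
    index is renamed to <prefix><original name>.
    """
    body = {
        "indices": indices,
        "rename_pattern": "(.+)",
        "rename_replacement": prefix + "$1",
    }
    return urllib.request.Request(
        f"{es_url}/_snapshot/{repo}/{snapshot}/_restore",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

After a test restore, comparing document counts between an original index and its `restored_` copy is a simple check that the backup is usable; the copy can then be deleted.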

Yes. :slight_smile:


(Michel Laporte) #9

Amazing!

I will surely do that, reading up about it now :slight_smile:

Thank you so much!

