Elasticsearch appears to be down but isn't and status is green


(Peter) #1

I recently upgraded from ELK 5.x to 6.1, and it has been running fine. We have 12 data nodes, six at each site, with one replica. We keep logs for 30 days and have 11 billion documents using 15TB of storage. We put different kinds of logs in different indexes (e.g. F5, firewall, IIS, Linux), just to decrease the noise a bit. Depending on the amount of data, each index has between one and five shards.

I'm getting errors like this in logstash:

[2018-02-01T10:41:48,087][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elastic:xxxxxx@localhost:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://elastic:xxxxxx@localhost:9200/, :error_message=>"Elasticsearch Unreachable: [http://elastic:xxxxxx@localhost:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}

[2018-02-01T10:41:48,087][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://elastic:xxxxxx@localhost:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>64}
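For anyone reading the `will_retry_in_seconds` value: as far as I know, the Logstash elasticsearch output waits `retry_initial_interval` seconds (default 2) before the first retry and doubles the wait on each attempt up to `retry_max_interval` (default 64). The sketch below only illustrates that documented doubling, assuming the defaults are unchanged; the function name is made up for this example and is not part of Logstash.

```python
def retry_intervals(initial=2, maximum=64, attempts=8):
    """Return successive wait times: start at `initial`, double each time, cap at `maximum`.

    The defaults mirror the documented Logstash output defaults
    (an assumption that they have not been overridden in this setup).
    """
    intervals = []
    wait = initial
    for _ in range(attempts):
        intervals.append(wait)
        wait = min(wait * 2, maximum)
    return intervals


if __name__ == "__main__":
    print(retry_intervals())
```

Read this way, a value of 64 suggests the output had already failed several times in a row, while a value of 2 suggests a first failure after a successful connection.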

I don't see any corresponding error in the Elasticsearch log. I was shipping to the data nodes, but the other day I installed ES on the shippers so that Logstash sends to localhost, which is not a data node. That seemed to be working fine, but at 5:50 this morning it stopped shipping and the above errors started showing up again. The ES cluster is green. In case it matters, Curator runs at 5am.

The data nodes have 48GB RAM with 28GB heap.

In the Logstash config I have this set:

pipeline.workers: 6
pipeline.batch.size: 500
pipeline.batch.delay: 5

It is a 4 vCPU VM and doesn't seem taxed.
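As a rough sizing check on the settings above: Logstash documents the number of in-flight events as `pipeline.workers` multiplied by `pipeline.batch.size`. A small sketch of that arithmetic with the values from this post (the variable names are illustrative only):

```python
# Values taken from the Logstash settings and VM size quoted above.
workers = 6        # pipeline.workers
batch_size = 500   # pipeline.batch.size
vcpus = 4          # vCPUs on the shipper VM

# Upper bound on events held in the pipeline at one time.
in_flight = workers * batch_size

# Ratio of worker threads to available vCPUs.
workers_per_vcpu = workers / vcpus

print(in_flight)
print(workers_per_vcpu)
```

So up to six batches of up to 500 events each can be in flight at once, with more worker threads than vCPUs.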

In ES I have this:

cluster.routing.allocation.awareness.attributes: site
cluster.routing.allocation.awareness.force.zone.values: site1, site2
discovery.zen.minimum_master_nodes: 1

With one master.

I'm not sure what additional information would be helpful. It is running on RHEL 7 using the RPM install and is configured by the Elastic Puppet Forge module.

Any help would be appreciated.

Thanks,

Peter


(Mark Walkom) #2

We’ve renamed ELK to the Elastic Stack, otherwise Beats and APM feel left out! :wink: Check out https://www.elastic.co/elk-stack

Having only one master is bad: you risk a split brain, and you should increase your master count. See https://www.elastic.co/guide/en/elasticsearch/guide/2.x/important-configuration-changes.html#_minimum_master_nodes.
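The linked guide gives the quorum rule as (number of master-eligible nodes / 2) + 1, using integer division. A minimal sketch of that rule (the function name is just for illustration):

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: integer half of the count, plus one."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


if __name__ == "__main__":
    for count in (1, 2, 3, 5):
        print(count, minimum_master_nodes(count))
```

With three master-eligible nodes the setting would be 2, which lets the cluster lose one of them and still elect a master.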

When you say localhost is not a data node, do you mean it is a coordinating node? Is there nothing in the Elasticsearch logs of that node?


(Peter) #3

I just shut down Logstash and ES, then started ES, waited, and started LS. Redis has plenty in it, so LS has started processing logs and shipping them to ES, and all is working. At the bottom of my LS log I see things like this:

[2018-02-02T09:18:08,584][WARN ][logstash.filters.dns ] DNS filter could not perform reverse lookup on missing field {:field=>"f5_server_fqdn"}
[2018-02-02T09:18:09,068][ERROR][logstash.filters.dns ] DNS: timeout on resolving address. {:field=>"f5_client_fqdn", :value=>"204.155.57.254"}

Now I'll wait.

Now I have this. I'm including the error preceding the timeout in case it is helpful:

[2018-02-02T09:21:28,237][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>"agentsmith.uso.bor.usg.edu", :_index=>"noodle-2018.02.02", :_type=>"doc", :_routing=>nil}, #LogStash::Event:0x6545311a], :response=>{"index"=>{"_index"=>"noodle-2018.02.02", "_type"=>"doc", "_id"=>"agentsmith.uso.bor.usg.edu", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [puppetlastrun]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"For input string: "20180202-071736""}}}}}
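This 400 looks like a separate issue from the timeout: the value "20180202-071736" cannot be parsed into whatever type `puppetlastrun` is mapped as (the mapping is not shown here). One possible approach, assuming the value is a timestamp in YYYYMMDD-HHMMSS form and the field is meant to hold a date, is to normalise it to ISO 8601 before indexing. The sketch below only illustrates the format conversion; the function name is hypothetical, and in a real pipeline this would more likely be done in Logstash or in the index mapping.

```python
from datetime import datetime


def normalize_puppet_timestamp(raw):
    """Convert a 'YYYYMMDD-HHMMSS' string (assumed format) to ISO 8601."""
    parsed = datetime.strptime(raw, "%Y%m%d-%H%M%S")
    return parsed.isoformat()


if __name__ == "__main__":
    print(normalize_puppet_timestamp("20180202-071736"))  # 2018-02-02T07:17:36
```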

[2018-02-02T09:21:35,276][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elastic:xxxxxx@localhost:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://elastic:xxxxxx@localhost:9200/, :error_message=>"Elasticsearch Unreachable: [http://elastic:xxxxxx@localhost:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}

[2018-02-02T09:21:35,278][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://elastic:xxxxxx@localhost:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}

[2018-02-02T09:21:36,696][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://elastic:xxxxxx@localhost:9200/"}

But if I run 'curl -s http://localhost:9200/_cluster/health?pretty=true' everything looks great.
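A green status only says that shards are allocated; it says nothing about how long a request takes, and "Read timed out" is about latency. A minimal sketch for timing the same health call from Python, assuming the endpoint answers without authentication as the curl command above does. The function name and the `opener` parameter are illustrative (the latter exists only so the function can be exercised without a live cluster).

```python
import json
import time
import urllib.request


def timed_health(url="http://localhost:9200/_cluster/health",
                 timeout=5.0,
                 opener=urllib.request.urlopen):
    """Fetch cluster health and return (status, elapsed_seconds).

    Raises whatever the opener raises (e.g. a timeout) if no
    response arrives within `timeout` seconds.
    """
    start = time.monotonic()
    with opener(url, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    elapsed = time.monotonic() - start
    return body.get("status"), elapsed
```

Calling `timed_health()` in a loop while Logstash is reporting timeouts would show whether responses on localhost:9200 are merely slow or are failing outright.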

The last two log messages (with some manual domain changes) in the ES log from the same machine, which is where it is trying to connect, are these:

[2018-02-02T09:17:56,567][INFO ][o.e.c.s.ClusterApplierService] [dpindexer1] added {{dpindexer10}{aDppc_YoRMmqrj9A__ledQ}{xpunwy8oRTaT_kkqk6xmtg}{dpindexer10.example.com}{172.16.59.31:9300}{ml.machine_memory=12484726784, site=db300, ml.max_open_jobs=20, ml.enabled=true},{dpkibana}{Wdj2nj8WRTipdF7c0PzkYw}{xUSroVP5S1OY4thTwsaWiA}{dpkibana.example.com}{172.16.59.157:9300}{ml.machine_memory=12484726784, site=db300, ml.max_open_jobs=20, ml.enabled=true},}, reason: apply cluster state (from master [master {upesdata6}{5YFroZKPQ22xqStggrxPKQ}{soFlF1b-QW22SwUQcH7wtA}{upesdata6.bkins.example.com}{10.11.70.20:9300}{ml.machine_memory=50466050048, site=uga, ml.max_open_jobs=20, ml.enabled=true} committed version [715259]])

[2018-02-02T09:17:58,024][INFO ][o.e.c.s.ClusterApplierService] [dpindexer1] added {{dpindexer2}{u4R8kNH9Qhm5ue7It8TH6Q}{YrWCJUgAQN6M5B4sxqgANA}{dpindexer2.example.com}{172.16.59.154:9300}{ml.machine_memory=12480905216, site=db300, ml.max_open_jobs=20, ml.enabled=true},}, reason: apply cluster state (from master [master {upesdata6}{5YFroZKPQ22xqStggrxPKQ}{soFlF1b-QW22SwUQcH7wtA}{upesdata6.bkins.example.com}{10.11.70.20:9300}{ml.machine_memory=50466050048, site=uga, ml.max_open_jobs=20, ml.enabled=true} committed version [715261]])

It seems fine -- no errors.

Thanks,

Peter


(Mark Walkom) #4

And nothing in the Elasticsearch logs?


(Peter) #5

No -- nothing in the ES logs.

However, there might be a breakthrough. It is possible that the host-based firewall was misbehaving: it was allowing connections but may have been interfering with some of the traffic. It is turned off now and everything is working fine. Then again, everything was working fine before and then stopped.

I'll see if it continues to work fine or if it is just a temporary respite.

Peter

