Other ways to check cluster status? 9200 not responding, web UI mostly blank


(py) #1

(ES ver 0.19.3)

dearest ES wizards,

are there other ways to check on cluster state/health/status other than
via curl to 9200 or the web UI?
we have a cluster that needed to be restarted (due to split-brain masters),
and after startup we are unable to reach 9200 or the web UI.
(I am guessing the web UI fills its data and the pretty hosts <-> index[shard#]
mappings from info gathered via 9200, hence why neither works?)

I would like to be able to see the progress of any recovery/initialization/etc.
to judge whether this cluster is recoverable, before declaring it a loss
(and having to wipe/rebuild).
I am able to see a couple of individual index states (ones I've found by
looking in the es-data dir for the indices), but anything node-level or
cluster-level fails, or just hangs.
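For anyone in a similar spot: rather than a single curl that may hang on one node, it can help to probe every node's HTTP port with a short timeout, since nodes often differ in whether they still answer at all. A minimal sketch (hostnames are hypothetical; substitute your own node list):

```python
# Sketch: probe each node's HTTP port with a short timeout to see which
# hosts still answer /_cluster/health. Hostnames below are hypothetical.
import json
import urllib.request


def probe_node(host, timeout=5):
    """Return the node's cluster-health dict, or None if unreachable."""
    url = "http://%s/_cluster/health" % host
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except Exception:
        # Connection refused, DNS failure, timeout, bad response, ...
        return None


def probe_cluster(hosts):
    """Map each host to its reported status, or 'unreachable'."""
    results = {}
    for host in hosts:
        health = probe_node(host)
        results[host] = health["status"] if health else "unreachable"
    return results


if __name__ == "__main__":
    # Hypothetical host list -- substitute your own 10 nodes.
    for host, status in sorted(probe_cluster(["es-node-01:9200",
                                              "es-node-02:9200"]).items()):
        print(host, status)
```

Running this across all nodes at least separates "HTTP is down everywhere" from "only some nodes are wedged".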
after startup, I tailed the logs on each of the 10 hosts and saw them elect a
master and run some recoveries..... then queries started coming in... it
appears that some hosts have high load and are serving queries, while
others (perhaps with fewer hot shards, or maybe no recoverable indexes?) are
fairly idle and are spewing "Failed to execute fetch phase:
org.elasticsearch.transport.Remote..." and
"org.elasticsearch.search.SearchContextMissingException: No search context
found for id [70373]" - just guessing a few indexes are missing shards or
are just not available?
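That missing-shards guess can be checked directly once any node answers HTTP: the cluster state lists the routing state of every shard copy. A sketch of walking that JSON for shards that are not STARTED, assuming a routing_table layout like later ES releases (the exact response shape in 0.19 may differ):

```python
# Sketch: given a parsed /_cluster/state response, list every shard copy
# that is not STARTED (e.g. UNASSIGNED or INITIALIZING).
# Assumes the routing_table JSON layout of later ES releases.

def unstarted_shards(cluster_state):
    """Return (index, shard_number, state) tuples for shard copies
    whose routing state is anything other than STARTED."""
    problems = []
    indices = cluster_state.get("routing_table", {}).get("indices", {})
    for index_name, index_info in indices.items():
        for shard_num, copies in index_info.get("shards", {}).items():
            for copy in copies:
                if copy.get("state") != "STARTED":
                    problems.append((index_name, int(shard_num),
                                     copy.get("state")))
    return problems
```

Feeding it `json.load(...)` of a `curl host:9200/_cluster/state` response shows at a glance which indices are missing shards versus merely recovering.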

side note:
we've been having a lot of issues recently with failed hosts, or hosts
dropping out [logs say timeouts] (presumably due to load or network issues).
I had to restart the cluster a few times to get one master to stick (had to
set some of the hotter nodes to node.master: false - otherwise the master
got too loaded and timed out, causing various cluster hosts to elect a new
master). Is it possible our cluster has some corrupt state, is just too
overloaded, or just has a bad configuration? The HW all seems to check
out: beefy-ish boxes with 48G RAM, RAID 10 across 6 drives, and a 16G JVM
heap (not sure why this number).
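On the split-brain side: with 10 hosts, one common guard (assuming the zen discovery settings available in the 0.19 line) is to restrict which nodes may become master and require a quorum of them before any master is elected. A hypothetical elasticsearch.yml sketch, not a drop-in fix:

```yaml
# Hypothetical elasticsearch.yml fragments -- adjust to your own node roles.

# On a small set of dedicated master-eligible nodes (say three):
node.master: true
node.data: false

# On the hot data nodes (as above, keep them out of master elections):
node.master: false
node.data: true

# On every node: require a quorum of master-eligible nodes
# (3 eligible -> quorum of 2) before a master is elected, so the two
# sides of a network partition cannot each elect their own master.
discovery.zen.minimum_master_nodes: 2
```

Keeping the election off the overloaded query-serving nodes also reduces the timed-out-master churn described above.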

much thanks for reading!
any input / advice / suggestions greatly appreciated

  • PU



(Otis Gospodnetić) #2

Hi,

You could use BigDesk or SPM for ElasticSearch (which uses HBase in the
backend... I hear you StumbleUpon guys like HBase ;)), and maybe the
elasticsearch-head plugin, to troubleshoot.
I assume you have explored things like
http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-stats.html.
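As a concrete use of that stats endpoint: pulling per-node JVM heap usage out of the response is often enough to spot an overloaded master like the one described above. A sketch, assuming a per-node `jvm.mem` section with used/max heap bytes (field names vary across ES versions):

```python
# Sketch: summarize per-node JVM heap usage from a parsed nodes-stats
# response. Assumes jvm.mem.heap_used_in_bytes / heap_max_in_bytes fields;
# exact field names differ between ES versions.

def heap_report(nodes_stats, warn_pct=85):
    """Return {node_name: (heap_used_percent, over_threshold)}."""
    report = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        mem = node.get("jvm", {}).get("mem", {})
        used = mem.get("heap_used_in_bytes")
        max_bytes = mem.get("heap_max_in_bytes")
        if not used or not max_bytes:
            continue  # node missing JVM stats -- skip it
        pct = 100.0 * used / max_bytes
        report[node.get("name", node_id)] = (round(pct, 1), pct >= warn_pct)
    return report
```

Run against the JSON from each node's stats endpoint, it flags nodes running hot on heap before they start timing out of the cluster.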

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

On Monday, October 8, 2012 4:37:33 AM UTC-4, py wrote:



(Lukáš Vlček) #3

You might also want to look at the paramedic plugin. Or build your own
customized web GUI; it requires some work, but it's definitely possible.

Lukas
On 9.10.2012 at 5:32, "Otis Gospodnetic" otis.gospodnetic@gmail.com
wrote:



(py) #4

Ha! HBase!
I actually did look at SPM - looks pretty darn cool!

Thanks Otis and Lukas for the suggestions!

btw, it turned out we hit a bug in 0.19.3 (we think), combined with the fact
that our servers were so hammered with query load that the cluster was
unable to recover after node failures, which also happened to leave port
9200 nonresponsive. After we stopped the query load, that gave the cluster
enough breathing room to recover. Also, there were some crons still trying
to build indices against the split-brain master... I think we were pummeling
ourselves to death.

btw props to Shay... He's the f*king Man!

On Tuesday, October 9, 2012 3:20:25 AM UTC-7, Lukáš Vlček wrote:


