Indices are missing. Help!

Matthew_Eash · January 7, 2015, 9:15pm

I have a 3 node ES 1.4.1 cluster that runs on CentOS6, Oracle JDK 1.7.0_67.
Heap was set to 20G of the 32G on those boxes, with mlockall set.
Configuration is currently tuned more for bulk loading than for
searching. The purpose of the ES cluster is time-series indexing of
logged metrics. I originally had one larger index (1.1B docs) tracking a
high frequency metric over the past several months, but recently changed
schema design to do an index per-day. I was loading additional metrics as
well as reimporting the data in that larger index into per-day. ES search
usage is very light at the moment.

Last night, I had finished a multi-day bulk import of several months worth
of multiple log metrics into per-day indices. The per-day indices were all
either 12M or 48M records with settings of {shards=4, replication=0,
refresh_interval=-1} while I bulk loaded. Once a day was fully bulk
loaded and needed no more writes, I optimized its index to 1 segment
(taking 30-45s); ultimately I was going to set
{replication=1,refresh_interval=30s} once all were individually optimized.
As of last night, I was about 1/4 of the way through optimizing, and none
of them (beyond the larger index) were replicated.
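
For reference, the per-day workflow described above can be sketched roughly as the following sequence of REST requests. This is only a sketch under assumptions: the helper name `per_day_index_requests` and the phase names are made up for illustration, the endpoints are the ES 1.x update-settings and `_optimize` APIs, and the shard count is left out because it is fixed when the index is created. The function only builds the requests; it does not send them.

```python
import json


def per_day_index_requests(index, phase):
    """Build (method, path, body) tuples for one phase of the per-day workflow.

    Hypothetical helper for illustration; nothing is sent to the cluster.
    """
    if phase == "bulk_load":
        # No replicas and no periodic refresh while bulk loading.
        settings = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}
        return [("PUT", "/%s/_settings" % index, json.dumps(settings, sort_keys=True))]
    if phase == "optimize":
        # Merge the finished day down to a single segment (ES 1.x _optimize API).
        return [("POST", "/%s/_optimize?max_num_segments=1" % index, None)]
    if phase == "serve":
        # Turn replication and refresh back on once the day is optimized.
        settings = {"index": {"number_of_replicas": 1, "refresh_interval": "30s"}}
        return [("PUT", "/%s/_settings" % index, json.dumps(settings, sort_keys=True))]
    raise ValueError("unknown phase: %r" % (phase,))


if __name__ == "__main__":
    for phase in ("bulk_load", "optimize", "serve"):
        for method, path, body in per_day_index_requests("temp-2014-11-14", phase):
            print(method, path, body or "")
```

At the time of the failure, roughly a quarter of the per-day indexes had been through the "optimize" phase and none had reached "serve".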

After bulk import was done, I was poking around ES API, not really doing
anything extraordinary (some searches, some optimization/merges of
individual per-day indexes that I had done even while bulk importing). At
that time, some event ultimately spun out 2 of the nodes, making them
inaccessible. I'm still trying to diagnose what exactly occurred - this is
not the first occurrence of this mystery spin-out of a node, but I never had
2 go at once. I believe the JVM is locking up the kernel somehow. I could
ping them, but could not access the machines in any way. Through the night
it seems the inaccessible machines occasionally attempted to reestablish
the cluster only to disappear again. The remaining node just flailed,
attempting to establish master most of the time.

This morning, I had to have the machines physically rebooted at the
console, as they were still unresponsive.

So - I'm still trying to diagnose what exactly went wrong. I do recall
seeing the heap size on all the nodes start growing to about double the 20G
I had assigned - but am unsure if that caused whatever freeze up occurred.
(Would love to know where to start looking.)

However, my more immediate issue: when the cluster came back up after
reboot, only 1 index is showing, my original, larger, replicated 1.1B-doc
index. *All of my per-day indexes created over the past 2 weeks are
completely missing in ES.* Yesterday /_cat/indices showed 276 happy
green indexes; today it shows only 1. After looking at the raw data
directories (split across 2 volumes on local spinning disks), it's all
still there... all index directories exist and within them I see all the
raw Lucene shard dirs and segment files.
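
The comparison between what is on disk and what ES reports could be scripted along these lines. It is a minimal sketch, and the layout is an assumption rather than something confirmed in this thread: it presumes each data volume has an `indices` directory (something like `<path.data>/<cluster_name>/nodes/0/indices`) with one subdirectory per index, and that an index's metadata lives in a `_state` subdirectory next to the shard directories. The function name `find_unlisted_indices` is hypothetical.

```python
import os


def find_unlisted_indices(indices_dirs, listed_indices):
    """Return {index_name: has_state_dir} for index directories found on
    disk that are not in the list of indices the cluster reports.

    indices_dirs: the "indices" directory on each data volume (assumed layout).
    listed_indices: index names taken from /_cat/indices.
    """
    listed = set(listed_indices)
    on_disk = {}
    for indices_dir in indices_dirs:
        if not os.path.isdir(indices_dir):
            continue  # volume not mounted or path mistyped
        for name in sorted(os.listdir(indices_dir)):
            index_dir = os.path.join(indices_dir, name)
            if not os.path.isdir(index_dir):
                continue
            # Assumption: per-index metadata sits in <index>/_state.
            has_state = os.path.isdir(os.path.join(index_dir, "_state"))
            # An index may be split across volumes; one _state dir is enough.
            on_disk[name] = on_disk.get(name, False) or has_state
    return {n: s for n, s in on_disk.items() if n not in listed}
```

An index that comes back with `False` has shard data but no metadata directory on any volume, which would be the first place to look when the gateway recovers fewer indices than exist on disk.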

Since the cluster reboot, only this stands out in the logs, from the master
node:
[2015-01-07 12:22:14,348][INFO ][gateway ] [node3] recovered [1] indices into cluster_state
[2015-01-07 12:22:14,440][INFO ][indices.store ] [node3] Failed to open / find files while reading metadata snapshot

Subsequent reboots show only 1 index recovered and don't have the
metadata failure message.

Is there any way to fix the index metadata to reestablish the indices that
were all there yesterday, and still exist on the disk? How do I go about
cleaning this up? I am finding nothing in ES documentation talking about
internal index metadata (where it's stored, how to fix corruption, or
anything about this error message).

I want to root cause the node failures that occurred - but that is likely a
deep issue that will take a while to research/diagnose. My more immediate
need is getting those indexes back first! Any attempt to see or deal with
those indices now gets an IndexMissingException.

My only clue so far as to why this occurred is that one of the failing nodes
kept trying to reestablish a 2-node cluster with itself as master through
the night with the lone working node, then kept failing and dropping the
other node from the cluster. During that time, and after the new master found
itself alone, this appeared in the log for many of the per-day indexes:
[2015-01-07 00:05:20,254][DEBUG][action.admin.indices.stats] [node1] [temp-2014-11-14][3], node[fwGNfUZJTmmkAj4hpCobWg], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7b542938]
org.elasticsearch.transport.NodeDisconnectedException: [node3][inet[/172.16.0.34:9300]][indices:monitor/stats[s]] disconnected

This occurred again 2 hours later. Would the master then expel the index
after stat request failures?
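
To see which indexes were hit by these stats failures and how often, the log could be tallied with something like the sketch below. The regular expression is derived only from the single excerpt above, so it is an assumption that other failure lines follow the same shape; the function name `count_stats_failures` is hypothetical.

```python
import re
from collections import Counter

# Matches the DEBUG stats-failure line quoted above. Whitespace is matched
# loosely so the pattern also works when the line has been hard-wrapped.
STATS_FAILURE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\[DEBUG\]\[action\.admin\.indices\.stats\]\s+"
    r"\[(?P<node>[^\]]+)\]\s+"
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\],\s+"
    r"node\[[^\]]+\],\s+\[[PR]\],\s+s\[[A-Z_]+\]:\s+failed\s+to\s+execute"
)


def count_stats_failures(log_text):
    """Return {index_name: number_of_failed_stats_requests} for a log string."""
    counts = Counter(m.group("index") for m in STATS_FAILURE.finditer(log_text))
    return dict(counts)
```

Comparing the resulting index names with the list of missing indexes would show whether the two sets line up, although a match would only be a correlation and would not by itself show that the master dropped them.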

Any assistance would be greatly appreciated! The cluster is behaving fine at
the moment now that the nodes were rebooted; it is just missing 275 indexes
that are still there on disk...

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e0477411-bf07-4e6c-9d56-4db81a4d6798%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Matthew_Eash · January 9, 2015, 10:01pm

As a followup - it seems there is a major issue with indices created from a
mapping template. I installed a fresh copy of ES 1.4.2 standalone on my
laptop and reproduced the issue I had on my ES 1.4.1 cluster:
disappearing indices on cluster restart.

Would love some insight from ES devs on whether it's possible to get the
"disappeared" indices back and visible in ES.

