Disappearing Data and Unassigned Shards

It's become a recurring problem that our ES cluster crashes, and then when
we bring it back up, shards are missing. We keep tweaking and tuning
parameters, but it's mostly grasping at straws and it continues to
happen. Currently we're running 0.19.9 on a 16-node cluster of m1.xlarge
instances, with RAID0 across the four ephemeral drives (giving each node
1.6TB storage). The heap size for each machine is about half (8GB) of the
available memory (15GB).

We're generally happy with the performance, except for, you know, keeping
it running and alive :-/ Nothing in the logs seems to be an indicator.
Originally we were having some memory issues, but it turns out the heap
size environment variable wasn't getting properly set and that has since
been fixed. We ran into issues about the system mlock limit being set too
low, but that has also since been fixed. What seems to usually happen is
that a node will be unable to ping the master node, and then suddenly a few
more nodes will bite the dust, too. Sometimes bouncing elasticsearch on the
troubled boxes suffices, and sometimes we have to bounce the entire cluster.

Once everything comes back, _cluster/health reports that there are
unassigned shards, and they just sit there like that indefinitely. I've
been trawling through the data directory, hoping to find some clues. For
the shards that sit around unassigned, there are actually no corresponding
index files on /any/ machine:

# Comes back '0' for all of our machines for missing shards
pssh -i --host=fresh-search-{11..26} -l ec2-user \
  'find $ESDIR -wholename "*2012-11-28/7*" -type f | wc -l'

# Comes back as non-zero on each machine that has a copy of the shard, for OK shards
pssh -i --host=fresh-search-{11..26} -l ec2-user \
  'find $ESDIR -wholename "*2012-11-28/8*" -type f | wc -l'

Suffice it to say, this is both disconcerting and frustrating. Any ideas?
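[Editor's note: a quick way to keep an eye on the unassigned-shard count is to poll _cluster/health. The sketch below parses a health document of the kind that endpoint returns; the cluster name and shard counts are illustrative, and in practice you would fetch the JSON from http://<host>:9200/_cluster/health rather than hard-coding it.]

```python
import json

# Illustrative sample of the document _cluster/health returns;
# in practice, fetch this from http://<host>:9200/_cluster/health
health_json = '''{
    "cluster_name": "fresh-search",
    "status": "red",
    "number_of_nodes": 16,
    "active_primary_shards": 11,
    "active_shards": 33,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 3
}'''

def summarize_health(raw):
    """Parse a _cluster/health JSON document and report unassigned shards."""
    health = json.loads(raw)
    status = health["status"]
    unassigned = health["unassigned_shards"]
    print("cluster %s: status=%s, unassigned_shards=%d"
          % (health["cluster_name"], status, unassigned))
    return status, unassigned

status, unassigned = summarize_health(health_json)
```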

--

Hello Dan,

What's the number of shards and replicas?

And what's your configuration regarding recovery? I'm thinking about the
ones listed here:
http://www.elasticsearch.org/guide/reference/modules/gateway/

Also, in the logs do you see something about dangling indices that will be
deleted?

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Thu, Nov 29, 2012 at 7:28 PM, Dan Lecocq dan@seomoz.org wrote:


--

Hi Radu,

Thanks for your response. We've mostly been using 12 shards with 2
replicas, and we're currently using the local gateway with a
recover_after_time of 2 minutes:

 gateway.type: local
 gateway.recover_after_time: 2m

And yes, I hadn't noticed it before, but I am seeing a few lines about
dangling indices: in particular, that an index 'exists on local file
system, but not in cluster metadata' and that it's scheduled to be deleted
in two hours. Why would something like that happen?

On Friday, November 30, 2012 7:13:47 AM UTC-8, Radu Gheorghe wrote:

--

Hello Dan,

I would assume that in your case, on a full cluster restart, recovery
begins after 2 minutes, since the gateway.recover_after_nodes setting
defaults to 1. At that point, many nodes might not have started yet, so
the cluster is not aware of the existence of some indices.

Afterwards, when the other nodes join, the missing indices are reported as
dangling. You can avoid that by setting gateway.recover_after_nodes high
enough that all indices are present on the nodes that have joined before
recovery begins. You might also want to increase
gateway.recover_after_time, to give the other nodes time to start. And
gateway.expected_nodes should be the total number of nodes in your cluster,
so 16.

Since 0.19.8, dangling indices should be automatically imported by default
(gateway.local.auto_import_dangled: yes), but evidently that isn't
happening in your case. So I'd suggest you specify it explicitly in your
config, to prevent such indices from being deleted in the future.
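Putting those suggestions together, the relevant part of elasticsearch.yml
might look something like this (the recover_after_nodes and
recover_after_time values are illustrative; tune them to your tolerance
for delayed recovery):

 gateway.type: local
 gateway.recover_after_nodes: 14
 gateway.recover_after_time: 5m
 gateway.expected_nodes: 16
 gateway.local.auto_import_dangled: yes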

Anyway, I would tweak the recovery settings and see if the problem still
appears.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Tue, Dec 4, 2012 at 12:55 AM, Dan Lecocq dan@seomoz.org wrote:

--

Thanks,

I changed all those settings, and I'm keeping my fingers crossed!

On Tuesday, December 4, 2012 5:00:37 AM UTC-8, Radu Gheorghe wrote:


--