Elasticsearch keeps going into red status, and thinking I have more nodes than I do


(David Reagan) #1

For some reason, Elasticsearch keeps ending up in red status, and I have to
restart it to get it working again. As far as I can tell, restarting
Elasticsearch doesn't lose any data. As I've Googled and troubleshot my
way around, I ran into the fact that Elasticsearch thinks I have two nodes,
which makes no sense, since I only ever built one, and I'm sure no one else
here at work has either.

Here's the list of nodes ES thinks I have:

curl -XGET 'http://localhost:9200/_cluster/nodes?pretty=true'
{
  "ok" : true,
  "cluster_name" : "logstash-webservices",
  "nodes" : {
    "uy9_TOOlQU2cavfcZ7NOUw" : {
      "name" : "Kragoff, Ivan",
      "transport_address" : "inet[/10.225.0.82:9300]",
      "hostname" : "log-indexer-01",
      "version" : "0.90.3",
      "attributes" : {
        "client" : "true",
        "data" : "false"
      }
    },
    "TwqGueloQtyRTrrEVF5H-A" : {
      "name" : "Arabian Knight",
      "transport_address" : "inet[/10.225.0.84:9300]",
      "hostname" : "log-elasticsearch-01",
      "version" : "0.90.3",
      "http_address" : "inet[/10.225.0.84:9200]"
    }
  }
}

log-elasticsearch-01 is my elasticsearch server. log-indexer-01 is the
server running my central logstash instance. It does not have ES installed
on it.

I tried "curl -XPOST
'http://localhost:9200/_cluster/nodes/uy9_TOOlQU2cavfcZ7NOUw/_shutdown'",
but after ES dropped back into red a couple of times and I restarted a
couple of times, log-indexer-01 showed up as a node again.

Is there something about logstash that registers it as a node? Could the
fact that the logstash indexer doesn't have ES on it be why ES keeps going
into red status?
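For reference, here's how I've been sanity-checking the node counts: the cluster health endpoint reports total nodes and data nodes separately. The curl line is the real 0.90.x endpoint, but the response values below are sample output I typed in, not a capture from my cluster:

```shell
# Check cluster health (0.90.x endpoint):
#   curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
# Pulling the two node counts out of a response (sample values, not a capture):
response='{"status":"red","number_of_nodes":2,"number_of_data_nodes":1}'
total=$(echo "$response" | grep -o '"number_of_nodes":[0-9]*' | grep -o '[0-9]*$')
data=$(echo "$response" | grep -o '"number_of_data_nodes":[0-9]*' | grep -o '[0-9]*$')
echo "total=$total data=$data"  # if total > data, the extra node holds no shards
</imports>
</imports>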

Also, is there a good overview of ES written somewhere that would give me
the background knowledge I need to really understand the docs (
http://www.elasticsearch.org/guide/)?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Mark Walkom) #2

Logstash doesn't need an elasticsearch instance locally and it doesn't
register as a node. Can you run a ps on the other host to see what is
running there?



(David Reagan) #3

ps just gives a few items...

ps
  PID TTY          TIME CMD
  406 pts/0    00:00:00 ps
31746 pts/0    00:00:00 sudo
31747 pts/0    00:00:00 su
31754 pts/0    00:00:00 bash

The only things running on the logstash server are logstash, postfix,
fail2ban, nfs_automount, ssh, and basic Linux system stuff.

It's the same on the ES server, except logstash is replaced by ES, and we
have Apache2 for Kibana.

Would ES potentially choke on using an NFS mount for the data dir?
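One thing I've started doing while I watch this is checking whether the data dir's mount point is actually NFS-backed at any given moment. The sample line below mimics a /proc/mounts entry (the server address is made up; the mount point is the one from my stack traces) — in real use you'd feed it `cat /proc/mounts` instead:

```shell
# Is /mounts/ws-data currently an NFS mount? (sample /proc/mounts line;
# 10.225.0.90:/ws-data is a placeholder for the real server and export)
line='10.225.0.90:/ws-data /mounts/ws-data nfs rw,vers=3,addr=10.225.0.90 0 0'
if echo "$line" | awk '$2 == "/mounts/ws-data" && $3 == "nfs" {found=1} END {exit !found}'; then
  echo "NFS mount present"
else
  echo "NFS mount MISSING - ES would write to the local directory underneath"
fi
```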

--David Reagan




(David Reagan) #4

If it helps, I just did a restart of ES while tailing the
/var/log/elasticsearch/logstash-webservices.log file. I got:

[2013-10-31 16:08:24,645][INFO ][node ] [Tinkerer] stopping ...
[2013-10-31 16:08:24,740][INFO ][node ] [Tinkerer] stopped
[2013-10-31 16:08:24,740][INFO ][node ] [Tinkerer] closing ...
[2013-10-31 16:08:24,747][INFO ][node ] [Tinkerer] closed
[2013-10-31 16:08:26,322][INFO ][node ] [Sise-Neg] version[0.90.3], pid[4909], build[5c38d60/2013-08-06T13:18:31Z]
[2013-10-31 16:08:26,323][INFO ][node ] [Sise-Neg] initializing ...
[2013-10-31 16:08:26,329][INFO ][plugins ] [Sise-Neg] loaded [], sites []
[2013-10-31 16:08:28,507][INFO ][node ] [Sise-Neg] initialized
[2013-10-31 16:08:28,507][INFO ][node ] [Sise-Neg] starting ...
[2013-10-31 16:08:28,595][INFO ][transport ] [Sise-Neg] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.225.0.84:9300]}
[2013-10-31 16:08:31,654][INFO ][cluster.service ] [Sise-Neg] new_master [Sise-Neg][pAMOAsnRSiuntvE8nZkq6A][inet[/10.225.0.84:9300]], reason: zen-disco-join (elected_as_master)
[2013-10-31 16:08:31,667][INFO ][discovery ] [Sise-Neg] logstash-webservices/pAMOAsnRSiuntvE8nZkq6A
[2013-10-31 16:08:31,688][INFO ][http ] [Sise-Neg] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.225.0.84:9200]}
[2013-10-31 16:08:31,688][INFO ][node ] [Sise-Neg] started
[2013-10-31 16:08:32,227][INFO ][gateway ] [Sise-Neg] recovered [3] indices into cluster_state
[2013-10-31 16:08:33,769][INFO ][cluster.service ] [Sise-Neg] added {[Thunderbolt][rUiIhT1uSS-6u0sRM2lbTA][inet[/10.225.0.82:9300]]{client=true, data=false},}, reason: zen-disco-receive(join from node[[Thunderbolt][rUiIhT1uSS-6u0sRM2lbTA][inet[/10.225.0.82:9300]]{client=true, data=false}])
[2013-10-31 16:08:46,830][WARN ][index.shard.service ] [Sise-Neg] [logstash-2013.10.31][1] Failed to perform scheduled engine refresh
org.elasticsearch.index.engine.RefreshFailedEngineException: [logstash-2013.10.31][1] Refresh failed
    at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:796)
    at org.elasticsearch.index.shard.service.InternalIndexShard.refresh(InternalIndexShard.java:414)
    at org.elasticsearch.index.shard.service.InternalIndexShard$EngineRefresher$1.run(InternalIndexShard.java:757)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.FileNotFoundException: /mounts/ws-data/log-data/logstash-webservices/nodes/0/indices/logstash-2013.10.31/1/index/_2cs.cfs (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.NIOFSDirectory.createSlicer(NIOFSDirectory.java:88)
    at org.apache.lucene.store.RateLimitedFSDirectory.createSlicer(RateLimitedFSDirectory.java:111)
    at org.elasticsearch.index.store.Store$StoreDirectory.createSlicer(Store.java:459)
    at org.apache.lucene.store.CompoundFileDirectory.<init>(CompoundFileDirectory.java:102)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:116)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
    at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:121)
    at org.apache.lucene.index.ReadersAndLiveDocs.getReadOnlyClone(ReadersAndLiveDocs.java:218)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:377)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:275)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:250)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:240)
    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
    at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:155)
    at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:204)
    at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:777)
    ... 5 more
[2013-10-31 16:08:47,836][WARN ][index.shard.service ] [Sise-Neg] [logstash-2013.10.31][1] Failed to perform scheduled engine refresh
org.elasticsearch.index.engine.RefreshFailedEngineException: [logstash-2013.10.31][1] Refresh failed
    at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:796)
    at org.elasticsearch.index.shard.service.InternalIndexShard.refresh(InternalIndexShard.java:414)
    at org.elasticsearch.index.shard.service.InternalIndexShard$EngineRefresher$1.run(InternalIndexShard.java:757)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.FileNotFoundException: _2ct.fdt
    at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:266)
    at org.apache.lucene.store.RateLimitedFSDirectory.fileLength(RateLimitedFSDirectory.java:65)
    at org.elasticsearch.index.store.Store$StoreIndexOutput.close(Store.java:563)
    at org.apache.lucene.util.IOUtils.close(IOUtils.java:146)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.close(CompressingStoredFieldsWriter.java:135)
    at org.apache.lucene.util.IOUtils.close(IOUtils.java:146)
    at org.apache.lucene.index.StoredFieldsProcessor.flush(StoredFieldsProcessor.java:78)
    at org.apache.lucene.index.TwoStoredFieldsConsumers.flush(TwoStoredFieldsConsumers.java:41)
    at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:80)
    at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:501)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:478)
    at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:615)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:365)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:275)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:250)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:240)
    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
    at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:155)
    at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:204)
    at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:777)
    ... 5 more
    Suppressed: java.io.FileNotFoundException: _2ct.fdx
        at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:266)
        at org.apache.lucene.store.RateLimitedFSDirectory.fileLength(RateLimitedFSDirectory.java:65)
        at org.elasticsearch.index.store.Store$StoreIndexOutput.close(Store.java:563)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsIndexWriter.close(CompressingStoredFieldsIndexWriter.java:206)
        ... 24 more



(David Reagan) #5

I figured out the logstash node: it's a client node, not a data node.
Checking the cluster health shows 2 nodes total and 1 data node. So I can
safely ignore the logstash node.

That said, I'm still having issues with elasticsearch dropping into red
status. I've attached today's log file.

I've tried upping the number of files the system can have open, increasing
the RAM available to the VM, and modifying the init script to set the Java
heap size to 3 GB. All of that seems to have increased the time it takes for
Elasticsearch to go down, but it still goes down.

Most of the errors seem like they have something to do with not finding
files. I have my data directory on an NFS version 3 share. Could NFS have
something to do with this?

One of my next steps is to store the data on the VM's hard drive; we'll see
if that helps.
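For anyone following along, the move to local disk is just a path.data change in elasticsearch.yml, followed by a restart. The local path below is only my placeholder, not something from my actual config:

```yaml
# /etc/elasticsearch/elasticsearch.yml
# Point the data dir at local disk instead of the NFS mount.
# (/var/lib/elasticsearch/data is a placeholder; use whatever local path fits.)
path.data: /var/lib/elasticsearch/data
```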

Meanwhile, do any of you have any suggestions?



(Mark Walkom) #6

You're better off putting logs/configs etc in a gist rather than
distributing them to the hundreds/thousands of people on the list.

What time in the logs does ES fail? I can see a lot of "(No such file or
directory)" errors later in the file, which indicates you're losing your
NFS mount.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(David Reagan) #7

You're right, should have used a gist. Oops... I forgot this was a mailing
list, not a forum.

The latest failure is at the very bottom of the file. I didn't think to
make note of previous ones. :\ I'll do so in the future.

An interesting thing I just found after unmounting my NFS share: a bunch of
elasticsearch files sitting on the local hard drive that should not have
been there. Judging from the dates on the files, it looks like elasticsearch
was writing to both the NFS mount and the local directory the mount was
supposed to be on, as if the mount was unmounting temporarily, or something
like that.

Which is possible since I'm using nfs_automount so that I don't have to
have my mount points in the fstab file...

So I cleaned everything out, and just have it writing to local disk only
right now. We'll see what happens.

--David Reagan



(David Reagan) #8

And I've confirmed that the problem is nfs_automount (
http://my.galagzee.com/2013/07/13/nfs-automount-the-fourth-iteration/). For
some reason, it thinks it's OK to remount the NFS share every few minutes.
I turned the service off, and Elasticsearch did not go into red status
overnight, and no errors showed up in the log.

Thanks Mark, and the rest of the list, for being my duck (
http://en.wikipedia.org/wiki/Rubber_duck_debugging). :) Hopefully if someone
else who uses nfs_automount runs into a similar problem, these posts will
help.
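In case it helps anyone else: instead of nfs_automount, a plain hard mount in /etc/fstab avoids the periodic remounts entirely. The server address and export below are placeholders (the mount point is the real one from my setup):

```
# /etc/fstab
# nfs.example.com:/ws-data is a placeholder for the real NFS server/export.
# "hard" makes the client retry I/O instead of erroring if the server goes away.
nfs.example.com:/ws-data  /mounts/ws-data  nfs  vers=3,hard,intr  0  0
```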

--David Reagan

On Tue, Nov 5, 2013 at 1:34 PM, David Reagan jerrac@gmail.com wrote:

You're right, should have used a gist. Oops... I forgot this was a mailing
list, not a forum.

The latest failure is at the very bottom of the file. I didn't think to
make note of previous ones. :\ I'll do so in the future.

An interesting thing I just found, after unmounting my nfs share, was a
bunch of elasticsearch files sitting on the local hard drive that should
not have been there. Judging from the dates on the files, it looks like
elasticsearch was writing to both the nfs mount, and the local directory
the mount was supposed to be on. Like the mount was unmounting temporarily,
or something like that.

Which is possible since I'm using nfs_automount so that I don't have to
have my mount points in the fstab file...

So I cleaned everything out, and just have it writing to local disk only
right now. We'll see what happens.

--David Reagan

On Tue, Nov 5, 2013 at 1:25 PM, Mark Walkom markw@campaignmonitor.comwrote:

You're better off putting logs/configs etc in a gist rather than
distributing it to the hundreds/thousands of people on list.

What time in the logs does ES fail? I can see a lot of "(No such file or
directory)" errors later in the file, which indicates you're losing your
NFS mount.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

On 6 November 2013 08:21, David Reagan jerrac@gmail.com wrote:

I figured out the logstash node. It's a data node, or it isn't a data
node. Checking the health shows 2 nodes total, and 1 data node. So I can
safely ignore the logstash node.

That said, I'm still having issues with elasticsearch dropping into red
status. I've attached today's log file.

I've tried upping the number of files the system can have open,
increased the RAM available to the VM, and modified the init script to set
the java heap size to 3G. That all seems to have increased the time it
takes for elasticsearch to go down, but it still goes down.

Most of the errors seem like they have something to do with not finding
files. I have my data directory on an NFS version 3 share. Could NFS have
something to do with this?

One of my next steps is to store the data on the VM's hard drive, we'll
see if that helps.

Meanwhile, do any of you have any suggestions?

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/AlvKlXJtct4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(system) #9