Elasticsearch with 40+ nodes: missing shards and indexing troubles

Hi everyone,

I'm currently on a project that requires more than 40 nodes (a migration
from a Sphinx cluster to ES), and I'm having a hard time managing
everything. I was able to bulk index everything (6+ billion documents) into
the cluster with the wonderdog project from infochimps.

After that I set the refresh interval to 60s and the merge factor to 10,
and turned replication up to 1 (which takes a lot of time to complete,
btw). But then I tried to index data from Elastica (the PHP client) and got
a lot of "UnavailableShardsException[[index][21] [1] shardIt, [0] active :
Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@6a7604b7]" errors. I didn't
know what that meant, so I did a full cluster restart (just in case a node
was broken), and when the cluster came back up, 33 shards were missing from
the cluster. Does somebody have an idea what might have happened? Moreover,
I use bulk requests (size 100) but I can't seem to index at 2K docs/sec. I
think "indexer-only" nodes would be great for my case, but that isn't
possible in ES, or am I wrong?
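For reference, bulk requests like the ones mentioned above (batches of 100) go to the `_bulk` endpoint as newline-delimited JSON, one action line plus one source line per document. A minimal sketch of building those bodies in Python; the index and type names are hypothetical, and the HTTP call itself is left out:

```python
import json

def bulk_payload(docs, index, doc_type, batch_size=100):
    """Yield newline-delimited JSON bodies for the ES _bulk API,
    one payload per batch of `batch_size` documents.

    Each doc is a dict with an "id" key; the rest becomes the source.
    """
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            # action/metadata line, then source line (the _bulk wire format)
            lines.append(json.dumps(
                {"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}))
            lines.append(json.dumps({k: v for k, v in doc.items() if k != "id"}))
        # the bulk body must end with a trailing newline
        yield "\n".join(lines) + "\n"
```

Batch size is often better tuned by payload bytes than by document count, so a fixed 100 docs per request may simply be on the small side.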

I'm kind of depressed right now because this is the 2nd time something like
this has happened to me. Has anybody ever managed an ES cluster of this
size? I'm using tmux over ssh to manage the servers, but it's kind of a
pain. I'm using elasticsearch-head and BigDesk to monitor cluster health.

One other thing: it seems I get a lot of timeouts between the nodes. I
switched from multicast to unicast discovery to try to fix that, but the
cluster recovered in the state I described (33 shards missing).
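For reference, the multicast-to-unicast switch is made in `elasticsearch.yml` on each node; a sketch, with hypothetical host addresses:

```yaml
# elasticsearch.yml -- use a fixed unicast host list instead of multicast
discovery.zen.ping.multicast.enabled: false
# hypothetical addresses of a few stable nodes to ping on startup
discovery.zen.ping.unicast.hosts: ["10.1.16.151:9300", "10.1.16.152:9300"]
```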

Moreover, in the small-scale benchmarks I ran, I could put about 130M docs
per node and still get the response time I wanted, but with the same number
of documents per node in the full cluster I was nowhere near that response
time. So I'm wondering what is happening.

Sorry for the long rant, but I'm on the edge with my project right now and
don't know what to do anymore...

--

Hi,

Sounds fun! :wink:
Anything worth sharing in the logs (from any of those 40 servers, hah) from
before the exception?

You only mentioned indexing. Did search work after you indexed your 6B
docs?
Actually, you do mention search at the end, but only in the context of high
latency. It works without exceptions, right?

You say you increased replication after the initial indexing and then got
errors when trying to index some more. Have you also tried indexing some
more before increasing replication?

Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

On Monday, December 10, 2012 3:54:16 PM UTC-5, Jérôme Gagnon wrote:


--

Lots and lots of fun, yes! :smiley:

I did some searches and was able to get decent results (once the cache is
warmed up adequately, which takes some time with 40+ nodes), and it worked
without any exceptions.

The way I did it: I tried indexing without replicas, then got the
UnavailableShardsException like I already said. I saw the timeouts in the
logs, tried setting the replication factor to 1, then restarted the
cluster, and then the trouble came.
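The sequence described above (index with replicas off, then turn replication on, plus the refresh and merge settings from the first post) corresponds to two updates against the index settings API. A minimal sketch in Python using what I believe are the 0.20-era setting names; the index name and the HTTP call itself are omitted:

```python
import json

# During the initial load: no replicas, so only primaries are written.
LOAD_SETTINGS = {"index": {"number_of_replicas": 0}}

# After the load: 60s refresh, merge factor 10, and one replica.
# Turning replication on copies every shard once across the cluster,
# which is the slow step mentioned above.
POST_LOAD_SETTINGS = {
    "index": {
        "refresh_interval": "60s",
        "merge.policy.merge_factor": 10,
        "number_of_replicas": 1,
    }
}

def settings_body(settings):
    """Serialize a settings dict for PUT /{index}/_settings."""
    return json.dumps(settings)
```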

From the logs I got a lot of these between the servers (the ones whose
shards were reported missing):

org.elasticsearch.transport.ConnectTransportException: [es11b][inet[/10.1.16.153:9300]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:674)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:604)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:574)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:127)
    at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing$Receiver$1.run(MulticastZenPing.java:536)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:490)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:446)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:359)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    ... 3 more

On Monday, December 10, 2012 10:18:37 PM UTC-5, Otis Gospodnetic wrote:


--

Oh, also: I just did some more log digging and saw this error on one of the
machines:

2012-12-10 00:00:00,120][WARN ][cluster.action.shard ] [es54b] sending
failed shard for [type][7], node[ZOaEcxL5RsGsOFn3TovTxg], relocating
[0wpatztbQaOroGbOCAXyYA], [P], s[INITIALIZING], reason [Failed to create
shard, message [IndexShardCreationException[[type][7] failed to create
shard]; nested: IOException[directory
'/data/index/cluster_es1b/nodes/0/indices/name/7/index' exists and is a
directory, but cannot be listed: list() returned null]; ]]

That can't be good...

On Tuesday, December 11, 2012 9:12:36 AM UTC-5, Jérôme Gagnon wrote:


--

That almost sounds like a permissions issue. As the user that runs
elasticsearch, can you do
ls -l '/data/index/cluster_es1b/nodes/0/indices/name/7/index'
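In that log message, "list() returned null" is Java's `File.list()` failing, which typically means the directory exists but the process can't read or traverse it. The same condition can be checked with a small standalone sketch (Python here; the path is the one from the log and would be run as the elasticsearch user):

```python
import os

def can_list(path):
    """True if `path` is a directory the current user can actually
    enumerate -- the condition Java's File.list() needs to succeed
    (read permission to list entries, execute permission to traverse)."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)

# e.g.:
# can_list('/data/index/cluster_es1b/nodes/0/indices/name/7/index')
```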

Also, maybe I missed it, but what version of elasticsearch are you using?
Later versions (I think 0.19.10) fixed many of our large-cluster issues.

Thanks,
Andy

On Tuesday, December 11, 2012 9:18:22 AM UTC-5, Jérôme Gagnon wrote:


--

I'm running 0.20.0-RC1.

Many thanks.

I could do the ls, but I erased the index and started the bulk indexing
again. Next time it happens (if it happens), I'll be sure to check this; it
is possible I missed that (though I'm pretty sure I double/triple checked
it). It has happened to me before, so it's good to keep in mind.

On Tuesday, December 11, 2012 10:46:42 AM UTC-5, Andy Wick wrote:


--

But now that I think about it, the missing shards weren't on the filesystem
either, so I don't think it's a permission issue.

On Tuesday, December 11, 2012 10:54:25 AM UTC-5, Jérôme Gagnon wrote:


--

Unless you are using a 0.20 feature you might want to try 0.19.12. The
0.19.x series has become really stable for our large clusters. If 0.20 is
required there is a 0.20.1 now, although we haven't tried it yet.

Andy
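Checking which version each node is actually running (mixed versions across 40 machines cause exactly this kind of trouble) can be done against the HTTP root endpoint; a quick sketch, with placeholder hostnames:

```shell
# The root endpoint of each node reports its version string. es01..es03
# are placeholder hostnames; extend the list to cover all 40 nodes.
for host in es01 es02 es03; do
  echo "== $host =="
  curl -s "http://$host:9200/" | grep '"number"'
done
```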

On Tuesday, December 11, 2012 10:59:19 AM UTC-5, Jérôme Gagnon wrote:

But now that I think about it, the missing shards weren't on the
filesystem either... so I don't think it's a permission issue.

On Tuesday, December 11, 2012 10:54:25 AM UTC-5, Jérôme Gagnon wrote:

I'm running 0.20.0-RC1

Many thanks

I could do the ls... but I erased the index and started the bulk indexation
again. Next time it happens (if it happens), I'll be sure to check this; it's
possible that I missed it (though I'm pretty sure I double/triple checked).
It has happened to me before... so it's good to keep in mind.

On Tuesday, December 11, 2012 10:46:42 AM UTC-5, Andy Wick wrote:

That almost sounds like a permission issue. As the user that runs
elasticsearch, can you do:
ls -l '/data/index/cluster_es1b/nodes/0/indices/name/7/index'

Also, maybe I missed it, but what version of elasticsearch are you using?
Later versions (I think 0.19.10) fixed many of our large-cluster issues.

Thanks,
Andy
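A slightly fuller version of Andy's check, assuming it is run as the elasticsearch user; the path comes from the log message in this thread, and the parent-directory walk is an extra suggestion:

```shell
# Path taken from the "cannot be listed" log message in this thread.
DIR='/data/index/cluster_es1b/nodes/0/indices/name/7/index'

# Run this as the user that owns the elasticsearch process.
ls -l "$DIR" || echo "cannot list $DIR (check owner and mode)"

# Every parent directory also needs execute (x) permission for the ES
# user; this walks up the tree and prints each level's mode and owner.
d="$DIR"
while [ "$d" != "/" ]; do
  ls -ld "$d" || true
  d=$(dirname "$d")
done
```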

On Tuesday, December 11, 2012 9:18:22 AM UTC-5, Jérôme Gagnon wrote:

Oh, also: I just did some more log digging and saw this error on one of
the machines:

[2012-12-10 00:00:00,120][WARN ][cluster.action.shard ] [es54b] sending failed shard for [type][7], node[ZOaEcxL5RsGsOFn3TovTxg], relocating [0wpatztbQaOroGbOCAXyYA], [P], s[INITIALIZING], reason [Failed to create shard, message [IndexShardCreationException[[type][7] failed to create shard]; nested: IOException[directory '/data/index/cluster_es1b/nodes/0/indices/name/7/index' exists and is a directory, but cannot be listed: list() returned null]; ]]

That can't be good...

On Tuesday, December 11, 2012 9:12:36 AM UTC-5, Jérôme Gagnon wrote:

Lots and lots of fun, yes! :smiley:

I did some searches and was able to get decent results (once the cache
is warmed up adequately, which takes some time with 40+ nodes), and it
worked without any exceptions.

The way I did it: I tried indexing without replicas, then got
the UnavailableShardsException like I already mentioned. I saw the
timeout in the logs, set the replication factor to 1, and then restarted
the cluster; that's when the troubles came.

From the logs I got a lot of these between the servers (the ones whose
shards were reported missing):

org.elasticsearch.transport.ConnectTransportException: [es11b][inet[/10.1.16.153:9300]] connect_timeout[30s]
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:674)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:604)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:574)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:127)
at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing$Receiver$1.run(MulticastZenPing.java:536)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:490)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:446)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:359)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
... 3 more

--

I actually wanted to use warmup queries! I should update to 0.20.1 during
the day.
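The warmup queries referenced here are the index warmers added in 0.20; a minimal registration sketch, with the index name, warmer name, and sort field as placeholders:

```shell
# Register a warmer (a 0.20 feature): the search below runs against every
# new segment before it is exposed, so real queries hit warm caches.
# "myindex", "warmer_1", and "some_field" are placeholders.
curl -XPUT 'http://localhost:9200/myindex/_warmer/warmer_1' -d '
{ "query": { "match_all": {} }, "sort": [ "some_field" ] }'

# Confirm it is registered.
curl 'http://localhost:9200/myindex/_warmer?pretty'
```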

--

Updates: reindexed everything. I'm now able to index in realtime without
issue (I think there was a problem with my first bulk indexation, and I'm
also using unicast now).

The only thing left to figure out is the long query latency... I am between
30-60 seconds per query, so needless to say I am unsatisfied with that. I am
seeing lots and lots of IO wait on most of the machines.
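Since the switch to unicast discovery is what helped, this is roughly the elasticsearch.yml change involved; the host list below is a placeholder:

```shell
# Append unicast discovery settings to elasticsearch.yml (written to the
# current directory here for illustration). With multicast disabled,
# nodes only ping the listed hosts; es01..es03 are placeholders.
cat >> elasticsearch.yml <<'EOF'
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es01:9300", "es02:9300", "es03:9300"]
EOF

# Sanity-check what was written.
grep 'discovery.zen' elasticsearch.yml
```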

--

Hello Jerome,

If you get lots of IO maybe you can look at merging and see if that's the
cause of the pain.

If not, I would try to get the OS (and maybe ES as well if there's enough
memory) to do more caching.

Turning down the number of shards might help, if that's an option in your
case (especially since you can't do that on the fly, so you'd have to
reindex).

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene
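To see whether merging is behind the IO wait, the merge and segment stats Radu alludes to can be pulled per index; a sketch, with a placeholder index name:

```shell
# Merge activity for one index: current merges, total merge time, bytes
# merged. "myindex" is a placeholder.
curl 'http://localhost:9200/myindex/_stats?merge=true&pretty'

# Segment layout per shard; lots of small segments suggests merging
# cannot keep up and is a likely source of IO wait.
curl 'http://localhost:9200/myindex/_segments?pretty'
```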

--