Stability problems while indexing with 0.18.4

Hi,

I am running a "staging" 6-node cluster which, for now, only indexes
our realtime data, a duplicate of what we index in production:
roughly 10-15M small documents per day, stored in "daily" indices.
There is not much search load on this cluster yet; it is almost
exclusively indexing data through the bulk API.

The cluster is running 0.18.4 on EC2/Ubuntu 10.10, using OpenJDK
Runtime Environment (IcedTea6 1.9.10) (6b20-1.9.10-0ubuntu1~10.10.2),
Java version "1.6.0_20".

On this cluster all nodes are http+data, and I set up nginx as a
reverse proxy to load-balance all ES HTTP queries round-robin across
the cluster.
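
For reference, the relevant part of the nginx config looks roughly
like this (a minimal sketch; the upstream hostnames and listen port
are placeholders, not the actual setup):

    # Goes inside the http { } block of nginx.conf. nginx balances
    # across upstream servers in round robin by default.
    upstream elasticsearch {
        server es-node1:9200;
        server es-node2:9200;
        server es-node3:9200;
        server es-node4:9200;
        server es-node5:9200;
        server es-node6:9200;
    }

    server {
        listen 8080;
        location / {
            proxy_pass http://elasticsearch;
        }
    }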

Now, the problem I am seeing is a yellow/red cluster health status,
starting right after the new index is created at midnight. After
that, the cluster gets "stuck" with non-zero relocating_shards and
initializing_shards counts. To recover, I have to identify the node
holding the "stuck" shard and restart it; that solves the problem.
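
For what it's worth, this is roughly how I spot the stuck state (a
minimal sketch, assuming a node is reachable on localhost:9200; the
polling interval is arbitrary):

    # Poll the cluster health API and report the shard counters.
    import json
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:9200/_cluster/health"

    while True:
        with urllib.request.urlopen(HEALTH_URL) as resp:
            health = json.load(resp)
        print("status=%s relocating=%d initializing=%d" % (
            health["status"],
            health["relocating_shards"],
            health["initializing_shards"]))
        # If shards sit in relocating/initializing indefinitely,
        # that is the "stuck" state described above.
        time.sleep(30)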

Here's a theory: my daily indices are implicitly created when the
first document dated on the new day is indexed; I am not explicitly
calling the create index API. Could there be a race condition in the
implicit index creation when all nodes receive documents for a new,
nonexistent index at almost the same time? (Since I am load balancing
all my HTTP requests across all nodes of the cluster, that is exactly
what happens at midnight.)

I tried pre-creating the next index and the problem did not show up.
I don't have much else to back this theory, and no errors show up in
the logs to help diagnose the problem.
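
For the pre-creation test I just ran something like this ahead of
midnight (a minimal sketch; the "events-YYYY.MM.DD" index name is a
stand-in for our actual naming scheme):

    # Pre-create tomorrow's daily index so midnight traffic never
    # hits a nonexistent index.
    import datetime
    import urllib.request

    tomorrow = datetime.date.today() + datetime.timedelta(days=1)
    index_name = "events-%s" % tomorrow.strftime("%Y.%m.%d")

    # The create index API is a PUT on the index name.
    req = urllib.request.Request(
        "http://localhost:9200/%s" % index_name, method="PUT")
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read().decode())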

One thing I did this morning is add an extra HTTP-only node, which I
will use solely to handle my indexing queries; the search queries
will stay load-balanced across the 6 data nodes. This is pretty much
the setup we currently have in production, which has been working
well so far. I figured funnelling all indexing queries through a
single node should not be prone to my theoretical race condition.
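
The extra node is just a regular node with data storage disabled in
its elasticsearch.yml, something along these lines (a sketch; I
believe node.data is the relevant setting on 0.18, with everything
else left at the defaults):

    # elasticsearch.yml for the HTTP-only node: it joins the cluster
    # and coordinates requests, but holds no shards of its own.
    node.data: false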

Any thoughts?

Thanks,
Colin

On Mon, Nov 28, 2011 at 4:29 PM, Ævar Arnfjörð Bjarmason
avarab@gmail.com wrote:

That sounds like it could cause problems.

Why are you load balancing it in the first place instead of having
the clients round-robin?

It's just a layering strategy, so that the client does not have to
deal with doing round robin: changes in the topology do not require a
client update, I simply deal with it in the nginx config.

I'm experiencing a similar issue on 0.18.4.

I also have daily indexes of ~40 million documents each, with 16
shards/2 replicas per index, across 8 nodes. (That is 48 shard copies
per index, so 7-8 indexes means 336-384 shard copies, roughly 40-50
per node.)

The first couple of indexes work fine for both indexing and
searching, but once I get to ~7-8 indexes things start to go funny: I
normally find that 1-3 nodes have their CPU pegged at 100%.

Each node is a VM running CentOS 5.6 x64 with 4 GB RAM and 4 vCPUs
(on VMware).

Any ideas? Should I not have so many indexes?

thanks

It's perfectly fine to have a load balancer in front of elasticsearch
nodes. When you execute the bulk API against the cluster and an index
needs to be created as a result of it, it will automatically execute
the create index API and, once that's done, execute the bulk API. The
create index API always executes on the elected master in the
cluster.
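
In other words, a bulk request like this against an index that does
not exist yet (a minimal sketch; the host, index, and type names are
placeholders) will first go through the create index flow on the
master and then apply the bulk items:

    # A bulk request into a nonexistent index: elasticsearch creates
    # the index on the elected master before indexing the documents.
    import urllib.request

    # Bulk body: newline-delimited JSON, an action line followed by
    # a source line, with a trailing newline.
    bulk_body = (
        '{"index":{"_index":"events-2011.11.30","_type":"event"}}\n'
        '{"message":"first document of the new day"}\n'
    )

    req = urllib.request.Request(
        "http://localhost:9200/_bulk",
        data=bulk_body.encode("utf-8"),
        method="POST")
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())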

It's strange that it hangs; I would love to see the log on the master
node. I just ran a 6-node cluster and executed bulk indexing
concurrently on each node into a nonexistent index, and it worked
well.

How easy is it to recreate? Understanding what's going on is going to
be a bit tricky. If you have time, I can create a debug version of
0.18 and we can try and figure it out online on IRC.

-shay.banon

I am going to spend some time trying to reproduce the problem in a
more systematic way. Once I have that, we'll get in touch to figure
out what's happening, hopefully later today.
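
Roughly, the test will look like this (a sketch; the node addresses
and index/type names are made up): fire bulk requests at all nodes
simultaneously into an index that does not exist yet, then watch the
cluster health.

    # Concurrently bulk-index into a nonexistent index from all nodes
    # at once, to try to trigger the implicit-creation race.
    import threading
    import urllib.request

    NODES = ["es-node%d:9200" % i for i in range(1, 7)]
    BULK_BODY = (
        '{"index":{"_index":"race-test","_type":"doc"}}\n'
        '{"field":"value"}\n'
    ).encode("utf-8")

    def bulk(node):
        req = urllib.request.Request(
            "http://%s/_bulk" % node, data=BULK_BODY, method="POST")
        urllib.request.urlopen(req).read()

    threads = [threading.Thread(target=bulk, args=(n,)) for n in NODES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()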

Do you suggest I upgrade to 0.18.5 before starting my tests?

Thanks,
Colin

Yeah, if you can; providing support based on 0.18.5 will be simpler.

Damn it, I can't reproduce the problem. I will restart the
environment exactly as it was set up previously, but with the DEBUG
logging level. If/when it fails, we should have a better view of
what's happening.

Colin
