I am running a "staging" 6-node cluster which, for now, only indexes
our realtime data, a duplicate of what we index in production:
roughly 10-15M small documents per day, stored in "daily" indices.
Not much search load is put on this cluster yet; it is mainly
indexing data through the bulk api.
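
For context, an indexing call looks roughly like the sketch below
(Python with the requests library; the host, the "events" index naming
pattern, and the document fields are made-up placeholders, not our
actual ones):

    import datetime
    import json
    import requests

    def bulk_index(docs, host="http://localhost:9200"):
        # Daily index named after the current UTC date, e.g. events-2011.12.01
        index = "events-%s" % datetime.datetime.utcnow().strftime("%Y.%m.%d")
        lines = []
        for doc in docs:
            # Bulk body: one action line, then the document source,
            # newline-delimited
            lines.append(json.dumps({"index": {"_index": index, "_type": "event"}}))
            lines.append(json.dumps(doc))
        body = "\n".join(lines) + "\n"  # bulk body must end with a newline
        resp = requests.post("%s/_bulk" % host, data=body)
        resp.raise_for_status()
        return resp.json()

    print(bulk_index([{"message": "hello", "ts": "2011-12-01T00:00:01Z"}]))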
The cluster is running 0.18.4 on EC2/Ubuntu 10.10, using OpenJDK
Runtime Environment (IcedTea6 1.9.10) (6b20-1.9.10-0ubuntu1~10.10.2).
On this cluster all nodes are http+data, and I set up nginx as a
reverse proxy to load balance all ES http queries round-robin across
the six nodes.
Now, the problem I am seeing is a yellow/red cluster health status
starting right after the new index is created at midnight. After that
the cluster gets "stuck" with relocating_shards and
initializing_shards that never complete. To resolve this I have to
identify the node holding the "stuck" shard and restart it, which
clears the problem.
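
For what it's worth, something like the sketch below is one way to
spot the stuck shards from the cluster health api. I'm assuming the
level=shards response shape from the current docs, so 0.18 may differ
slightly; to see which node actually holds a given shard, the
_cluster/state routing table lists the assignments.

    import requests

    def show_stuck_shards(host="http://localhost:9200"):
        health = requests.get("%s/_cluster/health" % host,
                              params={"level": "shards"}).json()
        print("status=%s relocating=%s initializing=%s" % (
            health["status"],
            health["relocating_shards"],
            health["initializing_shards"]))
        # With level=shards the response nests per-index, per-shard health
        for index, idx in health.get("indices", {}).items():
            for shard_id, shard in idx.get("shards", {}).items():
                if shard.get("status") != "green":
                    print(index, shard_id, shard)

    show_stuck_shards()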
Here's a theory: my daily indices are created implicitly when the
first document dated on the new day is indexed; I am not explicitly
calling the create index api. Could there be some race condition in
the implicit index creation when all nodes receive documents for a
new, nonexistent index at almost the same time? (Since I am load
balancing all my http requests across all the nodes in my cluster.)
I tried pre-creating the next index and the problem did not show up.
I don't have much to back this theory, and no error logs show up to
help diagnose the problem.
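
Pre-creating is just a PUT on the index name ahead of midnight, e.g.
from a cron job; a sketch using the same hypothetical naming as above:

    import datetime
    import requests

    def precreate_tomorrows_index(host="http://localhost:9200"):
        tomorrow = datetime.datetime.utcnow() + datetime.timedelta(days=1)
        index = "events-%s" % tomorrow.strftime("%Y.%m.%d")
        # PUT on the index name creates it; if it already exists ES
        # answers with an error that is harmless to ignore here
        resp = requests.put("%s/%s" % (host, index))
        print(index, resp.status_code, resp.text)

    precreate_tomorrows_index()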
One thing I did this morning is add an extra http-only node, which I
will use only to handle my indexing queries. Search queries will
still be load balanced across the 6 data nodes. This is pretty much
the setup we currently have in production, which has been working
well so far. I figured funnelling all indexing queries through a
single node might not be prone to my theoretical race condition.
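
In case it's useful to someone, my understanding is that such an
http-only node is just an ordinary node whose elasticsearch.yml tells
it to hold no data, something like (a sketch; check the docs for your
version):

    # elasticsearch.yml on the extra node
    node.data: false      # join the cluster and serve HTTP, but never hold shards
    node.master: false    # optionally keep it out of master elections too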