Hi Clinton, yeah, we are having some problems here. We narrowed it down to
some nasty queries that the frontend started to execute. Those queries (some
count queries) were fine on their own, but the moment their numbers spiked we
started seeing nodes where the search threadpool count jumped to several
thousand. Such a node would then get removed from the cluster (I'm assuming
it became unresponsive), and after that it's a snowball: shards start to be
relocated, IO gets high, another node's search threadpool goes crazy high,
that node gets dropped from the cluster too, and we are left with only 2
nodes. And since we have zen.discovery = 3, most of the time we are then left
without a master (I don't know how bad that is)... final result: whole
cluster restart.
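In case it helps anyone else reading this, below is a rough sketch of the
kind of check I'm thinking of scripting so we spot the spike before a node
drops out. It assumes the HTTP API is on localhost:9200 and that our version
of the nodes stats API exposes thread_pool.search via ?thread_pool=true (the
exact URL/flags differ between versions), and the threshold is just a
placeholder, so treat it as a sketch rather than anything we actually run:

# Rough sketch: poll the Elasticsearch nodes stats API and warn when any
# node's search threadpool queue or rejection count starts climbing.
# Assumptions: HTTP API on localhost:9200, and ?thread_pool=true is how our
# ES version enables the thread_pool section -- adjust for yours.
import json
import time
import urllib.request

ES_URL = "http://localhost:9200/_nodes/stats?thread_pool=true"
QUEUE_WARN_THRESHOLD = 1000  # placeholder; we saw it reach several thousand

def check_search_threadpool():
    with urllib.request.urlopen(ES_URL) as resp:
        stats = json.load(resp)
    for node_id, node in stats.get("nodes", {}).items():
        search = node.get("thread_pool", {}).get("search", {})
        queue = search.get("queue", 0)
        rejected = search.get("rejected", 0)
        if queue > QUEUE_WARN_THRESHOLD or rejected > 0:
            print("WARN %s: search queue=%d rejected=%d"
                  % (node.get("name", node_id), queue, rejected))

if __name__ == "__main__":
    while True:
        check_search_threadpool()
        time.sleep(10)

The numbers are arbitrary; the point is just to get per-node queue/rejected
figures in front of us before the node becomes unresponsive.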
We got some ES consulting, which was amazing, and we are now looking into a
production subscription, but we really need to get this a bit more stable
first. I'm 100% sure the blame is on us; we are simply running bad queries.
I only wish I had the time to look into it, but as in most companies,
management only cares about design when all hell breaks loose, like now.
Suddenly every manager wants to buy a new machine or says we should dedicate
more time to designing queries, but until now it was just: "Can you deliver
this feature this afternoon?"
Regards
On Tuesday, March 19, 2013 7:49:06 AM UTC-4, Clinton Gormley wrote:
Hiya
> Today we have a 5 node ES cluster (each node has 64GB RAM, 12 cores),
> our index is around 96GB spread across 12 shards. We have replica = 1,
> and each ES instance is set to have 31GB of Xmx.
You're getting outages with an index of only 96GB, on those machines?
That surprises me. I'd be looking for what is causing the outages,
rather than changing cluster size.
> 64 medium nodes on Amazon, each with 7.5GB RAM, increase our shard
> numbers to a large number like 64, so each shard would have around
> 1.5GB only, and set replica to 2 so each node would host 3 shards
> only. And the rationale behind that would be: 3 x 1.5GB = 4.5GB, which
> is around what we expect to have as filesystem cache memory reserved,
> 4GB for ES, and leave 3.5GB for the OS.
Dividing such a small index into 64 shards will probably skew the term
frequency distribution, which may result in unexpected results.
Also, the smaller the instance on EC2, the poorer the IO throughput, the
noisier the neighbours etc.
Again, I'd try to figure out why you're getting outages - it is probably
quite easy to solve.
clint