Cluster crash on query

We had a 0.19.11 cluster in production for a few weeks. One of our devs
rolled out some new code and the cluster came down hard: it first
complained of missing shards, then appeared to recover once I began
manually reindexing data.

No migrations were included in the deploy, only some new queries (which
unfortunately I don't have with me now, but I can add tomorrow).

The logs from the three nodes are at:

https://s3.amazonaws.com/99designs-elasticsearch-logs/ip-10-29-24-75.log
https://s3.amazonaws.com/99designs-elasticsearch-logs/ip-10-64-38-196.log
https://s3.amazonaws.com/99designs-elasticsearch-logs/ip-10-85-75-6.log

Any insight would be welcome. I've replaced the cluster with one running
0.20.1 for now, but I have no guarantee that the underlying issue is solved.

Richo

--

Hello Richo,

From the logs, I think your nodes simply got too busy, and as a result
they couldn't see each other. So I think you need to do one of the following:

  • optimize the performance on what you already have (if that's possible)
  • add more nodes
  • use bigger nodes

To provide more help, one would need some more information, such as:

  • how many nodes do you have, and how big are they?
  • what's the ES configuration, especially around discovery?
  • how many indices and shards do you have, and how much data is in them?
    What does the data look like (mapping)?
  • what do the new queries look like?
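Most of these can be read straight off the cluster's HTTP API; here is a rough sketch of the relevant endpoints (paths as of the 0.19/0.20-era REST API; the host, port, and index name "my_index" are placeholders):

```python
# Sketch: REST endpoints that answer the questions above.
# "localhost:9200" and "my_index" are placeholders; adjust for your cluster.
base = "http://localhost:9200"

endpoints = [
    ("node count and sizes",       "/_cluster/nodes?pretty=true"),
    ("cluster and shard health",   "/_cluster/health?pretty=true"),
    ("shard allocation / routing", "/_cluster/state?pretty=true"),
    ("index mapping",              "/my_index/_mapping?pretty=true"),
]

# Print a ready-to-run curl command for each question.
for what, path in endpoints:
    print(f"{what}: curl {base}{path}")
```

Pasting the output of these (plus your elasticsearch.yml) into the thread would cover most of the questions above.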

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sun, Dec 16, 2012 at 1:00 PM, Richo Healey healey.rich@gmail.com wrote:


--

Hi Radu,

I can certainly add more/larger nodes, but that doesn't seem to be the
issue: from what I can see the cluster didn't fall over under load, it
actually seems to have lost track of its shards.

I spun up an old cluster this morning and, with it completely idle, sent
it the query that broke this one. It immediately went red and started
throwing the same errors.

The cluster in question is 3x m1.large instances, using the cloud-aws
plugin for discovery, fetching nodes from the security group (there are
only 3 nodes). There is only one index, with 5 shards and 2 replicas.
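For reference, the discovery setup described above would look roughly like this in elasticsearch.yml (setting names are from the cloud-aws plugin of that era; the credentials and group name are placeholders):

```yaml
# Hypothetical sketch of security-group-based EC2 discovery via the
# cloud-aws plugin. All values below are placeholders.
cloud:
  aws:
    access_key: AKIA...              # placeholder
    secret_key: ...                  # placeholder
discovery:
  type: ec2
  ec2:
    groups: my-es-security-group     # nodes are discovered via this security group
```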

I'll dump the mapping and the query in a moment; there are about 180k
records.
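For what it's worth, the shard layout described above means every node carries a full copy of the index; a quick sketch of the arithmetic (numbers taken from this thread):

```python
# Shard arithmetic for the setup described in this thread:
# 1 index, 5 primary shards, 2 replicas, 3 nodes.
primaries = 5
replicas = 2
nodes = 3

shard_copies = primaries * (1 + replicas)  # 5 primaries + 10 replicas = 15
per_node = shard_copies // nodes           # 5 shard copies per node when balanced

print(shard_copies, per_node)  # 15 5
```

Since each shard exists in three copies spread over three nodes (Elasticsearch won't allocate two copies of the same shard to one node), every node ends up holding a complete copy of the index, so a pathological query can hit data on all nodes at once.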

On Monday, 17 December 2012 23:45:50 UTC+11, Radu Gheorghe wrote:
