So we're implementing ElasticSearch in a few production systems, and we've
run into a showstopper of a bug. Here's the setup:
- Cluster of 5 servers.
- ~500 gigs of data per server (n+1 redundancy for all indexes).
- ~3-4 indexes.
- Unicast clustering.
- 16 gigs of ram per box, ~60% allocated to Java heap.
- No swapping/memory issues.
After an indeterminate amount of time, running a query like so:
Will return a certain number of results, say, 123,456. However, if you run
the exact same query on the same server a second time, the result count
(and data set) will be entirely different, e.g. 122,222. Run it again, and
you get the first result set. It will alternate indefinitely until a full
cluster restart is done. A few things I have noticed:
- A server dropping out/going offline may or may not trigger this.
- It can also happen without any server going offline.
- The query itself does not matter; results will alternate no matter what.
- Calling a _flush on an index does not fix this.
- It can happen to one index at one moment, and not another, but
eventually happens to all of them.
- The alternating results only happen on a single cluster member, not