We've really been enjoying elasticsearch, after some unpleasantness we'd had
with Sphinx. At any rate, we're continuously inserting
batches of documents through a job queue, using the python bindings
(https://github.com/aparo/pyes), and after a couple of hours, the cluster
reliably falls apart. We've got 6 m1.larges running the thing with 300
shards (we want to have room to grow if need be).
I've noticed that it periodically complains about the number of open file
descriptors, despite the limit being set to 64k on each machine. From the
logs, it seems that eventually shard recovery fails, and then it goes
downhill from there. Restarting the whole cluster brings it back into the
green, but I have a feeling that's not meant to be a regular ops task.
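One thing I've been using to sanity-check the 64k setting is reading the limit the running process actually has, since a ulimit raised in a login shell doesn't always reach a daemonized JVM. A rough sketch (Linux-only; pass elasticsearch's PID, "self" just inspects the current process):

```python
# Check the 'Max open files' limit a process actually has via /proc,
# and count how many descriptors it currently holds.
import os

def max_open_files(pid="self"):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits."""
    with open("/proc/%s/limits" % pid) as f:
        for line in f:
            if line.startswith("Max open files"):
                fields = line.split()
                # line layout: Max open files  <soft>  <hard>  files
                return int(fields[3]), int(fields[4])
    raise RuntimeError("'Max open files' not found for pid %s" % pid)

soft, hard = max_open_files()
print("limit:", soft, "in use:", len(os.listdir("/proc/self/fd")))
```

If the soft limit printed for the elasticsearch PID isn't the 64k we configured, that would explain the complaints regardless of what the shell reports.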
When inserting documents, we're using the bulk API and then waiting for
green status. At least, that's what we're requesting of the python API;
I'm not entirely convinced there isn't something going on in that library,
but I haven't been able to track anything obvious down in there.
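For what it's worth, the round-trip we intend can be reproduced against the REST API directly, which might help rule pyes in or out. A minimal sketch, assuming stock endpoints (the host and index names below are placeholders, not our actual setup):

```python
# Sketch: POST a batch to _bulk, then block until the cluster reports green.
import json
import urllib.request

def bulk_body(index, doc_type, docs):
    """Build the newline-delimited JSON the _bulk endpoint expects:
    an action line followed by a source line for each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the trailing newline is required

def bulk_then_wait(host, index, doc_type, docs):
    # Placeholder host; send the batch, then poll cluster health, which
    # returns once the requested status is reached or the timeout elapses.
    req = urllib.request.Request(
        "http://%s/_bulk" % host,
        data=bulk_body(index, doc_type, docs).encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"})
    urllib.request.urlopen(req).read()
    urllib.request.urlopen(
        "http://%s/_cluster/health?wait_for_status=green&timeout=30s" % host
    ).read()
```

If driving the cluster this way stays healthy where the pyes path doesn't, that would point at the library rather than the cluster.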
Gist with mapping, health, machine
Is this symptomatic of a common problem? Or a known problem? I imagine it's
something I've not set up correctly :-/