I have a single node on a 32GB RAM machine, running ES 5.6.2, which is collecting a reasonable volume of nginx logs. It's just about working well enough and we are in the process of moving over to a 3-node cluster with up-to-date software.
In the meantime, there is one report that I really need to be able to run: a unique count (cardinality) aggregation across 30 days of data.
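For reference, the request is roughly of the shape sketched below. This is an illustration rather than my exact query: the field names (`@timestamp`, `clientip.keyword`), the aggregation name `unique_values`, and the `precision_threshold` value are all placeholders I've assumed for the example.

```python
import json

# Assumed field names for illustration; the real nginx mapping may differ.
TIMESTAMP_FIELD = "@timestamp"
UNIQUE_FIELD = "clientip.keyword"


def unique_count_body(days, precision_threshold=3000):
    """Build a search body counting distinct values over the last `days` full days."""
    return {
        "size": 0,  # aggregation only, no hits returned
        "query": {
            "range": {
                TIMESTAMP_FIELD: {"gte": f"now-{days}d/d", "lt": "now/d"}
            }
        },
        "aggs": {
            "unique_values": {
                "cardinality": {
                    "field": UNIQUE_FIELD,
                    # Trades memory for accuracy; the value here is an assumption.
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


# Print the JSON body that would be sent to the _search endpoint.
print(json.dumps(unique_count_body(30), indent=2))
```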
I can just about get it to run, usually by running it over 1 day, then 7 days, 14 days, 21 days and finally 30 days (does that even make sense?)... but more often than not it still falls over with the classic OOM error.
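To make that "widening window" routine concrete, here is a minimal sketch of the time filters I step through. The `@timestamp` field name and the date-math rounding (`/d`) are assumptions for the example; only the list of window sizes comes from what I actually do.

```python
# Window sizes, in days, that I run in order before attempting the full report.
WINDOWS_IN_DAYS = [1, 7, 14, 21, 30]


def window_filter(days, timestamp_field="@timestamp"):
    """Return a range filter covering the last `days` full days (field name assumed)."""
    return {"range": {timestamp_field: {"gte": f"now-{days}d/d", "lt": "now/d"}}}


# One filter per step; each would be combined with the same aggregation.
filters = [window_filter(days) for days in WINDOWS_IN_DAYS]

for days, flt in zip(WINDOWS_IN_DAYS, filters):
    print(days, flt)
```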
Sometimes it just shows a timeout error, but normally it full-on crashes and I have to manually start ES again.
Extensive googling led me to increase the heap space from 2GB to 12GB. In fact, it was only that change which made it possible to run the report at all. I tried it at 50% of RAM, which is 16GB, but that seemed less stable, as I believe Logstash is also using plenty of RAM.
Honestly, I'm mostly frustrated at not finding any hints, docs or info about what to do with the OOM error other than increase the heap space. What's next? What should I be reading? What debugging can I do to understand why this query crashes ES? Is there anything I can tweak just to get by for a couple of weeks?