Hey,
So we had a cluster of 3 nodes: 2x 32GB high-mem/high-CPU instances on EC2 plus a
smaller one. No matter what we did, it would eventually run out of memory and
lock up. After a week of not sleeping, keeping that cluster up with our bare
hands, dealing with failed shards, etc., we decided to rebuild it from scratch,
add routing, limit time ranges to the nearest hour, and reindex everything.
It's now running on a single 64GB node, basically the biggest EC2 instance. It
does the exact same thing: it slowly builds up heap until it reaches the maximum
allocated, and then it locks up. It doesn't respond to a shutdown; I always have
to kill -9 it, fix the indexes with the Lucene index checker, and restart it.
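For reference, the recovery routine is roughly the following; the pid file and
data paths are just how my install is laid out, and the lucene-core jar is the
one bundled in the ES lib/ directory:

  kill -9 $(cat /var/run/elasticsearch.pid)
  # run the Lucene checker over every shard's index directory
  for dir in /var/data/elasticsearch/*/nodes/0/indices/*/*/index; do
      java -cp lib/lucene-core-*.jar org.apache.lucene.index.CheckIndex "$dir" -fix
  done
  bin/elasticsearch -p /var/run/elasticsearch.pid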
The config is roughly:

index:
  store:
    type: mmapfs
    fs:
      mmapfs:
        enabled: true
  cache:
    field:
      type: soft
      expire: 30s
      max_size: 1000
  refresh_interval: 60s
bootstrap:
  mlockall: true
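In case it's relevant, the node is started roughly like this; the heap value is
a placeholder, and ES_MIN_MEM/ES_MAX_MEM are just the variables picked up by
bin/elasticsearch.in.sh in my version:

  ulimit -l unlimited              # without this, bootstrap.mlockall fails silently
  export ES_MIN_MEM=16g
  export ES_MAX_MEM=16g
  bin/elasticsearch -p /var/run/elasticsearch.pid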
I've also attached a stack trace from when it was fully locked up (JSTACK) and
one from right now (JSTACK2). It looks like it spends most of its time indexing,
but we are not indexing that many documents, maybe 20 a second. On a machine
that size, that's nothing.
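For what it's worth, the dumps were taken roughly like this (the pid lookup is
just whatever pgrep finds):

  PID=$(pgrep -f elasticsearch)
  jstack -F $PID > JSTACK          # -F because the locked-up JVM ignores a plain jstack
  jstack $PID > JSTACK2            # the "now" dump, taken while it still responds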
The cluster is 3 indexes, two with 32 shards and one with 64. Index sizes right
now are 100M, 600M, and 900M (since we were reindexing). It doesn't take long to
run out of memory (it shows no error, it just locks up), which means I basically
have to stare at bigdesk and restart it when it gets close.
Any help would be greatly appreciated. It's been more than a week of
absolute hell.
--