Large heap usage with each node

That's the problem, then. Each shard is a Lucene instance, and each one requires resources to maintain.

Reduce the shard count to a reasonable number and you should see better resource usage.
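
To see how shards are spread across nodes before reducing them, something like the following might help. It is only a rough sketch and assumes the cluster is Elasticsearch with the `_cat/shards` API reachable over HTTP. The function names, the `base_url` parameter, and the sample rows are made up for illustration.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_shards(base_url):
    """Fetch one row per shard copy from _cat/shards as JSON.

    base_url is an assumption, e.g. "http://localhost:9200".
    """
    with urlopen(f"{base_url}/_cat/shards?format=json") as resp:
        return json.load(resp)


def shards_per_node(rows):
    """Count shard copies per node, skipping rows with no node (unassigned)."""
    return Counter(row["node"] for row in rows if row.get("node"))


# Hypothetical sample rows, shaped like _cat/shards?format=json output.
sample = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-a"},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-b"},
    {"index": "logs-2", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-a"},
    {"index": "logs-2", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
]
counts = shards_per_node(sample)
for node, count in counts.most_common():
    print(node, count)
```

Against a live cluster you would call `shards_per_node(fetch_shards("http://localhost:9200"))` instead of using the sample. Primaries and replicas are counted together, since each copy is its own Lucene instance on the node that holds it.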