Large heap usage with each node

So now i'm at at 352 shards over 44 indexes and I still see 12GB of heap usage per node. I'm thinking the only way I can really reach passed 1 billion docs on the 4 nodes is to add extra node per physical machine since I have the horse power., But keep 4 shards + 1 replica config.

Btw i did some tests with my daily averages and I can even use single shard per index. But would that affect parallelism? I can also maybe provide yourkit snaphshot of what is eating up the ram...