How to find the bottleneck

Hi guys,
I was doing some performance testing on my ES 1.7.1 cluster and found something interesting.
The cluster consists of:
- 5 data nodes (no HTTP, can't be master)
- 3 master nodes
- 2 client nodes (can't be master, no data)

I'm pushing events from 20 machines with the transport client (not using the bulk API), and ES takes a long time to ingest them (roughly 1000 ms per message).

There are about 100 active indices (the ones we push data to), and each of them has 5 shards.
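With this many shards, one thing that may be worth checking is how the primary shards are spread across the data nodes. Below is a minimal sketch, assuming the default column layout of the `_cat/shards` output (index, shard, prirep, state, docs, store, ip, node). The function name `primaries_per_node` and the sample text are made up for illustration, not taken from my cluster.

```python
from collections import Counter


def primaries_per_node(cat_shards_text):
    """Count started primary shards per node from `_cat/shards` text output."""
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Unassigned shards have no docs/store/ip/node columns, so skip short rows.
        if len(fields) < 8:
            continue
        prirep, state = fields[2], fields[3]
        if prirep == "p" and state == "STARTED":
            # Node names may contain spaces, so join everything after the ip column.
            counts[" ".join(fields[7:])] += 1
    return counts


# Hypothetical sample in the assumed column layout.
sample = """\
events-a 0 p STARTED 1200 1.1mb 10.0.1.33 es-test-03
events-a 1 p STARTED 1180 1.0mb 10.0.1.33 es-test-03
events-a 2 p STARTED 1210 1.1mb 10.0.1.31 es-test-01
events-a 2 r STARTED 1210 1.1mb 10.0.1.32 es-test-02
events-a 3 p UNASSIGNED
"""

counts = primaries_per_node(sample)
for node, n in counts.most_common():
    print(node, n)
```

If one node held far more primaries of the active indices than the others, that could be one possible explanation for uneven indexing load, though it would not prove it.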

When I check the index thread pool, I can see that mostly one node is doing the indexing:
{
"host": "es-test-02",
"ip": "10.0.1.32",
"index.size": "32",
"index.active": "0",
"index.queue": "0",
"index.largest": "32"
},
{
"host": "es-test-03",
"ip": "10.0.1.33",
"index.size": "32",
"index.active": "32",
"index.queue": "490",
"index.largest": "32"
},
{
"host": "es-test-04",
"ip": "10.0.1.34",
"index.size": "32",
"index.active": "0",
"index.queue": "0",
"index.largest": "32"
},
{
"host": "es-test-01",
"ip": "10.0.1.31",
"index.size": "32",
"index.active": "0",
"index.queue": "2",
"index.largest": "32"
},
{
"host": "es-test-05",
"ip": "10.0.1.35",
"index.size": "32",
"index.active": "0",
"index.queue": "0",
"index.largest": "32"
}
In the documentation I found that the transport client should round-robin requests by default.
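To put a number on the imbalance in the thread pool output above, here is a rough sketch that sums active and queued index tasks per node and reports the busiest node's share. The helper name `indexing_skew` is my own, and treating "active + queue" as the load measure is an assumption, not an official metric.

```python
def indexing_skew(rows):
    """Return (host, share) for the node with the most active + queued index tasks."""
    # The thread pool values arrive as strings, so convert them before summing.
    load = {r["host"]: int(r["index.active"]) + int(r["index.queue"]) for r in rows}
    total = sum(load.values())
    if total == 0:
        return None, 0.0
    host = max(load, key=load.get)
    return host, load[host] / total


# Values copied from the thread pool output above.
rows = [
    {"host": "es-test-02", "index.active": "0", "index.queue": "0"},
    {"host": "es-test-03", "index.active": "32", "index.queue": "490"},
    {"host": "es-test-04", "index.active": "0", "index.queue": "0"},
    {"host": "es-test-01", "index.active": "0", "index.queue": "2"},
    {"host": "es-test-05", "index.active": "0", "index.queue": "0"},
]

host, share = indexing_skew(rows)
print(host, round(share, 3))
```

With the numbers above, es-test-03 holds 522 of the 524 active or queued index tasks.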

Also, the hot threads output for this node shows that ES is spending more than 90% of its time in Lucene merge threads.

I can see much higher CPU usage and longer young-generation GC times on node-03, but everything else seems normal: the disks are not hitting their I/O limits, the network is not hitting its maximum throughput, etc.

What else should I monitor to find out why the responses take so long and why only one node does all the indexing?
Thanks