Hi Team,
Below is the output of my thread pools. I don't know why only one node's CPU is being used for bulk requests.
curl -X GET "myhost:9200/_cat/thread_pool/?v"
node_name name active queue rejected
node1 bulk 24 4522 0
node1 fetch_shard_started 0 0 0
node1 fetch_shard_store 0 0 0
node1 flush 1 0 0
node1 force_merge 0 0 0
node1 generic 0 0 0
node1 get 0 0 0
node1 index 0 0 0
node1 listener 0 0 0
node1 management 1 0 0
node1 refresh 2 0 0
node1 search 21 0 0
node1 snapshot 0 0 0
node1 warmer 0 0 0
node2 bulk 0 0 0
node2 fetch_shard_started 0 0 0
node2 fetch_shard_store 0 0 0
node2 flush 0 0 0
node2 force_merge 0 0 0
node2 generic 0 0 0
node2 get 0 0 0
node2 index 0 0 0
node2 listener 0 0 0
node2 management 1 0 0
node2 refresh 0 0 0
node2 search 0 0 0
node2 snapshot 0 0 0
node2 warmer 0 0 0
node3 bulk 3 0 0
node3 fetch_shard_started 0 0 0
node3 fetch_shard_store 0 0 0
node3 flush 0 0 0
node3 force_merge 0 0 0
node3 generic 0 0 0
node3 get 0 0 0
node3 index 0 0 0
node3 listener 0 0 0
node3 management 1 0 0
node3 refresh 0 0 0
node3 search 0 0 0
node3 snapshot 0 0 0
node3 warmer 0 0 0
Are you sending all bulk requests to just one node? Are the nodes configured the same?
In the connection settings I have listed all the nodes, so the client is making connections to all of them. All three nodes are master-eligible, and the indices have no replicas. By default the client uses a round-robin selector, so I would expect it to distribute the traffic across all nodes, but that is not happening, which is making my writes slow.
I am using the PHP SDK for bulk writes.
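One way to double-check, from the cluster side, that the client really is opening connections to all three nodes is to look at the per-node HTTP stats, something like this (assuming the REST API is reachable the same way as in your other curl commands):
# Per-node HTTP connection counts; if only node1's total_opened keeps growing,
# the client is effectively talking to that node alone
curl -X GET "myhost:9200/_nodes/stats/http?filter_path=nodes.*.name,nodes.*.http&pretty"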
Can you share the output of GET _cat/shards?
Here is my _cat/shards output:
index3-201903 4 p STARTED 11453505 3.9gb ip node1
index3-201903 1 p STARTED 11453637 4gb ip node2
index3-201903 2 p STARTED 11448352 4gb ip node3
index3-201903 3 p STARTED 11444246 4gb ip node2
index3-201903 0 p STARTED 11445181 4gb ip node3
index7-201903 4 p STARTED 9324240 3.2gb ip node1
index7-201903 1 p STARTED 9320367 3.1gb ip node2
index7-201903 2 p STARTED 9323628 3.2gb ip node1
index7-201903 3 p STARTED 9324646 3.2gb ip node2
index7-201903 0 p STARTED 9319057 3.2gb ip node3
index1-201906 1 p STARTED 0 230b ip node3
index1-201906 4 p STARTED 0 230b ip node2
index1-201906 2 p STARTED 0 230b ip node2
index1-201906 3 p STARTED 0 230b ip node1
index1-201906 0 p STARTED 0 230b ip node3
index9-201904 1 p STARTED 5946930 2.4gb ip node3
index9-201904 4 p STARTED 5946084 2.4gb ip node1
index9-201904 2 p STARTED 5946722 2.5gb ip node1
index9-201904 3 p STARTED 5945667 2.4gb ip node2
index9-201904 0 p STARTED 5946815 2.5gb ip node3
index2-201902 1 p STARTED 12096140 4.8gb ip node1
index2-201902 4 p STARTED 12091361 4.6gb ip node2
index2-201902 2 p STARTED 12091717 4.8gb ip node2
index2-201902 3 p STARTED 12100938 4.8gb ip node1
index2-201902 0 p STARTED 12091226 4.7gb ip node3
index9-201906 1 p STARTED 0 230b ip node1
index9-201906 4 p STARTED 0 230b ip node2
index9-201906 2 p STARTED 0 230b ip node3
index9-201906 3 p STARTED 0 230b ip node1
index9-201906 0 p STARTED 0 230b ip node3
index4-201903 1 p STARTED 17308612 5.4gb ip node3
index4-201903 4 p STARTED 17298300 5.5gb ip node2
index4-201903 2 p STARTED 17305067 5.2gb ip node2
index4-201903 3 p STARTED 17296242 5.5gb ip node1
index4-201903 0 p STARTED 17297445 5.2gb ip node3
index6-201905 4 p STARTED 8241309 2.1gb ip node1
index6-201905 1 p STARTED 8236951 2.3gb ip node2
index6-201905 2 p STARTED 8236060 2.6gb ip node3
index6-201905 3 p STARTED 8237484 2.1gb ip node2
index6-201905 0 p STARTED 8238824 2.4gb ip node3
index7-201905 1 p STARTED 3291973 1.6gb ip node3
index7-201905 4 p STARTED 3287984 1.2gb ip node2
index7-201905 2 p STARTED 3291085 1gb ip node2
index7-201905 3 p STARTED 3293073 1gb ip node1
index7-201905 0 p STARTED 3290959 1.2gb ip node3
index4-201906 4 p STARTED 0 230b ip node1
index4-201906 1 p STARTED 0 230b ip node2
index4-201906 2 p STARTED 0 230b ip node3
index4-201906 3 p STARTED 0 230b ip node2
index4-201906 0 p STARTED 0 230b ip node3
index9-201902 4 p STARTED 8548210 2.9gb ip node1
index9-201902 1 p STARTED 8559473 2.9gb ip node2
index9-201902 2 p STARTED 8554209 2.8gb ip node3
index9-201902 3 p STARTED 8550218 2.8gb ip node2
index9-201902 0 p STARTED 8555954 2.9gb ip node3
index9-201905 1 p STARTED 2189115 1.3gb ip node3
index9-201905 4 p STARTED 2186322 849.1mb ip node2
index9-201905 2 p STARTED 2187710 1gb ip node2
index9-201905 3 p STARTED 2187424 904.5mb ip node1
index9-201905 0 p STARTED 2187248 969.9mb ip node3
index2-201904 1 p STARTED 24682711 10.7gb ip node1
index2-201904 4 p STARTED 24691660 10.7gb ip node2
index2-201904 2 p STARTED 24688984 10.9gb ip node3
index2-201904 3 p STARTED 24670928 10.8gb ip node1
index2-201904 0 p STARTED 24690389 11gb ip node3
Hmm, ok, that looks like a sensible spread of shards across the nodes.
Can you try sending traffic just to node2 and see whether this moves the load away from node1 or not?
Hi David,
Still, all the traffic is routing to one node only.
Does the PHP client connect to the nodes in the order they are specified, leading to all threads using the first node in the list?
Sure, but which node? If you're sending traffic to node2, is it still getting stuck on node1?
Yes, exactly. All traffic is routing to node1 only. I have seen some active threads elsewhere while continuously hitting _cat/thread_pool, but most of the time node1 alone is receiving all the traffic. My writes are running continuously. (The command I am polling with is at the end of this post.)
node_name name active queue rejected
node1 bulk 24 6753 0
node1 fetch_shard_started 0 0 0
node1 fetch_shard_store 0 0 0
node1 flush 0 0 0
node1 force_merge 0 0 0
node1 generic 0 0 0
node1 get 0 0 0
node1 index 0 0 0
node1 listener 0 0 0
node1 management 2 0 0
node1 refresh 2 0 0
node1 search 0 0 0
node1 snapshot 0 0 0
node1 warmer 0 0 0
node2 bulk 3 0 0
node2 fetch_shard_started 0 0 0
node2 fetch_shard_store 0 0 0
node2 flush 0 0 0
node2 force_merge 0 0 0
node2 generic 0 0 0
node2 get 0 0 0
node2 index 0 0 0
node2 listener 0 0 0
node2 management 1 0 0
node2 refresh 0 0 0
node2 search 0 0 0
node2 snapshot 0 0 0
node2 warmer 0 0 0
node3 bulk 10 0 0
node3 fetch_shard_started 0 0 0
node3 fetch_shard_store 0 0 0
node3 flush 0 0 0
node3 force_merge 0 0 0
node3 generic 0 0 0
node3 get 0 0 0
node3 index 0 0 0
node3 listener 0 0 0
node3 management 2 0 0
node3 refresh 0 0 0
node3 search 0 0 0
node3 snapshot 0 0 0
node3 warmer 0 0 0
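For reference, this is roughly how I am polling the bulk pool (assuming watch is available on the box):
# Refresh the bulk thread pool stats every 2 seconds, one line per node
watch -n 2 'curl -s "myhost:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected"'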
I suspect there's something wrong with node1 causing it to process traffic much slower than the other two nodes.
Can you run the following command:
GET /_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total
Then do some indexing for a while and finally run the same command again:
GET /_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total
This will tell us whether that node is really seeing more traffic than the other two.
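If it helps, here is a rough sketch of how you could capture the two snapshots and compute the per-node difference automatically (assuming jq and standard coreutils are available; the sleep is just a placeholder for however long you keep indexing):
# Hypothetical helper: prints one "name index_total" line per node, sorted by name
snapshot() {
  curl -s "myhost:9200/_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total" |
    jq -r '.nodes[] | "\(.name) \(.indices.indexing.index_total)"' | sort
}
snapshot > before.txt
sleep 300   # let some indexing happen in between
snapshot > after.txt
# Per-node documents indexed during the interval
join before.txt after.txt | awk '{print $1, $3 - $2}'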
Hi David,
Here is the response:
curl -XGET "myhost:9200/_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total&pretty"
{
"nodes" : {
"oUIhmmUZRL-adjPXWQjx5Q" : {
"name" : "node3",
"indices" : {
"indexing" : {
"index_total" : 4345928590
}
}
},
"9VMMk-kRRjWKDKB92XGTbA" : {
"name" : "node1",
"indices" : {
"indexing" : {
"index_total" : 123063502
}
}
},
"ybo_Txv9RICBosQ4QflRqw" : {
"name" : "node2",
"indices" : {
"indexing" : {
"index_total" : 4006695200
}
}
}
}
}
The numbers for node2 and node3 are higher than for node1. I guess this is the total indexing count since node startup, and I only added node1 recently (2 days ago), which might be the reason for its lower index_total.
That's why I asked you to run that command twice.
I'm sorry David, my bad. Here is the second response:
curl -XGET "myhost:9200/_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total&pretty"
{
"nodes" : {
"oUIhmmUZRL-adjPXWQjx5Q" : {
"name" : "node3",
"indices" : {
"indexing" : {
"index_total" : 4348430999
}
}
},
"9VMMk-kRRjWKDKB92XGTbA" : {
"name" : "node1",
"indices" : {
"indexing" : {
"index_total" : 125461151
}
}
},
"ybo_Txv9RICBosQ4QflRqw" : {
"name" : "node2",
"indices" : {
"indexing" : {
"index_total" : 4009182561
}
}
}
}
}
Thanks, now we look at the differences:
node1: 125461151 - 123063502 = 2397649
node2: 4009182561 - 4006695200 = 2487361
node3: 4348430999 - 4345928590 = 2502409
So it looks like node1 is actually handling slightly less traffic than the other two nodes. This does suggest that there's something different about that node.
Thanks David,
That makes sense, but whenever I look at the thread_pool it shows only node1 as active, with some requests in its active column.
As per our indexing analysis, the indexing rate seems roughly equal on all three nodes. But I am not sure about the thread pools, because my monitoring graph also shows that the bulk threads are heavily used on node1 only.
Also, one important point: the load averages of the three nodes are very different:
node1: 46.33, 43.81, 42.94
node2: 8.38, 8.40, 8.41
node3: 2.88, 3.35, 3.63
I have also made sure there is no other service running on node1.
Also, whenever I increase my bulk write threads, it increases the queue number in the thread_pool of node1 only.
That is consistent with node1 processing indexing requests much slower than the other two nodes. That could be because node1 has a much slower disk than the other two nodes.
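If you want to rule that in or out yourself, a quick comparison would be to run something like the following on each node while the indexing load is on (assuming the sysstat package is installed) and compare the await and %util columns between node1 and the others:
# Extended per-device IO stats, sampled every 5 seconds, 3 reports
iostat -x 5 3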
Thanks David for the explanation.
Hmm, it might be. But I am running all three nodes on SATA disks. I know SSD is recommended. So what kind of disk slowness could be possible, in your view?
I don't know, sorry, I'm not really in a position to diagnose performance issues in your IO subsystem.
Okay David, no problem. Thanks for your valuable answer.