Why are my bulk requests not using the CPUs of the other nodes?

Hi Team,

Below is my thread pool output. I don't know why only one node's CPU is being used for bulk requests.

curl -X GET "myhost:9200/_cat/thread_pool/?v"
node_name     name                active queue rejected
node1       bulk                    24  4522        0
node1       fetch_shard_started      0     0        0
node1       fetch_shard_store        0     0        0
node1       flush                    1     0        0
node1       force_merge              0     0        0
node1       generic                  0     0        0
node1       get                      0     0        0
node1       index                    0     0        0
node1       listener                 0     0        0
node1       management               1     0        0
node1       refresh                  2     0        0
node1       search                  21     0        0
node1       snapshot                 0     0        0
node1       warmer                   0     0        0
node2       bulk                     0     0        0
node2       fetch_shard_started      0     0        0
node2       fetch_shard_store        0     0        0
node2       flush                    0     0        0
node2       force_merge              0     0        0
node2       generic                  0     0        0
node2       get                      0     0        0
node2       index                    0     0        0
node2       listener                 0     0        0
node2       management               1     0        0
node2       refresh                  0     0        0
node2       search                   0     0        0
node2       snapshot                 0     0        0
node2       warmer                   0     0        0
node3       bulk                     3     0        0
node3       fetch_shard_started      0     0        0
node3       fetch_shard_store        0     0        0
node3       flush                    0     0        0
node3       force_merge              0     0        0
node3       generic                  0     0        0
node3       get                      0     0        0
node3       index                    0     0        0
node3       listener                 0     0        0
node3       management               1     0        0
node3       refresh                  0     0        0
node3       search                   0     0        0
node3       snapshot                 0     0        0
node3       warmer                   0     0        0

Are you sending all bulk requests to just one node? Are the nodes configured the same?

In the connection settings I have mentioned all nodes, so the client is making connections to all of them. All three nodes are master-eligible and there is no replication. By default the client uses a round-robin selector, so I think it should distribute the traffic across all nodes, but that is not happening, which is making my writes slow.

I am using the PHP SDK for the bulk writes.
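
Roughly, the connection is set up like this (a simplified sketch of my setup, assuming the official elasticsearch-php client; the hostnames and the sample document are placeholders):

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// All three nodes are listed, so the client should rotate between them.
$client = ClientBuilder::create()
    ->setHosts(['node1:9200', 'node2:9200', 'node3:9200'])
    ->build();

// A minimal bulk request: one metadata line followed by one document
// (hypothetical index/type/field; the real ones differ).
$params = ['body' => []];
$params['body'][] = ['index' => ['_index' => 'index2-201904', '_type' => '_doc']];
$params['body'][] = ['field' => 'value'];
$response = $client->bulk($params);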

Can you share the output of GET _cat/shards?

Here is my _cat/shards output:

index3-201903           4 p STARTED  11453505    3.9gb ip  node1
index3-201903           1 p STARTED  11453637      4gb ip  node2
index3-201903           2 p STARTED  11448352      4gb ip  node3
index3-201903           3 p STARTED  11444246      4gb ip  node2
index3-201903           0 p STARTED  11445181      4gb ip  node3
index7-201903           4 p STARTED   9324240    3.2gb ip  node1
index7-201903           1 p STARTED   9320367    3.1gb ip  node2
index7-201903           2 p STARTED   9323628    3.2gb ip  node1
index7-201903           3 p STARTED   9324646    3.2gb ip  node2
index7-201903           0 p STARTED   9319057    3.2gb ip  node3
index1-201906           1 p STARTED         0     230b ip  node3
index1-201906           4 p STARTED         0     230b ip  node2
index1-201906           2 p STARTED         0     230b ip  node2
index1-201906           3 p STARTED         0     230b ip  node1
index1-201906           0 p STARTED         0     230b ip  node3
index9-201904           1 p STARTED   5946930    2.4gb ip  node3
index9-201904           4 p STARTED   5946084    2.4gb ip  node1
index9-201904           2 p STARTED   5946722    2.5gb ip  node1
index9-201904           3 p STARTED   5945667    2.4gb ip  node2
index9-201904           0 p STARTED   5946815    2.5gb ip  node3
index2-201902           1 p STARTED  12096140    4.8gb ip  node1
index2-201902           4 p STARTED  12091361    4.6gb ip  node2
index2-201902           2 p STARTED  12091717    4.8gb ip  node2
index2-201902           3 p STARTED  12100938    4.8gb ip  node1
index2-201902           0 p STARTED  12091226    4.7gb ip  node3
index9-201906           1 p STARTED         0     230b ip  node1
index9-201906           4 p STARTED         0     230b ip  node2
index9-201906           2 p STARTED         0     230b ip  node3
index9-201906           3 p STARTED         0     230b ip  node1
index9-201906           0 p STARTED         0     230b ip  node3
index4-201903           1 p STARTED  17308612    5.4gb ip  node3
index4-201903           4 p STARTED  17298300    5.5gb ip  node2
index4-201903           2 p STARTED  17305067    5.2gb ip  node2
index4-201903           3 p STARTED  17296242    5.5gb ip  node1
index4-201903           0 p STARTED  17297445    5.2gb ip  node3
index6-201905           4 p STARTED   8241309    2.1gb ip  node1
index6-201905           1 p STARTED   8236951    2.3gb ip  node2
index6-201905           2 p STARTED   8236060    2.6gb ip  node3
index6-201905           3 p STARTED   8237484    2.1gb ip  node2
index6-201905           0 p STARTED   8238824    2.4gb ip  node3
index7-201905           1 p STARTED   3291973    1.6gb ip  node3
index7-201905           4 p STARTED   3287984    1.2gb ip  node2
index7-201905           2 p STARTED   3291085      1gb ip  node2
index7-201905           3 p STARTED   3293073      1gb ip  node1
index7-201905           0 p STARTED   3290959    1.2gb ip  node3
index4-201906           4 p STARTED         0     230b ip  node1
index4-201906           1 p STARTED         0     230b ip  node2
index4-201906           2 p STARTED         0     230b ip  node3
index4-201906           3 p STARTED         0     230b ip  node2
index4-201906           0 p STARTED         0     230b ip  node3
index9-201902           4 p STARTED   8548210    2.9gb ip  node1
index9-201902           1 p STARTED   8559473    2.9gb ip  node2
index9-201902           2 p STARTED   8554209    2.8gb ip  node3
index9-201902           3 p STARTED   8550218    2.8gb ip  node2
index9-201902           0 p STARTED   8555954    2.9gb ip  node3
index9-201905           1 p STARTED   2189115    1.3gb ip  node3
index9-201905           4 p STARTED   2186322  849.1mb ip  node2
index9-201905           2 p STARTED   2187710      1gb ip  node2
index9-201905           3 p STARTED   2187424  904.5mb ip  node1
index9-201905           0 p STARTED   2187248  969.9mb ip  node3
index2-201904           1 p STARTED  24682711   10.7gb ip  node1
index2-201904           4 p STARTED  24691660   10.7gb ip  node2
index2-201904           2 p STARTED  24688984   10.9gb ip  node3
index2-201904           3 p STARTED  24670928   10.8gb ip  node1
index2-201904           0 p STARTED  24690389     11gb ip  node3

Hmm ok that looks like a sensible spread of shards across the nodes.

Can you try sending traffic just to node2 and see if this moves the load away from node1 or not?
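
For example, something along these lines with the PHP client (just a sketch, assuming the usual ClientBuilder setup; the hostname is a placeholder):

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// List only node2, so every request from this client has to go through it.
$client = ClientBuilder::create()
    ->setHosts(['node2:9200'])
    ->build();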

Hi David,

Still, all the traffic is routing to just one node.

Does the PHP client connect to the nodes in the order they are specified, leading to all threads using the first node in the list?
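
If that is a concern, one way to rule it out is to set the selector explicitly (a sketch; round-robin is the documented default in elasticsearch-php, so this mainly removes any doubt about the configuration):

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;
use Elasticsearch\ConnectionPool\Selectors\RoundRobinSelector;

// Explicitly ask for round-robin selection across the configured hosts.
$client = ClientBuilder::create()
    ->setHosts(['node1:9200', 'node2:9200', 'node3:9200'])
    ->setSelector(RoundRobinSelector::class)
    ->build();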

Sure, but which node? If you're sending traffic to node2 is it still getting stuck on node1?

Yes, exactly. All traffic is routing to node1 only. I have seen some active threads on the other nodes while continuously hitting _cat/thread_pool, but most of the time node1 alone is receiving all the traffic. My indexing is running continuously.

node_name                name                active queue rejected
node1                  bulk                    24  6753        0
node1                  fetch_shard_started      0     0        0
node1                  fetch_shard_store        0     0        0
node1                  flush                    0     0        0
node1                  force_merge              0     0        0
node1                  generic                  0     0        0
node1                  get                      0     0        0
node1                  index                    0     0        0
node1                  listener                 0     0        0
node1                  management               2     0        0
node1                  refresh                  2     0        0
node1                  search                   0     0        0
node1                  snapshot                 0     0        0
node1                  warmer                   0     0        0
node2                  bulk                     3     0        0
node2                  fetch_shard_started      0     0        0
node2                  fetch_shard_store        0     0        0
node2                  flush                    0     0        0
node2                  force_merge              0     0        0
node2                  generic                  0     0        0
node2                  get                      0     0        0
node2                  index                    0     0        0
node2                  listener                 0     0        0
node2                  management               1     0        0
node2                  refresh                  0     0        0
node2                  search                   0     0        0
node2                  snapshot                 0     0        0
node2                  warmer                   0     0        0
node3                  bulk                    10     0        0
node3                  fetch_shard_started      0     0        0
node3                  fetch_shard_store        0     0        0
node3                  flush                    0     0        0
node3                  force_merge              0     0        0
node3                  generic                  0     0        0
node3                  get                      0     0        0
node3                  index                    0     0        0
node3                  listener                 0     0        0
node3                  management               2     0        0
node3                  refresh                  0     0        0
node3                  search                   0     0        0
node3                  snapshot                 0     0        0
node3                  warmer                   0     0        0

I suspect there's something wrong with node1 causing it to process traffic much slower than the other two nodes.

Can you run the following command:

GET /_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total

Then do some indexing for a while and finally run the same command again:

GET /_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total

This will tell us whether that node is really seeing more traffic than the other two.
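
If it is easier to script than to eyeball, the same comparison can be done from the PHP client (a sketch; the hostnames and the sleep interval are placeholders):

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()
    ->setHosts(['node1:9200', 'node2:9200', 'node3:9200'])
    ->build();

// Collect index_total per node name from the nodes stats API.
function indexTotals($client) {
    $totals = [];
    foreach ($client->nodes()->stats()['nodes'] as $node) {
        $totals[$node['name']] = $node['indices']['indexing']['index_total'];
    }
    return $totals;
}

$before = indexTotals($client);
sleep(300);                       // keep indexing in the meantime
$after  = indexTotals($client);

// The difference is how many documents each node indexed in that window.
foreach ($after as $name => $total) {
    echo $name, ': ', $total - $before[$name], PHP_EOL;
}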

Hi David,
Here is the response:

curl -XGET "myhost:9200/_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total&pretty"
{
  "nodes" : {
	"oUIhmmUZRL-adjPXWQjx5Q" : {
	  "name" : "node3",
	  "indices" : {
	    "indexing" : {
	      "index_total" : 4345928590
	    }
	  }
	},
	"9VMMk-kRRjWKDKB92XGTbA" : {
	  "name" : "node1",
	  "indices" : {
	    "indexing" : {
	      "index_total" : 123063502
	    }
	  }
	},
	"ybo_Txv9RICBosQ4QflRqw" : {
	  "name" : "node2",
	  "indices" : {
	    "indexing" : {
	      "index_total" : 4006695200
	    }
	  }
	}
  }
}

The numbers for node2 and node3 are higher than for node1. I guess this is the total indexing count since node start-up, and I only added node1 recently (2 days ago), which might be the reason for its lower index_total.

That's why I asked you to run that command twice.

I'm sorry David, my bad. Here is the second response:

	curl -XGET "myhost:9200/_nodes/stats?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total&pretty"
	{
		"nodes" : {
			"oUIhmmUZRL-adjPXWQjx5Q" : {
			  "name" : "node3",
			  "indices" : {
			    "indexing" : {
			      "index_total" : 4348430999
			    }
			  }
			},
			"9VMMk-kRRjWKDKB92XGTbA" : {
			  "name" : "node1",
			  "indices" : {
			    "indexing" : {
			      "index_total" : 125461151
			    }
			  }
			},
			"ybo_Txv9RICBosQ4QflRqw" : {
			  "name" : "node2",
			  "indices" : {
			    "indexing" : {
			      "index_total" : 4009182561
			    }
			  }
			}
		}
	}

Thanks, now we look at the differences:

node1:  125461151 -  123063502 = 2397649
node2: 4009182561 - 4006695200 = 2487361
node3: 4348430999 - 4345928590 = 2502409

So it looks like node1 is actually handling slightly less traffic than the other two nodes. This does suggest that there's something different about that node.

Thanks David,

That makes sense, but whenever I look at the thread pool it shows that only node1 is active, with requests in the active column.

As per our analysis of the indexing stats, the indexing rate seems to be equal on all three nodes. But I am not sure about the thread pools, because my monitoring graphs also show the bulk threads being heavily used on node1 only.

Also, one important point:

The load averages of the three nodes are very different:

node1: 46.33, 43.81, 42.94
node2: 8.38, 8.40, 8.41
node3: 2.88, 3.35, 3.63

I have also made sure that no other service is running on node1.

Also, whenever I increase my bulk write threads, it only increases the thread pool queue on node1.

That is consistent with node1 processing indexing requests much slower than the other two nodes.

That could be because node1 has a much slower disk than the other two nodes.
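
If you want to compare disk activity between the nodes, the fs section of the nodes stats is one place to look (a rough sketch with the PHP client; io_stats is only reported on Linux, and the hostnames are placeholders):

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()
    ->setHosts(['node1:9200', 'node2:9200', 'node3:9200'])
    ->build();

// Print cumulative disk I/O operation counts per node, where the OS exposes them.
foreach ($client->nodes()->stats()['nodes'] as $node) {
    $io = isset($node['fs']['io_stats']['total']) ? $node['fs']['io_stats']['total'] : null;
    echo $node['name'], ': ',
        $io ? "read_ops={$io['read_operations']} write_ops={$io['write_operations']}" : 'no io_stats reported',
        PHP_EOL;
}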

Thanks David for the explanation.

Hmm, it might be. But I am running all three nodes on SATA disks. I know SSD is recommended. So what kind of disk slowness do you think could be possible?

I don't know, sorry, I'm not really in a position to diagnose performance issues in your IO subsystem.

Okay David, no problem. Thanks for your valuable answer :slight_smile: