Change threadpool queue size for batch process

I want to run a batch process, for example sending 100 million queries one after another. However, Elasticsearch starts rejecting requests at some point. Can I just change the thread pool search queue size to -1 (unbounded)?
If I can, do I have to restart the cluster and change the setting on each node I have?

Why not just queue up queries at the application layer? If Elasticsearch is rejecting requests, it is generally for a good reason. Increasing the queue size in Elasticsearch will just result in increased memory usage and longer latencies as the size of the queue does not affect the query throughput (unless all the additional memory used actually slows it down).
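Queueing at the application layer could look roughly like the sketch below: a fixed number of workers pull from a shared list of pending tasks, so only `limit` requests are ever in flight. The helper name `runWithConcurrency` and the stand-in tasks are hypothetical and not part of any Elasticsearch client; in practice each task would wrap one search request.

```javascript
// Hypothetical helper: run async tasks with at most `limit` in flight.
// `tasks` is an array of functions that each return a Promise.
async function runWithConcurrency(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next task to start

  // Each worker pulls the next pending task until none are left.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

// Usage sketch with stand-in tasks instead of real search requests.
const demoTasks = [1, 2, 3, 4, 5].map((n) => () =>
  new Promise((resolve) => setTimeout(() => resolve(n * 2), 10))
);

runWithConcurrency(demoTasks, 2).then((results) => console.log(results));
```

Keeping `limit` well below the search queue size should avoid rejections, and the pending work waits in the application rather than on the Elasticsearch heap.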

Hi Christian,

Thanks for your reply. As you suggested, I now queue requests on my server side. However, when I set up the cluster, I found that adding more nodes increased the time needed to run a batch of 10 thousand queries.
The node settings are all defaults except for the node role and the discovery IP. I expected the time to decrease linearly as I added more nodes to the cluster.

I checked the CPU utilization of each node and found that it stays almost the same even while I am running these queries.

Here is my setup: 43 GB of data, 4 shards, no replicas, 4 cores, 16 GB RAM, virtual machine.

I am using Elasticsearch for my thesis, so I would really appreciate your suggestions on this.

Zhengcong

How many shards are you querying? How many nodes do you have in the cluster? How many parallel queries are you running?

My cluster always has 3 master nodes and 1 client node. The number of data nodes varies from one to four so that I can make the comparison.

I have 4 shards in total. Right now I have 4 dedicated data nodes, so each data node holds only one shard.

I just send these queries in a for loop and use a callback to record the time at which the 10000th response comes back.

How many of those shards are primary shards? Have you tried sending queries in parallel? As each shard is processed using a single thread per query, you will at most (assuming all shards are primary shards) use 4 cores (your number of shards) if you send all queries sequentially. It may therefore be that your setup is not able to benefit from the greater parallelism that a larger cluster can provide.

Thanks for your patience.

These four are all primary shards, no replica.

I checked the documentation. In my case, each node can handle 4 (cores) * 1.5 = 6 threads at the same time, so with 4 nodes it should be 6 * 4 = 24. Am I right?

Do I need to change anything regarding the client node in order to benefit from these 4 data nodes?
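As a rough check on the thread count estimate above, the sketch below works through the arithmetic. It assumes the default search thread pool size documented for many Elasticsearch versions, int((number of processors * 3) / 2) + 1; that formula is an assumption here and should be verified against the documentation for the version in use. The function name `searchPoolSize` is hypothetical.

```javascript
// Assumed default: search pool size = int((processors * 3) / 2) + 1.
// Verify this against the docs for your Elasticsearch version.
function searchPoolSize(processors) {
  return Math.floor((processors * 3) / 2) + 1;
}

const coresPerNode = 4; // from the setup described in this thread
const dataNodes = 4;    // largest cluster used in the comparison

const perNode = searchPoolSize(coresPerNode);
console.log(perNode);             // 7 search threads per node
console.log(perNode * dataNodes); // 28 search threads across the data nodes
```

Under that assumption the pool is slightly larger than 6 per node, but the extra threads are only used when enough shard-level requests are in flight at the same time.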

There is no such thing as an unbounded queue; instead, you will eventually run out of heap space. That is, all queues are at least bounded by the heap space available to queue requests. Put differently: using an unbounded queue is dangerous, and some day the places where "unbounded" queues are used within Elasticsearch, along with the ability to set a queue as "unbounded", will be removed.

Here is my code for sending out these requests:
I send out 5000 requests:

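const request = require('request');

// `urls` is assumed to be loaded elsewhere, one query URL per entry.
const total = 4999; // the loop below sends urls[1] through urls[4999]
let processedCount = 0;
const timeStart = Date.now() / 1000;

for (let i = 1; i <= total; i++) {
  // Strip stray carriage returns left over from reading the URL list.
  request.get(encodeURI(urls[i].split('\r').join('')), (error, response, body) => {
    processedCount += 1;
    console.log(processedCount);
    if (error) {
      // Without this check, JSON.parse would throw on a failed request.
      console.error(error.message);
    } else {
      console.log(JSON.parse(body).hits.hits);
    }
    // The last callback to arrive reports the total elapsed time in seconds.
    if (processedCount === total) {
      console.log((Date.now() / 1000 - timeStart).toString());
    }
  });
}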
