Node.close() gets stuck in the 'stopping' state


(Jacob Perkins) #1

We're using hadoop + the java bulk api to index our data using a one-
hop strategy. This has worked out great so far, see
http://github.com/infochimps/wonderdog. There's one major issue I'm
trying to understand:

After using an elasticsearch 'node' object (embedded into each hadoop
map task), we need to call its 'close' method. While this seems
straightforward, here's the problem:

  1. About 40% of the time this works great. The shutdown procedure is a
    finite state machine that looks like the following:

(stopping -> stopped -> closing -> closed).

Only when the node object goes through that entire procedure (thereby
severing all ties to elasticsearch) does the hadoop task commit and
complete successfully.

  2. The other 60% of the time the node gets stuck in the (stopping)
    state, which ultimately causes the task (which completed
    successfully, mind you) to time out and fail.

Now, in the current hackety version, we never call close on the 'node'
object itself. Instead we close the node's client (the thing actually
using the open connection, afaik). However, this is essentially a
meaningless operation, since the 'node' maintains a persistent
connection. The result is 'rogue' hadoop processes that trick
elasticsearch into thinking there are many more 'nodes' than there
actually are. When enough rogue processes accumulate, this causes a
'too many open files' error.
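For context, here's roughly what the lifecycle looks like with the
0.x-era java api (a sketch, not our actual code; the class name and the
client-only settings are illustrative):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class BulkIndexTask {
    public void run() {
        // Start an embedded client-only node; it joins the cluster and
        // holds persistent transport connections (and file handles).
        Node node = nodeBuilder().client(true).node();
        Client client = node.client();
        try {
            // ... build and execute bulk requests via client.prepareBulk() ...
        } finally {
            // Closing only the Client is the "meaningless operation" above:
            // the Node's connections stay open and the process looks like a
            // live cluster member. node.close() is what actually drives the
            // stopping -> stopped -> closing -> closed state machine.
            client.close();
            node.close();
        }
    }
}
```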

What is the node doing during the stopping phase and how can I tell
what's causing it to hang?

--jacob
@thedatachef


(Shay Banon) #2

Is there a chance you can gist a thread dump of the task when it's stuck? It will help to see exactly where it's stuck. Which version are you using?
On Sunday, February 27, 2011 at 5:16 PM, Jacob Perkins wrote:



(system) #3