Node.close() gets stuck in the 'stopping' state


(Jacob Perkins) #1

We're using hadoop + the java bulk api to index our data using a one-
hop strategy. This has worked out great so far, see
http://github.com/infochimps/wonderdog. There's one major issue I'm
trying to understand:

After using an elasticsearch 'node' object (embedded into each hadoop
map task), we need to call its 'close' method. While this seems
straightforward, here's the problem:

  1. About 40% of the time this works great. The shutdown procedure is a
    finite state machine that looks like the following:

(stopping -> stopped -> closing -> closed).

Only when the node object goes through that entire procedure (thereby
severing all ties to elasticsearch) does the hadoop task commit and
complete successfully.

  2. The other 60% of the time the node gets stuck in the (stopping)
    state, which ultimately causes the task (which completed
    successfully, mind you) to time out and fail.

Now, in the current hackety version, we never call close on the 'node'
object itself. Instead we close the node's client (the thing actually
using the open connection, afaik). However, this is essentially a
meaningless operation, since the 'node' maintains a persistent
connection. The result is 'rogue' hadoop processes that trick
elasticsearch into thinking there are many more 'nodes' than there
actually are. When enough rogue processes accumulate, this causes a
'too many open files' error.
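For context, here's roughly what the lifecycle looks like with the
0.x-era java api (a sketch, not our actual code; the class name and the
client-only settings are illustrative):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class BulkIndexTask {
    public void run() {
        // Start an embedded client-only node; it joins the cluster and
        // holds persistent transport connections (and file handles).
        Node node = nodeBuilder().client(true).node();
        Client client = node.client();
        try {
            // ... build and execute bulk requests via client.prepareBulk() ...
        } finally {
            // Closing only the Client is the "meaningless operation" above:
            // the Node's connections stay open and the process looks like a
            // live cluster member. node.close() is what actually drives the
            // stopping -> stopped -> closing -> closed state machine.
            client.close();
            node.close();
        }
    }
}
```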

What is the node doing during the stopping phase and how can I tell
what's causing it to hang?

--jacob
@thedatachef


(Shay Banon) #2

Is there a chance you can gist a thread dump of the task when it's stuck? It will help to see exactly where it's stuck. Which version are you using?
On Sunday, February 27, 2011 at 5:16 PM, Jacob Perkins wrote:



(system) #3