Indefinite hang when adding nodes?


(Derek Wollenstein) #1
I'm not sure if I'm doing things wrong, but I seem to be
encountering some problems adding registered addresses to a
TransportClient. The ScheduledConnectNodeSampler class inside of
TransportClientNodesService.java would block indefinitely trying to
refresh data from a given node. If that node happened to be dead at
the time, the entire client would come to a halt. I've modified this
to take a timeout in milliseconds to use when retrieving nodesInfo and
gotten this to run more smoothly, but I don't know if this indicates
that I'm doing something wrong. Thanks for your help

--Derek
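
For readers following along, the kind of workaround Derek describes (putting a millisecond timeout around a call that may block) can be sketched with plain JDK classes. This is only an illustration under assumptions: `BoundedCall` and `callWithTimeout` are hypothetical names, and this is not the code in TransportClientNodesService.java or Derek's actual patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Elasticsearch: runs a blocking call on a
// worker thread and gives up after timeoutMillis instead of waiting forever.
final class BoundedCall {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true); // a stuck call must not keep the JVM alive
        return t;
    });

    /** Returns the call's result, or fallback if the call fails or times out. */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the caller moves on
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return fallback;
        }
    }
}
```

A sampler loop could wrap each per-node refresh in `callWithTimeout(...)` so that one dead node costs at most the timeout rather than stalling the whole client.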


(Eric Mill) #2

I wonder if this would explain the strange hanging behavior I've been seeing
intermittently when trying to index documents. I can make it go away by
blowing away the data repo and starting over, and it may not reoccur for a
long time. The next time it happens, I'll start investigating the node
endpoints to see if it's happening while a node is dead.

-- Eric



(Shay Banon) #4

It's not really indefinite. I am not sure why it hangs, but eventually it will fail on a closed socket. Or, if it tries to connect to a broken socket, on some operating systems it takes time to detect that there is nothing to connect to (by default, the connect timeout is 30s). Do you really see it hang indefinitely?
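
To make the connect-timeout point concrete, here is a small JDK-only sketch of a connection attempt that is bounded in time. `ConnectProbe` and `canConnect` are hypothetical names used for illustration; this is not how the TransportClient connects internally.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical probe, not Elasticsearch code: shows a bounded TCP connect.
final class ConnectProbe {
    /**
     * True if a TCP connection to host:port succeeds within timeoutMillis.
     * Note that a timeout of 0 means "wait without limit" for Socket.connect.
     */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Timeout, connection refused, unresolved host: not reachable right now.
            return false;
        }
    }
}
```

For example, `ConnectProbe.canConnect("host1", 9300, 1000)` would give up on an unresponsive host after about one second.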



(Shay Banon) #5

I did not fully understand your problem...




(Derek Wollenstein) #6

Shay -

When I say "indefinite", what I mean is that I wrote some code
that does the following (without the rest of the class):

    Settings settings = ImmutableSettings.settingsBuilder()
            .put("cluster.name", clusterName)
            .put(NetworkService.TcpSettings.TCP_CONNECT_TIMEOUT, 1)
            .build();
    TransportClient indexer = new TransportClient(settings);
    String[] hosts = new String[] {"host1", "host2" /* ,... */, "hostN"};

    for (int i = 0; i < hosts.length; i++) {
        LOG.info(" host : " + hosts[i]);
        indexer = indexer.addTransportAddress(
                new InetSocketTransportAddress(hosts[i], 9300));
    }

If I run this I'll see the output
host: host1
host: host2
[... pause for several minutes ...]
And then I'll abort this with Ctrl+C. This runs fine when
connectivity is good, but if there's any pause at all it seems to
hang forever. Nothing improved until I modified
ScheduledConnectNodeSampler to use an internal timeout.



(Shay Banon) #7

When you set it to 1, it means 1 millisecond. Can you try setting it to "1s"? Also, when you see this hang, can you take a thread dump so we can see where it's stuck (gist it, please)?
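
The difference between `1` and `"1s"` is easier to see with a toy parser. The sketch below is a simplified stand-in written for this thread, not Elasticsearch's own time-value parsing; `TimeoutSetting` and `toMillis` are hypothetical names, and only the rule stated above (a bare number means milliseconds) is taken from the discussion.

```java
import java.util.Locale;

// Simplified, hypothetical stand-in for a time-value setting parser.
final class TimeoutSetting {
    /** Parses "250", "250ms", "1s" or "2m" into milliseconds; a bare number is milliseconds. */
    static long toMillis(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.endsWith("ms")) { // must be checked before the plain "s" suffix
            return Long.parseLong(v.substring(0, v.length() - 2).trim());
        }
        if (v.endsWith("s")) {
            return Long.parseLong(v.substring(0, v.length() - 1).trim()) * 1000L;
        }
        if (v.endsWith("m")) {
            return Long.parseLong(v.substring(0, v.length() - 1).trim()) * 60_000L;
        }
        return Long.parseLong(v); // no unit: milliseconds
    }
}
```

Under these rules a connect timeout of `1` leaves one millisecond to complete a TCP handshake, which is rarely enough over a real network, while `"1s"` leaves a full second.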



(Derek Wollenstein) #8

Shay-
I just got this to reproduce using default settings. This
basically is what happens when a previously connected node disappears
when using a local gateway and unicast discovery. To check whether
the same lock was still being waited on, I ran jstack, slept, and
ran jstack again.

I've put the apparently hung thread in
https://gist.github.com/1026062

Actually, I'll go ahead and update this to be the complete thread
dump.

Threads "elasticsearch[cached]-pool-1-thread-80",
"elasticsearch[cached]-pool-1-thread-79",
"elasticsearch[cached]-pool-1-thread-78", and at least one more are
exhibiting this problem.
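
The "jstack, sleep, jstack again" comparison can also be done from inside the JVM. The following is a rough sketch using the standard `java.lang.management` API; `StuckThreadFinder` is a hypothetical name, and like the manual comparison it is only a heuristic (a thread could wake up and wait on the same lock again between the two snapshots).

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper: compares two in-process thread dumps.
final class StuckThreadFinder {
    /** Maps thread id to "name waiting on lock" for every thread currently waiting on a lock. */
    static Map<Long, String> snapshot() {
        Map<Long, String> waiting = new HashMap<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) continue;
            LockInfo lock = info.getLockInfo(); // null when the thread is not waiting on a lock
            if (lock != null) {
                waiting.put(info.getThreadId(), info.getThreadName() + " waiting on " + lock);
            }
        }
        return waiting;
    }

    /** Threads that wait on the same lock in two snapshots taken pauseMillis apart. */
    static List<String> stuckThreads(long pauseMillis) {
        Map<Long, String> first = snapshot();
        try {
            Thread.sleep(pauseMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the flag and compare right away
        }
        Map<Long, String> second = snapshot();
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<Long, String> e : first.entrySet()) {
            if (e.getValue().equals(second.get(e.getKey()))) {
                stuck.add(e.getValue());
            }
        }
        Collections.sort(stuck);
        return stuck;
    }
}
```

Idle pool threads and JVM housekeeping threads will normally show up in the result as well, so the output still needs the same manual reading as a pair of jstack dumps.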



(Shay Banon) #9

So, it stays hung on that lock (which basically waits for the response to come back)? Basically, when a node is identified as disconnected (on the client side), the client also "releases" all the ongoing messages that have been sent to it. How easy is it to recreate? If it is simple, would you mind terribly jumping on IRC so I can create a debug version of ES to test what's happening?
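
The "release on disconnect" behaviour described here follows a common pattern: keep a table of in-flight requests and fail every entry that belongs to a node once that node is seen as gone. The sketch below shows the pattern with JDK classes only; `PendingRequests` and its methods are hypothetical names, and this is not the Elasticsearch transport implementation.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-flight request table, for illustration only.
final class PendingRequests {
    static final class Request {
        final long id;
        final String nodeId;
        final CompletableFuture<String> response = new CompletableFuture<>();

        Request(long id, String nodeId) {
            this.id = id;
            this.nodeId = nodeId;
        }
    }

    private final Map<Long, Request> pending = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    /** Registers a request sent to nodeId; the caller waits on request.response. */
    Request send(String nodeId) {
        Request request = new Request(nextId.incrementAndGet(), nodeId);
        pending.put(request.id, request);
        return request;
    }

    /** Completes the matching request when its response arrives. */
    void onResponse(long id, String body) {
        Request request = pending.remove(id);
        if (request != null) {
            request.response.complete(body);
        }
    }

    /** Fails every request still waiting on nodeId and returns how many were released. */
    int onNodeDisconnected(String nodeId) {
        int released = 0;
        for (Request request : pending.values()) {
            if (request.nodeId.equals(nodeId) && pending.remove(request.id, request)) {
                request.response.completeExceptionally(
                        new IOException("node disconnected: " + nodeId));
                released++;
            }
        }
        return released;
    }
}
```

If a waiter stays blocked after the node has dropped, as in the thread dump, the two things to check in a design like this are whether the disconnect was ever detected and whether the waiting request was still registered when the release ran.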

On Wednesday, June 15, 2011 at 1:23 AM, Derek Wollenstein wrote:

Shay-
I just got this to reproduce itself using default settings. This
basically is what happens when a previously connected node disappears
when using a local gateway, and unicast discovery. I attempted to
check for the same lock being waited on after running jstack,
sleeping, and running jstack again.

I've put the apparently hung thread in
https://gist.github.com/1026062

Actually I'll go ahead and updated this to be the complete thread
dump.

Threads "elasticsearch[cached]-pool-1-thread-80",
"elasticsearch[cached]-pool-1-thread-79", "elasticsearch[cached]-
pool-1-thread-78", and at least one more are exhibiting this problem.


