Before this, I ran a test where d101 was disconnected and reconnected
within the tcp timeout, which I had bumped to 180 seconds for testing.
Here are the times when things were going on that can be correlated
to the log files.
00:05 - d101 disconnected from network
00:11 - d101 reconnected to network
00:15 - d101 joins back to the d102/d103 cluster, which was in yellow
state. This is when the data disappears. It appears to me that d102
recovers empty data from its local gateway and syncs that out to
d101.
Looks like a decent chunk of shards were lost; a couple of the
specific indexes that were hit appear to be index28 and cbsmw.
Let me know if you need any more details or more detailed log
captures.
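For reference, the disconnect can be simulated along these lines (a
sketch only; the interface name eth0 is an assumption, not something
stated above):

```shell
# Sketch of the disconnect/reconnect test on d101. Requires root.
# The interface name (eth0) is assumed; adjust for your host.

# 00:05 - take d101 off the network
ip link set eth0 down

# ... leave it down for the test window ...

# 00:11 - reconnect
ip link set eth0 up
```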
Thanks for the info, certainly something I would like to fix. You said you saw the d101 node being detected as disconnected from the cluster (resulting in the rebalancing of shards), but I don't see it in the logs. Maybe it's in the previous day's logs?
-shay.banon
On Friday, March 11, 2011 at 9:30 AM, Paul wrote:
Running more tests where I have every shard on every node to see if I
can get the same behavior. So far, I haven't been able to reproduce.
I wanted to note that I do not get any cluster disconnect indications
in the logs with default logging on 15.2. All I see is a timeout
exception occurring for an index operation.
Please let me know any further details needed to debug; I'd be happy
to capture any logs, just let me know what levels you want me to bump
up.
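For anyone following along, these are the kinds of checks and the
logging bump involved (standard 0.15-era APIs; the logger names in the
comment are an assumption, adjust to your install):

```shell
# Cluster-level checks that can be correlated with the log files
# (health shows the yellow/red state; state shows shard allocation):
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'

# To bump logging, edit config/logging.yml on each node, e.g.
# (logger names here are an assumption):
#   logger:
#     discovery: TRACE
#     gateway: TRACE
# then restart the node.
```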
Got you. So, to validate the scenario: you disconnect d101 from the network and keep indexing docs, then later reconnect it. It is not considered disconnected from the cluster (fault detection has not indicated that, since it did not break its timeout limit), but shards move around because the index operations failed (they time out).
If that's the case, I can also try to recreate it locally.
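A hypothetical repro loop for that scenario (the index name testidx
and the doc body are made up; run this while d101 is off the network):

```shell
# Keep indexing docs against the cluster while one node is
# disconnected, so index operations on its shards time out.
# 'testidx' and the document shape are illustrative only.
while true; do
  curl -XPOST 'http://localhost:9200/testidx/doc' \
       -d '{"ts": "'"$(date +%s)"'"}'
  sleep 1
done
```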
On Friday, March 11, 2011 at 11:35 PM, Paul wrote:
Great, thanks. I am a little surprised that a failed index operation
would trigger a cluster rebalance. I figured the index operation would
just fail and the cluster wouldn't start rebalancing until the TCP
timeout kicked in.
I'd be very happy to give any potential fixes a spin on my side to see
if I can still reproduce.
Curious whether this same logic can be applied when the mapping for a
field is missing in an index. Right now, if you want a search to span
two indexes, it will fail if the sort field isn't defined in both.
It'd be convenient to optionally treat documents from the index
missing the sort-field mapping as having a null value.
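To illustrate the failure mode (the index names match the thread; the
sort field my_sort_field is a made-up example):

```shell
# A search spanning two indexes with a sort. If one of the indexes
# lacks a mapping for the sort field, the request currently fails.
# 'my_sort_field' is a hypothetical field name for illustration.
curl -XGET 'http://localhost:9200/index28,cbsmw/_search' -d '{
  "query": { "match_all": {} },
  "sort": [ { "my_sort_field": { "order": "asc" } } ]
}'
```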
I think you posted this on the wrong thread :). Maybe it relates to handling missing values when sorting? If so, it might be possible to handle cases where the field is not defined (though a bit tricky); open an issue for it so we won't lose track.
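A hypothetical shape such an option could take (not an existing API at
the time of this thread; the field name and the "missing" parameter
are purely illustrative):

```shell
# Illustrative only: a sort option that sends docs with no value for
# the field to the end of the results instead of failing the request.
curl -XGET 'http://localhost:9200/index28,cbsmw/_search' -d '{
  "query": { "match_all": {} },
  "sort": [ { "my_sort_field": { "order": "asc", "missing": "_last" } } ]
}'
```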
On Tuesday, March 15, 2011 at 12:49 AM, Paul wrote:
Ah, cool stuff.