Shaky Cross-Cluster Search between 6.5.1 and 5.6.2

Hi everyone,

I originally intended to post the full content here, but Discuss wouldn't let me because it was too long. Since I feel it's important to explain the full context, I posted it on Stack Overflow instead and am linking it here:

We're looking for expert advice on how to solve this situation, thanks again for your attention.

[2019-02-14T23:53:52,630][WARN ][o.e.t.n.Netty4Transport  ] [IK-PRD-M3] exception caught on transport layer [[id: 0xd97a9d8c, L:/10.10.1.184:51594 - R:10.10.1.166/10.10.1.166:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [7719647], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportActionProxy$ProxyResponseHandler@7f2fcd88], error [false]; resetting
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1399) ~[elasticsearch-5.6.2.jar:5.6.2]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.6.2.jar:5.6.2]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_74]

This is, I think, the source of your woes, and it sounds like a bug. It's similar to https://github.com/elastic/elasticsearch/issues/28713, except that that issue was fixed in 6.2.3. This sort of serialisation issue is a royal pain to track down, and we typically need to see a packet capture between the two nodes that includes the unreadable response.

Can you tell us more about the two nodes with addresses mentioned here:

L:/10.10.1.184:51594 - R:10.10.1.166/10.10.1.166:9300

Specifically which versions are they? Are you using TLS? If not, can you grab a packet capture of the traffic between these nodes on port 9300 that includes one of these exceptions and share it with us?
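
For reference, a capture along these lines, run on the master node, should be enough; the interface name eth0 is only a guess, so adjust it to your environment:

    # capture full packets on the transport port to/from the data node and write them to a file
    tcpdump -i eth0 -s 0 -w es-transport.pcap 'port 9300 and host 10.10.1.166'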

Thanks for your answer @DavidTurner!

  • 10.10.1.184 (IK-PRD-M3) is one of the master nodes of the older cluster (5.6.2), i.e. the node the log above came from.
  • 10.10.1.166 (IK-PRD-D103) is one of the data nodes in the same 5.6.2 cluster, i.e. the node that gets dropped as a consequence of this issue.

So it does look a bit different from the issue you linked to, since this is all happening within the 5.6.2 cluster, not the 6.5.1 one (which is fine).

We're not using TLS. It's not always the same node that gets dropped, but I'll see how we can provide you with a packet capture between the master node and another one that often gets dropped.

@DavidTurner
We have captured some exchanges before, during and after the exceptions occur.
Let me know how I can send them to you... Discuss won't allow me to send you a private message right now.
Thanks

You can email them to me at david.turner@elastic.co.

Thanks, I sent them.

Also, for others running into the same situation, it's worth noting that we somehow "mitigated" it by increasing the number of replicas per shard from 1 to 3. That way, when 2-3 nodes get dropped within a few seconds, we're less likely to go red, since at least one shard copy is usually still there to keep us in yellow. It's not an exact science, but it helps a bit...
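
For reference, the change itself was just an index settings update along these lines (the index pattern below is a placeholder for our real indices):

    # raise the replica count on the affected indices from 1 to 3
    curl -XPUT 'http://localhost:9200/my-index-*/_settings' -H 'Content-Type: application/json' -d '
    {
      "index": { "number_of_replicas": 3 }
    }'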

It looks like this is the issue fixed by #26881 in 5.6.3. At least, I see a response matching the description of that issue just before a connection is dropped. Unfortunately the logs corresponding to the packet capture were not available, so I can't be 100% sure, but it seems pretty likely.

We'd generally recommend running the whole cluster on a single version, so upgrading everything to a later version in the 5.6 series seems like the best path forward. You say in the Stack Overflow post that you tried adding some upgraded client nodes and that this actually made things worse. I'm not sure why that would be, and it might be a different issue. We try to test many mixed-version scenarios to catch this kind of problem, but a 6.5.1 cluster talking via CCS to a mixed 5.6.3/5.6.2 cluster searching 2.x indices is definitely outside what the test suite covers today.
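
For anyone following along, the setup under discussion looks roughly like this on the 6.5.1 side; the cluster alias, seed address and index pattern here are only illustrative (and on 5.x the equivalent setting was search.remote rather than cluster.remote):

    # register the 5.6 cluster as a remote on the 6.5.1 cluster
    curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "persistent": {
        "cluster.remote.legacy.seeds": ["10.10.1.184:9300"]
      }
    }'

    # cross-cluster searches then use the <alias>:<index> syntax
    curl 'http://localhost:9200/legacy:some-index-*/_search?pretty'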

Thank you, David, for investigating this. We will upgrade all 5.6.2 nodes to 5.6.3 to see whether #26881 actually fixes the problem. If not, I'll provide a new packet capture plus logs from two nodes that exhibit the issue.

I'll report our findings here in any case after we're done upgrading.
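
The plan is essentially the usual rolling-upgrade routine, roughly along these lines (a sketch, not the exact commands we'll run):

    # 1. disable shard allocation so shards aren't reallocated while a node is down
    curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    { "transient": { "cluster.routing.allocation.enable": "none" } }'

    # 2. optionally run a synced flush to speed up shard recovery afterwards
    curl -XPOST 'http://localhost:9200/_flush/synced'

    # 3. stop the node, install 5.6.3, restart it and wait for it to rejoin the cluster

    # 4. re-enable allocation and wait for green before moving on to the next node
    curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    { "transient": { "cluster.routing.allocation.enable": "all" } }'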

@DavidTurner we've finally taken the time to upgrade our 5.6.2 cluster to 5.6.3 to see if the problem gets solved... :drum:

And it was indeed the case: we have a green cluster again, now powered by ES 5.6.3, and we've seen no issues (nodes leaving, etc.) for the past few hours.

We'll keep our fingers crossed that this was the right fix. Thank you so much for investigating our packet captures and your guidance. Much appreciated!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.