Client trying to route document to non-existent node?

I'm running in a bare-metal Kubernetes environment, and I'm seeing documents sent from a fluentd instance being routed by my ingestor (client) node to a non-existent data node:

2020-07-13 00:08:18 +0000 [warn]: #0 [elasticsearch] failed to flush the buffer. retry_time=12 next_retry_seconds=2020-07-13 00:42:42 +0000 chunk="5aa1cb4f03d198373ac2415b4d783073" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"host\", :port=>443, :scheme=>\"https\", :user=>\"user\", :password=>\"obfuscated\"}): No route to host - connect(2) for 10.244.5.18:9200 (Errno::EHOSTUNREACH)"

Specifically, 10.244.5.18:9200 references a node that doesn't exist according to the _nodes endpoint.

I'm also seeing the same error for nodes that do exist...

No route to host - connect(2) for 10.244.21.178:9200 (Errno::EHOSTUNREACH)"
        "cT6aQm2sRGm82NWv9aEyHw": {
            "name": "master-0",
            "transport_address": "10.244.21.178:9300",
            "host": "10.244.21.178",
            "ip": "10.244.21.178",

Note that from within the client itself, I can reach that address:

[root@elasticsearch-es-client-f65788c6b-qqmhp elasticsearch]# curl 10.244.21.178:9200 -u xxxxx:yyyyyy
{
  "name" : "master-0",
  ...
}

I figure this may be a Kubernetes-specific routing issue, but the fact that an old IP is still being referenced for some documents also concerns me.

Where exactly do the clients fetch the node information from? I want to check whether it is outdated and update it somehow, but I don't know where to look.
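
For reference, this is roughly how I'm checking what the _nodes endpoint reports, run from inside one of the client pods (the user/password here are placeholders):

# List the HTTP publish addresses the cluster currently advertises
curl -s -u user:password 'localhost:9200/_nodes/http?filter_path=nodes.*.name,nodes.*.http.publish_address&pretty'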

Any immediate ideas on why this may happen?

I'm checking the CNI side of things as we speak.

Bump

Can you provide some additional details?

  1. Which client are you using?
  2. How is the client's connection configured? For example, is it using sniffing, such as a sniffing connection pool (see the sketch below for the kind of options I mean)?
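
If the client in question is fluentd's elasticsearch output (fluent-plugin-elasticsearch), the sniffing/connection-reload behaviour is controlled by a handful of plugin options; a quick way to see whether any of them are set (the config path is only an assumption) is something like:

# Look for the options that make the plugin re-discover node addresses on its own
grep -rE 'reload_connections|reload_on_failure|reload_after|sniffer_class_name' /fluentd/etc/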

@forloop

  1. By client, I mean the Elasticsearch Client nodes; fluentd is forwarding the documents.
  2. It uses the default internal cluster transport on 9300 and whatever sniffing is used by default; they are all within the same physical (k8s) network namespace.

So this error is logged in fluentd, but it's from the Elasticsearch ingest/client node.

Edit: Sorry, I realise you might also mean fluentd sending to the ES Client nodes: they use a k8s LoadBalancer that forwards to a service, which then round-robins the traffic.

Note, the error shows an internal k8s IP, which to me implies fluentd -> ES Client is OK and working as expected. It's the ES Client -> ES Data path that seems to be having issues.
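
On the Kubernetes side, the checks I'm running amount to roughly this (the service name and namespace are placeholders from my setup):

# Does the stale IP still belong to any pod?
kubectl get pods -A -o wide | grep 10.244.5.18

# Which endpoints does the Elasticsearch service currently resolve to?
kubectl get endpoints elasticsearch-es-http -n elastic-system -o wide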

I don't really have experience with this kind of setup, I'm afraid.

If the client/coordinating nodes are part of the cluster, which can be checked with _cat/nodes, then this would imply to me that the issue is external to Elasticsearch: client/coordinating nodes distribute requests to other nodes in the cluster over the transport layer (default port 9300), but the error message relates to an issue on port 9200, the default port for the HTTP layer.
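
Something along these lines should list the nodes and their roles; the host and credentials are placeholders:

# List the cluster's nodes with their IPs and roles
curl -s -u user:password 'https://host:9200/_cat/nodes?v&h=name,ip,node.role,master'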

@forloop Thank you for the input. I'm also scratching my head, as I think the error is a red herring for a bigger problem.

I'll keep trying to knock out what I can and report back if/when I find out what's causing it.

Bumping this, as I've removed the client/ingestor nodes entirely, so this is now fluentd, from outside the cluster, going through an external ingress directly to the data nodes and still getting this error.

Note: other fluentd instances running elsewhere are fine.
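
For completeness, this is roughly how I'm checking what the data nodes advertise when queried through the external ingress, to compare against the addresses fluentd is actually trying to reach (the ingress hostname and credentials are placeholders):

# HTTP publish addresses as seen through the external ingress
curl -sk -u user:password 'https://es-ingress.example.com/_nodes/http?filter_path=nodes.*.http.publish_address&pretty'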
