o.e.d.z.UnicastZenPing - failed to resolve host: java.net.UnknownHostException

Hello. I have a simple problem: my Elasticsearch nodes don't communicate with each other. Could you help me? :worried:

Elasticsearch 6.7.0 (installed from rpm)
CentOS Linux 7.6.1810
Linux 3.10.0-957.5.1.el7.x86_64 on x86_64

Here is what I did:

  1. configured the servers to use a LAN IP pool (10.0.0.X/24) and stopped the firewalls on all nodes
  2. configured the .yml file on all nodes
  3. cross-checked with "curl 10.0.0.x:9200" and got successful results
  4. cross-checked name resolution with "nslookup esX" and got successful results

Here is the log of elasticsearch:

[2019-03-29T12:31:01,652][INFO ][o.e.e.NodeEnvironment ] [es1] using [1] data paths, mounts [[/ (rootfs)]], net usable_space [671.1gb], net total_space [899.5gb], types [rootfs]
[2019-03-29T12:31:01,655][INFO ][o.e.e.NodeEnvironment ] [es1] heap size [989.8mb], compressed ordinary object pointers [true]
[2019-03-29T12:31:01,683][INFO ][o.e.n.Node ] [es1] node name [es1], node ID [-JW1C2VySX6IQGQcBWQTXg]
[2019-03-29T12:31:01,684][INFO ][o.e.n.Node ] [es1] version[6.7.0], pid[18024], build[default/rpm/8453f77/2019-03-21T15:32:29.844721Z], OS[Linux/3.10.0-957.5.1.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_201/25.201-b09]
[2019-03-29T12:31:01,684][INFO ][o.e.n.Node ] [es1] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch-5039937088361580624, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:/var/log/elasticsearch/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=rpm]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [aggs-matrix-stats]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [analysis-common]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [ingest-common]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [ingest-geoip]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [ingest-user-agent]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [lang-expression]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [lang-mustache]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [lang-painless]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [mapper-extras]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [parent-join]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [percolator]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [rank-eval]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [reindex]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [repository-url]
[2019-03-29T12:31:03,194][INFO ][o.e.p.PluginsService ] [es1] loaded module [transport-netty4]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [tribe]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-ccr]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-core]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-deprecation]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-graph]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-ilm]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-logstash]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-ml]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-monitoring]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-rollup]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-security]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-sql]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-upgrade]
[2019-03-29T12:31:03,195][INFO ][o.e.p.PluginsService ] [es1] loaded module [x-pack-watcher]
[2019-03-29T12:31:03,196][INFO ][o.e.p.PluginsService ] [es1] no plugins loaded
[2019-03-29T12:31:06,447][INFO ][o.e.x.s.a.s.FileRolesStore] [es1] parsed [0] roles from file [/etc/elasticsearch/roles.yml]
[2019-03-29T12:31:06,995][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [es1] [controller/18145] [Main.cc@109] controller (64 bit): Version 6.7.0 (Build d74ae2ac01b10d) Copyright (c) 2019 Elasticsearch BV
[2019-03-29T12:31:07,355][DEBUG][o.e.a.ActionModule ] [es1] Using REST wrapper from plugin org.elasticsearch.xpack.security.Security
[2019-03-29T12:31:07,598][INFO ][o.e.d.DiscoveryModule ] [es1] using discovery type [zen] and host providers [settings]
[2019-03-29T12:31:08,255][INFO ][o.e.n.Node ] [es1] initialized
[2019-03-29T12:31:08,255][INFO ][o.e.n.Node ] [es1] starting ...
[2019-03-29T12:31:08,362][INFO ][o.e.t.TransportService ] [es1] publish_address {10.0.0.1:9300}, bound_addresses {10.0.0.1:9300}
[2019-03-29T12:31:08,385][INFO ][o.e.b.BootstrapChecks ] [es1] bound or publishing to a non-loopback address, enforcing bootstrap checks

[2019-03-29T12:31:08,433][WARN ][o.e.d.z.UnicastZenPing ] [es1] failed to resolve host [“es1”]
java.net.UnknownHostException: “es1”: Temporary failure in name resolution
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.8.0_201]
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) ~[?:1.8.0_201]
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName0(InetAddress.java:1277) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_201]
at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:536) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:489) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:737) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$resolveHostsLists$0(UnicastZenPing.java:189) ~[elasticsearch-6.7.0.jar:6.7.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_201]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.7.0.jar:6.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-03-29T12:31:08,438][WARN ][o.e.d.z.UnicastZenPing ] [es1] failed to resolve host [“es2”]
java.net.UnknownHostException: “es2”: Temporary failure in name resolution
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.8.0_201]
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) ~[?:1.8.0_201]
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName0(InetAddress.java:1277) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_201]
at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:536) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:489) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:737) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$resolveHostsLists$0(UnicastZenPing.java:189) ~[elasticsearch-6.7.0.jar:6.7.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_201]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.7.0.jar:6.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-03-29T12:31:11,443][WARN ][o.e.d.z.ZenDiscovery ] [es1] not enough master nodes discovered during pinging (found [[Candidate{node={es1}{-JW1C2VySX6IQGQcBWQTXg}{JaSkzeqjRsS5FDjuwwms_Q}{10.0.0.1}{10.0.0.1:9300}{ml.machine_memory=33535541248, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [3]), pinging again

The results from cross-checking network:

[root@dhpc01 ~]# ping 10.0.0.1 -c 1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.040 ms

--- 10.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.040/0.040/0.040/0.000 ms
[root@dhpc01 ~]# ping 10.0.0.2 -c 1
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.136 ms

--- 10.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.136/0.136/0.136/0.000 ms
[root@dhpc01 ~]# ping es1 -c 1
PING es1 (10.0.0.1) 56(84) bytes of data.
64 bytes from dhpc01 (10.0.0.1): icmp_seq=1 ttl=64 time=0.042 ms

--- es1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.042/0.042/0.042/0.000 ms
[root@dhpc01 ~]# ping es2 -c 1
PING es2 (10.0.0.2) 56(84) bytes of data.
64 bytes from dhpc02 (10.0.0.2): icmp_seq=1 ttl=64 time=0.152 ms

--- es2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.152/0.152/0.152/0.000 ms

[root@dhpc01 ~]# curl es1:9200
{
"name" : "es1",
"cluster_name" : "demir-elastic",
"cluster_uuid" : "na",
"version" : {
"number" : "6.7.0",
"build_flavor" : "default",
"build_type" : "rpm",
"build_hash" : "8453f77",
"build_date" : "2019-03-21T15:32:29.844721Z",
"build_snapshot" : false,
"lucene_version" : "7.7.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
[root@dhpc01 ~]# curl es2:9200
{
"name" : "es2",
"cluster_name" : "demir-elastic",
"cluster_uuid" : "na",
"version" : {
"number" : "6.7.0",
"build_flavor" : "default",
"build_type" : "rpm",
"build_hash" : "8453f77",
"build_date" : "2019-03-21T15:32:29.844721Z",
"build_snapshot" : false,
"lucene_version" : "7.7.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
[root@dhpc01 ~]# nslookup es1
;; Got SERVFAIL reply from 193.140.25.30, trying next server
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: es1
Address: 10.0.0.1

[root@dhpc01 ~]# nslookup es2
;; Got SERVFAIL reply from 193.140.25.30, trying next server
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: es2
Address: 10.0.0.2

[root@dhpc02 ~]# curl es1:9200
{
"name" : "es1",
"cluster_name" : "demir-elastic",
"cluster_uuid" : "na",
"version" : {
"number" : "6.7.0",
"build_flavor" : "default",
"build_type" : "rpm",
"build_hash" : "8453f77",
"build_date" : "2019-03-21T15:32:29.844721Z",
"build_snapshot" : false,
"lucene_version" : "7.7.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
[root@dhpc02 ~]# curl es2:9200
{
"name" : "es2",
"cluster_name" : "demir-elastic",
"cluster_uuid" : "na",
"version" : {
"number" : "6.7.0",
"build_flavor" : "default",
"build_type" : "rpm",
"build_hash" : "8453f77",
"build_date" : "2019-03-21T15:32:29.844721Z",
"build_snapshot" : false,
"lucene_version" : "7.7.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
[root@dhpc02 ~]# nslookup es1
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: es1
Address: 10.0.0.1

[root@dhpc02 ~]# nslookup es2
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: es2
Address: 10.0.0.2

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: demir-elastic
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: es1
node.master: true
node.data: true
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/lib/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 10.0.0.1
#
# Set a custom port for HTTP:
#
http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: [“es1” , “es2”, “es3” , “es4” , “es5” , “es6” , “es7” , “es8” , “es9” , “es10” , “es11” , “es12” , “es13” , “es14” , “es15” , “es16”]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 3
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

Thanks for the detailed report @lemon_soft.

I'm suspicious about those SERVFAIL reply messages. It looks like the first lookup is failing and then nslookup is retrying with a different DNS server, and I am wondering if the JVM's DNS resolver doesn't do this. Can you adjust your DNS config so that these lookups succeed on the first try?

Thanks, DavidTurner. I changed the order of the DNS servers, but the problem still exists.

The reply from nslookup:

[root@dhpc01 ~]# nslookup es1
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: es1
Address: 10.0.0.1

[root@dhpc01 ~]# nslookup es2
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: es2
Address: 10.0.0.2

The error log again:

[2019-03-29T13:28:39,872][WARN ][o.e.d.z.ZenDiscovery ] [es1] not enough master nodes discovered during pinging (found [[Candidate{node={es1}{-JW1C2VySX6IQGQcBWQTXg}{7FRT-_vmQQmc181c03qWiw}{10.0.0.1}{10.0.0.1:9300}{ml.machine_memory=33535541248, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [3]), pinging again
[2019-03-29T13:28:39,873][WARN ][o.e.d.z.UnicastZenPing ] [es1] failed to resolve host [“es1”]
java.net.UnknownHostException: “es1”
at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_201]
at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:536) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:489) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:737) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$resolveHostsLists$0(UnicastZenPing.java:189) ~[elasticsearch-6.7.0.jar:6.7.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_201]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.7.0.jar:6.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-03-29T13:28:39,874][WARN ][o.e.d.z.UnicastZenPing ] [es1] failed to resolve host [“es2”]
java.net.UnknownHostException: “es2”
at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_201]
at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_201]
at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:536) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:489) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:737) ~[elasticsearch-6.7.0.jar:6.7.0]
at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$resolveHostsLists$0(UnicastZenPing.java:189) ~[elasticsearch-6.7.0.jar:6.7.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_201]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.7.0.jar:6.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]

Hmm. OK, there is definitely still something wrong with name lookups.

Just to check, did you restart the node after fixing the DNS config? I think it only picks up changes when restarted.

I think my next step would be to investigate what's actually happening at the DNS level using tcpdump, comparing the successful query generated by nslookup to the failing one generated by Elasticsearch. Perhaps this'll give us a clue.

Yes, I restarted elasticsearch.

I ran a tcpdump (tcpdump -i any udp port 53 > tcpdump.txt). Here is the download link for the dump file (uploaded separately because of the post length limit).

I should also mention dnsmasq. I don't use a DNS server for resolving these hosts, because in my environment I am not authorized to add or update DNS server records. Instead I use dnsmasq, which answers DNS queries from the /etc/hosts file.

[root@dhpc01 ~]# tcpdump -D
1.nflog (Linux netfilter log (NFLOG) interface)
2.nfqueue (Linux netfilter queue (NFQUEUE) interface)
3.em1
4.usbmon1 (USB bus number 1)
5.em2
6.usbmon2 (USB bus number 2)
7.any (Pseudo-device that captures on all interfaces)
8.lo [Loopback]

I noticed a strange record in tcpdump.txt:

[root@dhpc01 ~]# ping es2.cs.deu.edu.tr
ping: es2.cs.deu.edu.tr: Name or service not known
[root@dhpc01 ~]# ping es2
PING es2 (10.0.0.2) 56(84) bytes of data.
64 bytes from dhpc02 (10.0.0.2): icmp_seq=1 ttl=64 time=0.159 ms
64 bytes from dhpc02 (10.0.0.2): icmp_seq=2 ttl=64 time=0.180 ms
^C
--- es2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.159/0.169/0.180/0.016 ms

I can't see any successful lookups over DNS in the log you shared.

The text output from tcpdump doesn't normally include enough information for diagnosis. In general it's better to look at the raw dump:

tcpdump -i any udp port 53 -s65535 -wtcpdump.pcap

Then open tcpdump.pcap in something like Wireshark.

However, I think I see the problem. The queries in the log look rather odd:

14:34:01.876591 IP dhpc01.cs.deu.edu.tr.27170 > tnz1030.tinaztepe.deu.edu.tr.domain: 40676+ A? M-bM-^@M-^\es2M-bM-^@M-^]. (27)

Here's what a normal query looks like on my laptop:

11:59:17.368076 IP 192.168.1.179.51532 > 192.168.1.12.53: 3345+ A? es1. (21)

Note all that extra junk M-bM-^@M-^\ that shouldn't be there. I looked more closely at your config file and it seems you have so-called "smart" quotes there:

discovery.zen.ping.unicast.hosts: [“es1” , “es2”, “es3” , “es4” , “es5” , “es6” , “es7” , “es8” , “es9” , “es10” , “es11” , “es12” , “es13” , “es14” , “es15” , “es16”]

Compare this to a correctly-formatted file:

discovery.zen.ping.unicast.hosts: ["es1" , "es2", "es3" , "es4" , "es5" , "es6" , "es7" , "es8" , "es9" , "es10" , "es11" , "es12" , "es13" , "es14" , "es15" , "es16"]

Note the subtly different quotation marks. Try fixing that.
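Curly quotes are hard to spot by eye, so a quick scan for non-ASCII characters can confirm whether a config file contains them. The following is a minimal Python sketch, not part of Elasticsearch; the function name `find_non_ascii` and the sample text are made up for illustration.

```python
import unicodedata


def find_non_ascii(text: str):
    """Yield (line_no, column, char, unicode_name) for each non-ASCII character."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_no, col, ch, unicodedata.name(ch, "UNKNOWN")


# Illustrative sample only: the first host uses curly quotes, the second straight ones.
sample = (
    "cluster.name: demir-elastic\n"
    'discovery.zen.ping.unicast.hosts: [\u201ces1\u201d, "es2"]\n'
)

for hit in find_non_ascii(sample):
    print(hit)
```

To check a real node, read the file instead of the sample string, for example with `open("/etc/elasticsearch/elasticsearch.yml", encoding="utf-8").read()`. That path is an assumption based on the `es.path.conf` value in the startup log; adjust it if your config lives elsewhere.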

I also note that you have discovery.zen.minimum_master_nodes: 3. This is appropriate for clusters with 4 or 5 master-eligible nodes. How many of your nodes are master-eligible? Normally you'd only list the master-eligible nodes in discovery.zen.ping.unicast.hosts, but you have listed 16 there. If you have 16 master-eligible nodes then discovery.zen.minimum_master_nodes should be set to 9.
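The numbers above follow from the formula in the config file comment (total number of master-eligible nodes / 2 + 1, using integer division). A minimal Python sketch of that arithmetic, where the function name is only illustrative:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes: n // 2 + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for n in (3, 4, 5, 16):
    print(n, "master-eligible nodes ->", minimum_master_nodes(n))
```

With 4 or 5 master-eligible nodes this gives 3, and with 16 it gives 9, which matches the values mentioned above.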

It worked after replacing the smart quotes! OMG!

Status: Green
Nodes: 16
Indices: 14
Memory: 4.8 GB / 15.5 GB
Total Shards: 25
Unassigned Shards: 0
Documents: 263,052
Data: 218.8 MB

Thank you so much @DavidTurner. You saved my week! Thanks.

I will dig into this dump further:

Note all that extra junk M-bM-^@M-^\ that shouldn't be there.
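For anyone else digging into a dump like this: the junk appears to be tcpdump's printable rendering of raw bytes, where `M-` marks a byte with the high bit set and `^` marks a control character (the same convention `cat -v` uses). A minimal Python sketch, assuming that convention, turns the text back into bytes. `decode_meta_notation` is a hypothetical helper written for this post, not a tcpdump feature, and it cannot tell a literal "M-" or "^" in a hostname apart from the escape notation.

```python
def decode_meta_notation(text: str) -> bytes:
    """Convert cat -v style notation (M-x, ^x) back into raw bytes.

    Assumes ASCII input in which every "M-" and "^" is an escape prefix.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        high = 0
        if text.startswith("M-", i) and i + 2 < len(text):
            high = 0x80  # "M-" means the high bit of the byte is set
            i += 2
        if text[i] == "^" and i + 1 < len(text):
            # ^@ .. ^_ are control bytes 0x00..0x1F, ^? is DEL (0x7F)
            out.append(high | (ord(text[i + 1]) ^ 0x40))
            i += 2
        else:
            out.append(high | ord(text[i]))
            i += 1
    return bytes(out)


raw = decode_meta_notation(r"M-bM-^@M-^\es2M-bM-^@M-^]")
print(raw.hex(" "))         # e2 80 9c 65 73 32 e2 80 9d
print(raw.decode("utf-8"))  # “es2”
```

The bytes `e2 80 9c` and `e2 80 9d` are the UTF-8 encodings of the left and right curly double quotes, so the name being looked up really was “es2” with the quotes included. Those six extra bytes also account for the query length of 27 in the dump, compared with 21 for a plain three-character name.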

Note:
I am crying because I used the Webmin interface to edit the files. Webmin's text editor is what caused the "smart" quote problem! And Elasticsearch does care about that.
