Elasticsearch Issue I can't figure out (can't curl 9200)

I made a thread on reddit and was recommended to make one here: https://www.reddit.com/r/elasticsearch/comments/8dvkx0/elasticsearch_issue_i_cant_figure_out_cant_curl/

We have a really annoying issue that is affecting our production cluster, so any guidance would really be appreciated.

Problem: I have a 3-node ES cluster. Randomly, an Elasticsearch non-master node suddenly stops working. The service itself is running fine; however, curl (only on port 9200) simply times out:

curl localhost:9200 --verbose
* About to connect() to localhost port 9200 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 9200 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
>
*  Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

This happens intermittently at first, for roughly 10 minutes, before curl times out every time on port 9200. It can happen to one non-master node or to both non-master nodes (at different times), and what is more alarming is that search stops responding even though the cluster state remains green.

Only one of the Elasticsearch non-master nodes becomes suddenly or intermittently unable to ping the master, which is confusing: if it were a network problem, why can one node ping the master but not the other? There's nothing significant in the master node's logs.

When I check netstat during the 10-minute period in which curl is intermittently timing out, the Recv-Q on port 9200 slowly builds up until it sits at 129 permanently:

netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6     129      0 :::9200                 :::*                    LISTEN      22554/java
tcp6       0      0 :::9300                 :::*                    LISTEN      22554/java

I'm not sure why the above happens. I have the following settings configured:

net.ipv4.tcp_max_syn_backlog = 512
net.ipv4.tcp_syncookies = 1

But there's nothing in the messages log about a SYN flood attack.

To temporarily get the node working again I have to kill the Elasticsearch process (a systemctl restart just hangs) and then start it up again. The curl issue never affects the master node, and I can curl other processes on the problem node just fine (i.e. curl localhost:80 works, but localhost:9200 does not).

The nginx reverse HTTP proxy simply returns a 504 or 502. All system metrics look fine in terms of usage, and the same goes for ES: heap size and networking all look normal before the drop-off.

Our cluster is currently running on Virtual Machines in Azure.

Has anyone ever come across this problem before, or has any guidance on things I can check, or even double check in case I've missed something?

Within that 10-minute window in which curl intermittently times out, the affected non-master node's logs show this:

[2018-04-26T06:10:09,764][TRACE][o.e.d.z.MasterFaultDetection] [10.201.35.90] [master] failed to ping [{10.201.35.92}{F8X4bDfJS9qbAyUKLnb4-g}{BwXNeJAFTWex_eSQ0Wq2VA}{10.201.35.92}{10.201.35.92:9300}], retry [1] out of [3]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [10.201.35.92][10.201.35.92:9300][internal:discovery/zen/fd/master_ping] request_id [203945] timed out after [30000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]


[2018-04-26T06:16:24,817][TRACE][o.e.d.z.MasterFaultDetection] [10.201.35.90] [master] failed to ping [{10.201.35.92}{F8X4bDfJS9qbAyUKLnb4-g}{BwXNeJAFTWex_eSQ0Wq2VA}{10.201.35.92}{10.201.35.92:9300}], retry [2] out of [3]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [10.201.35.92][10.201.35.92:9300][internal:discovery/zen/fd/master_ping] request_id [204163] timed out after [30000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]


[2018-04-26T06:16:54,817][TRACE][o.e.d.z.MasterFaultDetection] [10.201.35.90] [master] failed to ping [{10.201.35.92}{F8X4bDfJS9qbAyUKLnb4-g}{BwXNeJAFTWex_eSQ0Wq2VA}{10.201.35.92}{10.201.35.92:9300}], retry [3] out of [3]

[2018-04-26T06:17:13,122][DEBUG][o.e.d.z.MasterFaultDetection] [10.201.35.90] [master] restarting fault detection against master [{10.201.35.92}{F8X4bDfJS9qbAyUKLnb4-g}{BwXNeJAFTWex_eSQ0Wq2VA}{10.201.35.92}{10.201.35.92:9300}], reason [new cluster state received and we are monitoring the wrong master [null]]
[2018-04-26T06:17:13,122][DEBUG][o.e.d.z.ZenDiscovery     ] [10.201.35.90] got first state from fresh master [F8X4bDfJS9qbAyUKLnb4-g]
[2018-04-26T06:17:13,122][TRACE][o.e.c.s.ClusterService   ] [10.201.35.90] [22], isa_ids [l3xastpqTOSKLqWkwDCb_Q, 6GhEcWu5RIe5inax9fbU3w]
  1: p_term [19], isa_ids [wrm6io-hQPK0-Er6kdK5NQ, 2DTo_8ufR9uKJOTEPe4f9Q]
nodes:
   {10.201.35.92}{F8X4bDfJS9qbAyUKLnb4-g}{BwXNeJAFTWex_eSQ0Wq2VA}{10.201.35.92}{10.201.35.92:9300}, master
   {10.201.35.91}{NBvD2N1hTcOowYMJNa7U7Q}{nZixcQyRSay40jdUsnmH3A}{10.201.35.91}{10.201.35.91:9300}
   {10.201.35.90}{hQ-bvLrBT527V0kCC4Vr7Q}{EygR5Aj8QG2kdbZXPGiz7w}{10.201.35.90}{10.201.35.90:9300}, local
routing_table (version 1221):
-- index [[amg_sem_v16/Is4J0UVlQviucVOG0LufMg]]
----shard_id [amg_sem_v16][0]
--------[amg_sem_v16][0], node[NBvD2N1hTcOowYMJNa7U7Q], [R], s[STARTED], a[id=z1kJFy-dS2uWn8U5UXdEYw]
--------[amg_sem_v16][0], node[hQ-bvLrBT527V0kCC4Vr7Q], [P], s[STARTED], a[id=UfXmasPuQlGEtgJ5QwAR5Q]
----shard_id [amg_sem_v16][1]
--------[amg_sem_v16][1], node[F8X4bDfJS9qbAyUKLnb4-g], [P], s[STARTED], a[id=QwAfbHV8QGOfeAW0cssEOQ]
--------[amg_sem_v16][1], node[NBvD2N1hTcOowYMJNa7U7Q], [R], s[STARTED], a[id=aSY7gcbiR4uYWow0jY6QkQ]
----shard_id [amg_sem_v16][2]
--------[amg_sem_v16][2], node[F8X4bDfJS9qbAyUKLnb4-g], [R], s[STARTED], a[id=8whHS1JQSFSiEby3UM_q5A]
--------[amg_sem_v16][2], node[hQ-bvLrBT527V0kCC4Vr7Q], [P], s[STARTED], a[id=WkvXohrKRLGQNpfhq3b-VA]



[2018-04-26T06:17:57,822][TRACE][o.e.d.z.ZenDiscovery     ] [10.201.35.90] failed to send join request to master [{10.201.35.92}{F8X4bDfJS9qbAyUKLnb4-g}{BwXNeJAFTWex_eSQ0Wq2VA}{10.201.35.92}{10.201.35.92:9300}]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:63) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:33) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.discovery.zen.MembershipAction.sendJoinRequestBlocking(MembershipAction.java:95) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:463) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:406) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.access$3800(ZenDiscovery.java:81) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1180) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:232) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:67) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:61) ~[elasticsearch-5.2.2.jar:5.2.2]
    ... 10 more


[2018-04-26T06:33:56,251][TRACE][o.e.d.z.MasterFaultDetection] [10.201.35.90] [master] failed to ping [{az01-phoenix-search-prod-03-phoenix-search-prod}{F8X4bDfJS9qbAyUKLnb4-g}{BwXNeJAFTWex_eSQ0Wq2VA}{10.201.35.92}{10.201.35.92:9300}], retry [1] out of [3]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [10.201.35.92][10.201.35.92:9300][internal:discovery/zen/fd/master_ping] request_id [204351] timed out after [30000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) [elasticsearch-5.2.2.jar:5.2.2]

bump

As you have 3 nodes in the cluster, I would recommend that you set all nodes to be master-eligible (if you have not already done that) and set minimum_master_nodes to 2, according to these guidelines.

In order for us to get a better understanding of what your cluster looks like, it would be great if you could provide the full output of the cluster stats API.

Do you have any non-standard settings in your node configs? It would also help if you could look through your Elasticsearch logs to see if any of the nodes are reporting anything related to e.g. long or frequent GC.

Hello,

Please see the requested information below; we are still having major issues with this:

Kernel version (CentOS 7):

Linux a2-search-cluster-01.com 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

elasticsearch.yml:

cat /etc/elasticsearch/company-search-prod/elasticsearch.yml
### MANAGED BY PUPPET ###
---
bootstrap.memory_lock: true
cluster.name: company-search-prod
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts:
- 10.223.230.1
- 10.223.230.2
- 10.223.230.3
gateway.expected_nodes: 3
gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
network.host: 0.0.0.0
node.name:  a2-search-cluster-01.com
path.data: "/data01/elasticsearch/data/company-search-prod"
path.logs: "/var/log/elasticsearch/company-search-prod"
path.repo: "/data-share/backups/elasticsearch-snapshots-prod/es5-prod-0009"
script.inline: true
script.stored: true

curl localhost:9200/_cluster/stats?pretty=true

  {
      "_nodes" : {
        "total" : 3,
        "successful" : 3,
        "failed" : 0
      },
      "cluster_name" : "company-search-prod",
      "timestamp" : 1526991893507,
      "status" : "green",
      "indices" : {
        "count" : 9,
        "shards" : {
          "total" : 40,
          "primaries" : 20,
          "replication" : 1.0,
          "index" : {
            "shards" : {
              "min" : 2,
              "max" : 6,
              "avg" : 4.444444444444445
            },
            "primaries" : {
              "min" : 1,
              "max" : 3,
              "avg" : 2.2222222222222223
            },
            "replication" : {
              "min" : 1.0,
              "max" : 1.0,
              "avg" : 1.0
            }
          }
        },
        "docs" : {
          "count" : 1540258,
          "deleted" : 475475
        },
        "store" : {
          "size_in_bytes" : 22657334218,
          "throttle_time_in_millis" : 0
        },
        "fielddata" : {
          "memory_size_in_bytes" : 0,
          "evictions" : 0
        },
        "query_cache" : {
          "memory_size_in_bytes" : 0,
          "total_count" : 395,
          "hit_count" : 0,
          "miss_count" : 395,
          "cache_size" : 0,
          "cache_count" : 0,
          "evictions" : 0
        },
        "completion" : {
          "size_in_bytes" : 0
        },
        "segments" : {
          "count" : 610,
          "memory_in_bytes" : 67467256,
          "terms_memory_in_bytes" : 47913144,
          "stored_fields_memory_in_bytes" : 4057472,
          "term_vectors_memory_in_bytes" : 0,
          "norms_memory_in_bytes" : 1465088,
          "points_memory_in_bytes" : 915936,
          "doc_values_memory_in_bytes" : 13115616,
          "index_writer_memory_in_bytes" : 0,
          "version_map_memory_in_bytes" : 0,
          "fixed_bit_set_memory_in_bytes" : 0,
          "max_unsafe_auto_id_timestamp" : -1,
          "file_sizes" : { }
        }
      },
      "nodes" : {
        "count" : {
          "total" : 3,
          "data" : 3,
          "coordinating_only" : 0,
          "master" : 3,
          "ingest" : 3
        },
        "versions" : [
          "5.2.2"
        ],
        "os" : {
          "available_processors" : 12,
          "allocated_processors" : 12,
          "names" : [
            {
              "name" : "Linux",
              "count" : 3
            }
          ],
          "mem" : {
            "total_in_bytes" : 44086677504,
            "free_in_bytes" : 2008760320,
            "used_in_bytes" : 42077917184,
            "free_percent" : 5,
            "used_percent" : 95
          }
        },
        "process" : {
          "cpu" : {
            "percent" : 0
          },
          "open_file_descriptors" : {
            "min" : 275,
            "max" : 276,
            "avg" : 275
          }
        },
        "jvm" : {
          "max_uptime_in_millis" : 438357613,
          "versions" : [
            {
              "version" : "1.8.0_131",
              "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
              "vm_version" : "25.131-b11",
              "vm_vendor" : "Oracle Corporation",
              "count" : 3
            }
          ],
          "mem" : {
            "heap_used_in_bytes" : 1248795464,
            "heap_max_in_bytes" : 22443982848
          },
          "threads" : 152
        },
        "fs" : {
          "total_in_bytes" : 804948357120,
          "free_in_bytes" : 763360251904,
          "available_in_bytes" : 722400366592,
          "spins" : "true"
        },
        "plugins" : [
          {
            "name" : "ES-REST",
            "version" : "5.2.17",
            "description" : "Company Custom _companysearch Plugin end-point for search",
            "classname" : "com.company.elasticsearch.plugin.companyREST.EscompanyRestPlugin"
          }
        ],
        "network_types" : {
          "transport_types" : {
            "netty4" : 3
          },
          "http_types" : {
            "netty4" : 3
          }
        }
      }
    }

It looks like you have a custom plugin, ES-REST, which based on the name might have an impact on this. Do you see the same problem without this plugin?

@Christian_Dahlqvist I work with @crypto-99

The approach so far has been to get to the bottom of the issue with the ES-REST plugin in place, as it contains our custom search endpoints. This time we managed to get a thread dump when the issue manifested again.

In the thread dump, it seems that many of the threads are just waiting.
Could you please have a look at the thread dump in the link below and let us know if you can infer anything from it?

Thanks

Hi!

I have faced similar situations in the past where the non-master nodes seemed to have a problem, as cluster-related queries were timing out while the master was reporting a green cluster and looked like the only healthy node :). Also on Azure VMs, by the way, and also with weird ping issues.

In these cases the problem was usually at the elected master and not at the non-master nodes. Restarting the master solved the issue in my case, at least for a few days or weeks.

I suspected a communication problem between the elected master and the other nodes, or that the transport service was somehow frozen or not listening properly, maybe caused by a huge amount of traffic or some limit in Azure's physical or virtual hosts. To me it looked like a hypervisor or networking problem on the cloud side, but I didn't have time to investigate further.

In your case, besides your intention of keeping the customized plugin, it would be great to know if the problem also happens without the plugin, as @Christian_Dahlqvist was suggesting, but in any case I will try to take a look at the thread dump as soon as I find some time.

Thanks @eedugon & @Christian_Dahlqvist for your suggestions.

In the last two days we carried out 15 load-test runs, hitting the targeted endpoints with roughly 60k requests over a period of 30 minutes.
Of the 15, 6 runs were carried out after removing the ES-REST plugin, and we didn't see the issue during any of those runs (the node didn't become unresponsive over HTTP on port 9200). We did see some requests time out, though (on average 10 out of the 60k; only once were there substantially more timeouts, 600+).

For the remaining 9 runs we left the plugin installed. During these runs, whether we saw the issue depended on which endpoint we were targeting. 5 of those 9 runs failed, and the commonality among them was a particular endpoint in the ES-REST plugin. When we looked at the thread dump from one of those failures, we noticed threads for that endpoint in a parked state while trying to do one of the following actions:

SortedMap<String, AliasOrIndex> lookup = client.admin().cluster()
        .prepareState().execute()
        .actionGet().getState()
        .getMetaData().getAliasAndIndexLookup();

or

GetIndexResponse getIndexResponse = 
    client.admin().indices().prepareGetIndex().setIndices(index).get();

Is there any reason why performing the above actions one after the other (though not immediately) as part of processing every user request could cause the node to become unresponsive?
Also, is there a better alternative for getting all the indices and aliases in the cluster at a given point in time? Would something like the cluster state work instead?

clusterService.state().getMetaData().indices().get(index).getIndex();
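
To make the question concrete, here is a minimal sketch of what we mean by "using the cluster state instead", assuming the plugin has the node's ClusterService injected (the helper name LocalStateLookup is purely illustrative). Our understanding, which may well be wrong, is that this reads the node's locally cached state rather than going through a blocking prepareState() transport request on every user call:

import java.util.SortedMap;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.metadata.AliasOrIndex;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.cluster.metadata.MetaData;
import org.elasticsearch.cluster.service.ClusterService;

public class LocalStateLookup {

    private final ClusterService clusterService; // assumed to be injected into the plugin

    public LocalStateLookup(ClusterService clusterService) {
        this.clusterService = clusterService;
    }

    // Same SortedMap<String, AliasOrIndex> as getAliasAndIndexLookup() above,
    // but taken from the node-local cluster state, so no transport round trip.
    public SortedMap<String, AliasOrIndex> aliasAndIndexLookup() {
        ClusterState state = clusterService.state();
        return state.getMetaData().getAliasAndIndexLookup();
    }

    // Rough stand-in for prepareGetIndex().setIndices(index).get(): the
    // IndexMetaData in the local state carries the mappings, settings and aliases.
    public IndexMetaData indexMetaData(String index) {
        MetaData metaData = clusterService.state().getMetaData();
        return metaData.index(index); // null if the index does not exist
    }
}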

Finally, is the approach below the best way to get hold of a document mapper for a particular type?

DocumentMapper docMapper = 
 indicesService.indexServiceSafe(indexObj).mapperService().documentMapper(type);
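
For completeness, here is the same lookup written out with comments on what we believe each call does (mapperFor is just an illustrative wrapper, not part of our plugin). In particular, our understanding is that indexServiceSafe() throws when this node has no IndexService for the index, e.g. when it holds no shard of it, so that case would need handling:

import org.elasticsearch.index.Index;
import org.elasticsearch.index.IndexService;
import org.elasticsearch.index.mapper.DocumentMapper;
import org.elasticsearch.index.mapper.MapperService;
import org.elasticsearch.indices.IndicesService;

// Illustrative wrapper around the one-liner above.
static DocumentMapper mapperFor(IndicesService indicesService, Index indexObj, String type) {
    // Throws IndexNotFoundException if this node has no IndexService for the index
    // (unlike indexService(), which returns null instead).
    IndexService indexService = indicesService.indexServiceSafe(indexObj);
    MapperService mapperService = indexService.mapperService();
    // Returns null if no mapping exists for the given type.
    return mapperService.documentMapper(type);
}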

Please suggest.

One other thing worth mentioning is that when the issue manifested, we did try to look at the current tasks in the cluster using the _tasks endpoint.

Below are the results of two requests carried out several minutes apart, and as you can see many of the tasks on the non-master nodes look the same; only the IDs of the task/action "cluster:monitor/tasks/lists[n]" seem to differ when comparing the two results.
