CAT api doesn't respond


(Sumanth Reddy) #1

Hi, I have a 4 node elasticsearch cluster on version 5.6.9, all running on containers with persistent backend and with a haproxy load balancer infront. At times there seems to be issue with only CAT api not responding. I'm able to get response for all API(cluster, indices etc) but no response only for _cat api endpoints. The issue goes away once i restart all the nodes/containers in the cluster!! I have tried reaching the node directly in cluster by passing the haproxy LB using the ip address , but the result is the same.

I have tried all endpoints under _cat with no luck.

There are multiple but similar 4 node clusters running same docker image versions of 5.6.9 with this issue appearing time and time again across cluster at different times.

Any help/pointers appreciated. Thanks.


(David Turner) #2

This seems strange. There's nothing particularly unusual about the _cat APIs: for instance GET _cat/health does almost exactly the same thing that GET _cluster/health does.

Many of the _cat APIs will contact all nodes in the cluster, and I therefore suspected node-to-node connectivity issues. But then I think this would affect other APIs like GET _nodes/stats. Could you confirm that this affects

GET _cat/health
GET _cat/nodes

but does not affect

GET _cluster/health
GET _nodes/stats

Are there any log messages you can share?


(Sumanth Reddy) #3

Thanks David. I would try this out the next time the issue surfaces. This is kinda intermittent.

With respect to endpoints, i didn't try out the _nodes/stats nor _cat/nodes, but i did try the below sequence.

GET _cluster/health - Works
GET _cat/recovery - No Response
GET _cat/shards - No Response

In regards to logs, i didn't find any logs with respect to this. Also is there any logging level which i need to set to understand more about my cluster?

Also restarting the cluster makes this go away. But one common thing i observed is that, there is always a container(node in this case) stuck while restarting the cluster during this issue. Had to force stop it.


(Sumanth Reddy) #4

Hi David,

We got the similar issue again and these are some of our observations while fixing at it.

As we previously found that CAT API’s were not responding while the cluster health endpoint was still Green. We found the following for various endpoints during this issue, as you suggested.

GET _cat/health - Works fine
GET _cat/nodes - Gives Time out

GET _cluster/health – Works fine
GET _nodes/stats – Gives Time out

Looks like the node status endpoints have issue responding back, as you previously suspected.

The only log messages are the usual warnings from the license service about the cluster being unlicensed.

Thank You.


(David Turner) #5

Ok, good, it's not just the _cat APIs. That would have been mysterious.

I would recommend sorting this out. A basic licence is free, and we do not test the behaviour of an unlicensed cluster very thoroughly so there's a possibility that this is related.

When you say "Gives Time out" can you share the exact output from Elasticsearch, bypassing the load balancer? Note that many clients and proxies have an inbuilt timeout, so it could be that you're hitting that and not seeing a more useful response from Elasticsearch. I mean these requests shouldn't be taking very long anyway, but since they are it's possible that the actual response will be more useful.

I would focus on GET _nodes or GET _nodes/stats since it will provide more detail than the _cat versions.


(Sumanth Reddy) #6

Sure David. Will try this out and let you know the findings. Thank you very much for the help.


(Sumanth Reddy) #7

I have updated the cluster with the basic license as suggested and the License related logs messages are not appearing anymore in the logs.

Below are the various details with IP's.

Load Balancer -
lb-1 172.19.122.4
lb-2 172.19.62.65

4 Node ES cluster -
node-1 172.19.68.62 --> master
node-2 172.19.68.215
node-3 172.19.2.18
node-4 172.19.82.19

I tried doing curl 172.19.2.18:9200/_cat/nodes bypassing the LB, but i didn't get any response even though i have waited for ~15 mins!!! I have tried with master node as well with same result.

When i changed the Log level to Debug, one log messages which i see consistently across the cluster where this issue exists is as below:

caught exception while handling client http traffic, closing connection [id: 0x47397fe6, L:/172.19.2.18:9200 - R:/172.19.122.4:52492]
 	
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:?]
	at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:?]
	at io.netty.buffer.PooledHeapByteBuf.setBytes(PooledHeapByteBuf.java:261) ~[netty-buffer-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1100) ~[netty-buffer-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:372) ~[netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]

This appears even if i do not hit the endpoints where we are seeing issue(viz., _nodes or _nodes/stats)

Let me know your thoughts.

Thank You.


(David Turner) #8

Thanks for the extreme patience :slight_smile: That's useful to know.

One possibility here is that the threadpool being used to serve these requests is exhausted doing something else, and that something else is stuck for some reason. Slightly annoyingly, the obvious API call to diagnose this further, GET _tasks, will also be blocked since it runs on the same threadpool.

However I think GET /_nodes/hot_threads?threads=9999 should still work, so please try running that and share the full output here. You might need to put it in a Gist since it will be quite long. If the hot threads API doesn't work either then you can use jstack on each node in turn to grab a thread dump.


(David Turner) #9

Sorry, I didn't answer this bit. That exception indicates that an HTTP client sent a request and then closed the connection before receiving a response. This is at least consistent with the behaviour that you're reporting, but I don't think it tells us anything new.


(Paul Krizak) #10

Hi, @DavidTurner, I work with @sumanthkumarc on the same team. One of our ES clusters hung this morning in the same way we have been discussing here. Per your suggestion, I did a GET /_nodes/hot_threads?threads=9999: https://gist.github.com/skaven81/38dea9faca3cec9bed65c43aa43fdfbb

As you suspected, GET /_tasks is blocked.


(David Turner) #11

Thanks, I think that helps.

The problematic node is {tomorrowland-scheduler-itcontracts-elasticsearch-4}{fVEHh1LjR9ybJuXz0FmRCw}{M7N6gcVxS-KDeIIhm18kbA}{172.19.194.197}{172.19.194.197:9300}{ml.max_open_jobs=10, ml.enabled=true} - we can see that all 5 management threads are blocked in IndexShard.translogStats waiting to acquire a lock on the translog. I also see three indexing threads that are blocked waiting for space in the sync queue, so there must be at least 1024 pending sync requests.

The thread that holds the lock seems to be stuck waiting for I/O to complete, writing a translog checkpoint:

    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[tomorrowland-scheduler-itcontracts-elasticsearch-4][bulk][T#3]'
     10/10 snapshots sharing following 39 elements
       sun.nio.ch.FileDispatcherImpl.force0(Native Method)
       sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
       sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388)
       org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:131)
       org.elasticsearch.index.translog.TranslogWriter.writeCheckpoint(TranslogWriter.java:312)

I suspect slow storage, because this shouldn't be taking ≥500ms to complete. Are you using local disks or network-attached storage? What does the latency on your indexing requests look like?

I also see a synced flush that's blocked on the translog lock. Synced flushes are triggered when a shard becomes inactive, or via the API. This shard doesn't look inactive, so I think this is coming from a client. Flushes are quite expensive, and rarely necessary - why are you triggering this? Also a synced flush is fairly pointless on an active shard since there's a good chance that it won't succeed on all shards and, even if it does, the sync marker will be wiped out on the next document that's indexed.


(Sumanth Reddy) #12

Hi David,

Sorry that i wasn't able to reply to this earlier.

We didn't initiate any synced flush.

You were right. We are using the NFS storage for our backends and probably the reason why some threads might have got stuck. We found that the Elasticsearch node container itself is getting stuck in that hosts and had to reboot VM's.

I'm marking your above reply as solution since it helped us figure out exactly which node has the issue.

Thank you. You were extremely helpful with proper diagnosis and timely replies.