IOException: Too many open files - TCP input

We're getting the following exception when using the Logstash TCP input. Elastic Stack 5.6.4 running on CentOS 7.4.

[2017-11-10T23:59:58,325][WARN ][io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) ~[?:1.8.0_151]
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) ~[?:1.8.0_151]
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) ~[?:1.8.0_151]
at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135) ~[netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:571) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:512) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:426) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:398) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:877) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

When we start seeing these exceptions, the TCP input seems to shut down; at the very least, no events are processed.

Running lsof | grep "logstash.*TCP" | wc -l shows an increasing number of open files

lsof | grep "logstash.*TCP" shows lots of entries that look like this:
.monitor 14308 14358 logstash 999u IPv6 1331290508 0t0 TCP servername.domain.local:port->clientip:port (ESTABLISHED)

It seems connections get stuck in the ESTABLISHED state. netstat shows no connections stuck in CLOSE_WAIT.

On the client side, connections seem to be closed appropriately, and the clients only report a few open connections.

Any tips on how to solve this?


Hello, try increasing the max open files limit on your OS, as described in https://www.tecmint.com/increase-set-open-file-limits-in-linux/
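
For a quick sanity check before changing anything, something like this shows the limits the process is actually running with (replace LogstashPID with the real process ID):

ulimit -n                                       # soft limit of the current shell, not necessarily Logstash's
grep 'Max open files' /proc/LogstashPID/limits  # the limits the running Logstash process really has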

Thanks, but that would not fix the problem if the count just keeps growing, would it? How many open files is a reasonable amount?

I'm not too familiar with Linux configuration, but the lsof | grep "logstash.*TCP" | wc -l command returns over 100,000 by the time the mentioned exception starts showing up. However, from what I understand Logstash has a default open file limit of 16384, so maybe that command measures something other than what I think.

Ahhhh, I thought I was the only one!

We see our logstash come down, with the same log error.

Before the Netty exception fires (in fact, for hours before it appears), we get stacks of these in the log:

[2017-11-24T18:12:59,811][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"http://<IP of Elasticsearch>:9200/_bulk"}

Then we get the netty error.

How do we get guidance on:

a) What logstash monitor url can show us how many open files logstash is holding?

b) What monitoring URLs can show us that logstash is in this hosed state?

c) Aside from the OS config, what do we need to know to truly solve this?

Can you review my questions in IOException: Too many open files - TCP input and suggest how we can attack this from all relevant angles?

Thanks!

OK, so we changed the soft and hard limits for the logstash user according to @nugusbayevkk's suggestion to
logstash soft nofile 500000
logstash hard nofile 500000

however, we got the same IOException again and the TCP input died. lsof -p LogstashPID | wc -l shows 16490, which is basically the default limit.

@Jose_Ramierez Apparently we had to change the config for the logstash service in systemd /etc/systemd/system/logstash.service and set LimitNOFILE=50000.
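
Roughly what we did, for anyone following along (a sketch; a drop-in override created with systemctl edit logstash would also work and survives package upgrades):

# in /etc/systemd/system/logstash.service, under the [Service] section:
#   LimitNOFILE=50000
sudo systemctl daemon-reload
sudo systemctl restart logstash
# confirm the limit the process actually got (replace LogstashPID with the real PID):
grep 'Max open files' /proc/LogstashPID/limits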

Now, after a few days, lsof -p LogstashPID | wc -l gives 38783. So the number of open files just keeps growing and growing, and I expect the TCP input to ultimately crash with the same exception.

To confirm this is an issue with the TCP input, we tried disabling that input, and after that the open file count was stable.

And by the way, we also tried upgrading our full Elastic Stack to 6.0 without any resolution of this issue.

From my observations, the above is not the key measure.

Here is what I am seeing:

[root@localhost ~]# curl -XGET 'localhost:9600/_node/stats/process?pretty'
{
  "host" : "localhost",
  "version" : "5.6.4",
  "http_address" : "127.0.0.1:9600",
  "id" : "82369e7f-0a82-4717-8944-5b9731d958ce",
  "name" : "logstashPROD",
  "process" : {
    "open_file_descriptors" : 14000,
    "peak_open_file_descriptors" : 14000,
    "max_file_descriptors" : 16384,
    "mem" : {
      "total_virtual_in_bytes" : 4886056960
    },
    "cpu" : {
      "total_in_millis" : 10504830,
      "percent" : 0,
      "load_average" : {
        "1m" : 0.0,
        "5m" : 0.01,
        "15m" : 0.0
      }
    }
  }
}

Note "open_file_descriptors" and the "peak_open_file_descriptors" , we see those spike over 1000 (when they are normally between 100 and 500). When that measure spikes to four or five digits, logstash is on its way down (or is already down)

When the above counter is at 5 digits, the "lsof -p LogstashPID | wc -l" command still returns a much smaller number (under 400).

For example, in the case above, logstash was down by the time that hit 10000. (Probably much earlier.)
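
A crude way to keep an eye on that counter in the meantime, assuming jq is installed on the box:

watch -n 60 "curl -s localhost:9600/_node/stats/process | jq .process.open_file_descriptors"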

But I cannot yet figure out what to do from here.

It is clearly a logstash bug, but no one at elastic seems to care.

What inputs are you using?

Our only input is a TCP listener:

input {
  tcp {
    port => 1233
    codec => "line"
  }
}

@Jose_Ramierez what application is sending messages to your TCP port 1233?

what application is sending messages to your TCP port 1233?

Can you help me understand why this is relevant?

(Anything can send messages there, they either parse or fail to parse, and that is that... per logstash design, right?)

I don't understand why the sending app has any meaning. I could be using curl, or a c++ thing, or anything to open a socket and send stuff.

Thank you.

I still do not see how that is relevant to this problem.

Problem: logstash stops listening on 1233, without clear errors.

If clients (the systems sending data to 1233) failed to close sockets, logstash would eventually run out of resources, but that should be logged properly, right?

The clients are remote, not on same machine.

OK, I am going to investigate this possibility.

Let me rephrase this.

What command would I use on the logstash box to see that remote systems are holding open sockets and not closing them?

I tested this case.

Here is a small Java TCP client:

import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

public class SomeApp {
    public static void main(String[] args) throws IOException {
        // open 100 connections to the TCP input without ever closing them
        for (int i = 0; i < 100; i++) {
            Socket socket = new Socket("localhost", 31311);
            DataOutputStream dos = new DataOutputStream(socket.getOutputStream());
            dos.writeUTF("hi i am test bot");
            dos.flush();
            // socket.close();  // leaving this commented out leaks one file descriptor per iteration on the server
        }
    }
}

With this code the count of open file descriptors keeps increasing.
If you uncomment socket.close(), the open file descriptor count won't grow.

You can use
lsof -p LogstashPID | wc -l
as described earlier. Run it from crontab, and if the count reaches or exceeds some threshold you can alert on it.
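
As a rough sketch of what that could look like (the threshold, script path, and use of logger are assumptions, adjust to your environment):

#!/bin/sh
# fd-check.sh -- hypothetical helper: warn when Logstash holds too many open files
PID=$(pgrep -f logstash | head -1)
COUNT=$(lsof -p "$PID" | wc -l)
THRESHOLD=12000   # assumption: set this a bit below the configured nofile limit
if [ "$COUNT" -ge "$THRESHOLD" ]; then
    echo "logstash ($PID) has $COUNT open file descriptors" | logger -t logstash-fd-check
fi

# example crontab entry, every 5 minutes:
# */5 * * * * /usr/local/bin/fd-check.sh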

If you have a monitoring system, for example Zabbix, you can configure a Zabbix trapper and set up some alert rules there.

So, today I was debugging this issue with tcpdump and saw that no FIN or RST was received on the Logstash side. On the sending side (an F5 BigIP) they did not see any connection leakage and everything seemed normal. So an intermediary was causing the problem.
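
For reference, the kind of capture that showed it looked roughly like this (substitute your interface and TCP input port; 1233 is just the example port from earlier in the thread):

# show only FIN/RST packets on the input port; seeing none arrive on the Logstash side pointed at the path in between
tcpdump -ni eth0 'tcp port 1233 and tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'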

Further investigation with our network team uncovered that some packets were being blocked in the firewall. Apparently the connection was first set up as syslog, and further down the line, when we switched to raw TCP, the firewall rule was not updated. This meant the data was being received but the FIN packets were blocked.

After allowing any protocol on the designated port, not only syslog, the connection leak is fixed!

We used this command on the server:

ls -al /proc/<PID>/fd |wc -l

To see that the file/handle count was out of control.
In our case, our code that pushes data into logstash had a flaw and opened too many socket connections.
Solved!
