IOException: Too many open files - TCP input

flalar · November 14, 2017, 12:41pm

We're getting the following exception when using Logstash tcp input. Elastic stack running 5.6.4 on Centos 7.4

[2017-11-10T23:59:58,325][WARN ][io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) ~[?:1.8.0_151]
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) ~[?:1.8.0_151]
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) ~[?:1.8.0_151]
at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135) ~[netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:571) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:512) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:426) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:398) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:877) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.1.3.Final.jar:4.1.3.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

When we start seeing these exceptions the TCP input seems to shut down. No events are at least prosessed.

Running lsof | grep "logstash.*TCP" | wc -l shows an increasing number of open files

lsof | grep "logstash.*TCP" shows lots of files looking like this
.monitor 14308 14358 logstash 999u IPv6 1331290508 0t0 TCP servername.domain.local:port->clientip:port (ESTABLISHED)

Seems connections get stuck in an established state. Netstat shows no connections stuck in close wait state.

On the client, connections seems to be closed appropriately and they only report a few open connections.

Any tips how how to solve this?

nugusbayevkk · November 15, 2017, 12:29pm

hello, try to increase max open files on your OS by note described in https://www.tecmint.com/increase-set-open-file-limits-in-linux/

flalar · November 20, 2017, 7:16pm

Thanks, but that would not fix the problem if it just keeps growing would it? How many open files is a reasonable amount?

I'm not to familiar with linux configuration but lsof | grep "logstash.*TCP" | wc -lt command returns over 100.000 when the mentioned exception starts showing up. However from what I understand logstash has the default limit for open files of 16384. Maybe the first command does something else than what i think.

Jose_Ramierez · November 27, 2017, 3:29am

Ahhhh, I thought I was the only one!

We see our logstash come down, with the same log error.

Before the netty exception fires in the log (in fact, for hours before the netty exception appears in the log), we get stacks of these in the log:

[2017-11-24T18:12:59,811][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"http://<IP of Elasticsearch>:9200/_bulk"}

Then we get the netty error.

How do get guidance on:

a) What logstash monitor url can show us how many open files logstash is holding?

b) What monitor urls can see that logstash is in this hosed state?

c) Aside from the OS config, what do we need to know to truly solve this?

Jose_Ramierez · November 27, 2017, 3:30am

Can you review my q's in IOException: Too many open files - TCP input and suggest how we can attack this from all relevant angles?

Thanks!

flalar · November 27, 2017, 7:19pm

Ok, so we changed the soft and hard limits for logstash user according to @nugusbayevkk suggestion to
logstash soft nofile 500000
logstash hard nofile 500000

however got the same IOException again and the TCP input dies and lsof -p LogstashPID| wc -l shows 16490. Which is the default

@Jose_Ramierez Apparently we had to change the config for the logstash service in systemd /etc/systemd/system/logstash.service and set LimitNOFILE=50000.

Now after a few days lsof -p LogstashPID| wc -l gives 38783. So the number of open files just keeps growing and growing and I expect the TCP Input too ultimately crash with the same exception.

To ensure this is an issue with the TCP input we tried to disable this input and after that open files was stable.

And btw we also tried to upgrade our full elastic stack to 6.0 without any resolution of this issue.

Jose_Ramierez · November 27, 2017, 9:30pm

From my observations, the above is not the key measure.

Here is what I am seeing:

[root@localhost ~]# curl -XGET 'localhost:9600/_node/stats/process?pretty'
{
  "host" : "localhost",
  "version" : "5.6.4",
  "http_address" : "127.0.0.1:9600",
  "id" : "82369e7f-0a82-4717-8944-5b9731d958ce",
  "name" : "logstashPROD",
  "process" : {
    "open_file_descriptors" : 14000,
    "peak_open_file_descriptors" : 14000,
    "max_file_descriptors" : 16384,
    "mem" : {
      "total_virtual_in_bytes" : 4886056960
    },
    "cpu" : {
      "total_in_millis" : 10504830,
      "percent" : 0,
      "load_average" : {
        "1m" : 0.0,
        "5m" : 0.01,
        "15m" : 0.0
      }
    }
 [ }
}

Note "open_file_descriptors" and the "peak_open_file_descriptors" , we see those spike over 1000 (when they are normally between 100 and 500). When that measure spikes to four or five digits, logstash is on its way down (or is already down)

When the above counter is at 5 digits, the "lsof -p LogstashPID| wc -l" command still returns a much smaller number (under 400).

For example, in the case above, logstash was down by the time that hit 10000. (Prob much earlier)

But I cannot yet figure out what to do from here.

It is clearly a logstash bug, but no one at elastic seems to care.

flalar · November 27, 2017, 9:43pm

What inputs are you using?

Jose_Ramierez · November 27, 2017, 10:30pm

Our only input is a TCP listener:

input {
  tcp {
    port => 1233
	codec => "line"
  }
}

nugusbayevkk · November 29, 2017, 7:04pm

@Jose_Ramierez what application is sended messages to your tcp port 1233?

Jose_Ramierez · November 29, 2017, 7:17pm

what application is sended messages to your tcp port 1233?

Can you help me understand why this is relevant?

(Anything can send messages there, they either parse or fail to parse, and that is that... per logstash design, right?)

I don't understand why the sending app has any meaning. I could be using curl, or a c++ thing, or anything to open a socket and send stuff.

nugusbayevkk · November 29, 2017, 7:21pm

Jose_Ramierez · November 29, 2017, 7:44pm

Thank you.

I still do not see how that is relevant to this problem.

Problem: logstash stops listening on 1233, without clear errors.

If clients (the systems sending data to 1233) failed to close sockets, logstash would eventually run out of resources, but that should be logged properly, right?

The clients are remote, not on same machine.

OK, I am going to do some investigation using this possibility.

Jose_Ramierez · November 29, 2017, 7:44pm

Let me rephrase this.

What command would I use on the logstash box to see that remote systems are holding open sockets and not closing them?

nugusbayevkk · November 29, 2017, 8:17pm

i test this case,

here some java tcp client:

public class SomeApp {
public static void main(String args) throws IOException {
for (int i=0; i<100; i++) {
Socket socket = new Socket("localhost", 31311);
DataOutputStream dos = new DataOutputStream(socket.getOutputStream());
doc.writeUTF("hi i am test bot");
doc.flush();
// socket.close();
}
}
}

with this case count of file descriptors will be increasing .
if you uncomment socket.close(), so open file descriptors won't increase.

nugusbayevkk · November 29, 2017, 8:23pm

you can use
lsof -p LogstashPID| wc -l
as described earlier , run with crontab, if count reach or exceed some threshold, you can alert.

if you have some monitorinfg system, like as example, zabbix, you can configure zabbix trapper and configure there some alert rules

flalar · November 30, 2017, 1:53pm

So, I was today debugging this issue with tcpdump and saw that no FIN or RST was received on the logstash side. On the sending side from F5 BigIp they did not see any connection leakage and everything seems normal. So an intermediary was causing problems.

Further investigations with our network team uncovered that some packets were being blocked in the firewall. Apparently the connections was first set up as syslog, and then further down the line when switching to raw tcp the firewall was not updated. This lead to the data was being received but the FIN packets was not received as they were blocked.

After allowing any protocol on the designated port, not only syslog , the connection leak is fixed!!!

Jose_Ramierez · December 13, 2017, 9:50pm

We used this command on the server:

ls -al /proc/<PID>/fd |wc -l

To see that the file/handle count was out of control.
In our case, our code that pushes data into logstash had a flaw and opened too many socket connections.
Solved!

system · January 10, 2018, 9:51pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Too many open files. Logstash 6.1 Logstash	2	2646	January 25, 2018
TCP Input "Too many files open" IOException Logstash	6	968	February 3, 2022
Logstash crashing with "Too many files open" Logstash	9	6573	July 6, 2017
Logstash Too many open files Logstash	8	10077	July 6, 2017
Too many open files in Elasticsearch Elasticsearch	4	2066	July 5, 2017

IOException: Too many open files - TCP input

Related topics