Hi, here I am again with some more data. Connecting FB itself to Elastic helped me get more information.
Among other messages I found a few rare messages of this type:
Connecting error publishing events (retrying): dial tcp logstash_ip:5044: connectex: The remote system refused the network connection.
Failed to publish events caused by: read tcp filebeat_ip:1579->logstash_ip:5044: wsarecv: An established connection was aborted by the software in your host machine.
Failed to publish events caused by: read tcp filebeat_ip:1813->logstash_ip:5044: i/o timeout
They are very rare (about 10 a day). What do they mean? Are they significant?
I noticed that EOF comes from only one machine at a time. During the last 24h I got about 1000 EOF errors. The connection errors above came from all machines (they are rare, just 14 in 24h, and randomly distributed, so I think they are understandable). EOF, on the other hand, comes from one machine at a time. Sometimes 1 or 2 errors come from all of them, I think due to connection problems, which is again understandable, but the majority, 990 out of 1000, come from one single machine at a time: last night it was machine 14 producing EOF, the night before it was machine 22. The errors always go away in the morning, when the data the machines collect starts growing again.
Session 1, night between the 12th and 13th of October:
machine blue was throwing EOF at a rate of 35 per 15 mins, 1000 in total in one night.
Session 2, night between the 13th and 14th:
machine purple was throwing EOF at a rate of 30 per 15 mins, 750 in total in one night.
Then we restarted logstash, so there is a small peak in the afternoon (understandable, because logstash was off for about 30 mins).
Session 3, 3 nights from the 14th to the 17th:
machine green was throwing EOF at a rate of 35 per 15 mins, 3000 in total, about 1000 per night.
Here is the same chart with connection errors instead of EOF.
As you can see, the machine that is having connection problems is the same one that is producing EOF: blue in session 1, purple in session 2, green in session 3. But I don't believe that the connection errors are the source of the problem. @steffens am I missing something? Is there any other check I can do to find useful information?
What does connecting FB to elastic mean exactly? Is there even a Logstash running on those machines?
The TCP RST from LS to FB is super interesting. In TCP there are mostly 3 message types regarding connection state you may want to check for: SYN (synchronize, create connection), FIN (finish, tear down connection) and RST (reset, e.g. abort connection setup, data loss, invalid connection). You may want to look for these messages from both endpoints (via tcpdump), filebeat and logstash. One likely scenario is that filebeat sends data and gets an OK from logstash. In the meantime some device (or firewall rule) closes the connection on the LS side, just before filebeat tries to push another batch of events to LS. When filebeat then tries to push, the LS server returns an RST packet, as the connection no longer exists on the LS side => EOF in filebeat due to a send on an invalidated TCP connection. You should see a SYN packet shortly after the RST, which is filebeat trying to reconnect.
Comparing SYN/FIN/RST packets in the dumps from both endpoints, do you see exactly the same pattern of packets sent/received in both dumps at about the same time?
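To look for the RST-then-SYN pattern described above in a capture, something like the following could be used. It is only a minimal sketch: it assumes the default text output of `tcpdump -n` for IPv4 (time of day, `IP src > dst: Flags [...]`), the helper names `parse_line` and `rst_followed_by_syn` are made up for illustration, and the addresses in the sample are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape (default `tcpdump -n` text output, IPv4):
# 02:10:00.100500 IP 10.0.0.9.5044 > 10.0.0.5.1579: Flags [R], seq 1, win 0, length 0
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]+)\]"
)

def parse_line(line):
    """Return a dict for one tcpdump line, or None if it does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    flags = m.group("flags")
    if flags == "S":          # bare SYN: a new connection attempt
        kind = "SYN"
    elif "R" in flags:        # RST, with or without ACK
        kind = "RST"
    elif "F" in flags:        # FIN, normal teardown
        kind = "FIN"
    else:
        kind = None           # data, pure ACK, SYN-ACK, ...
    # Time of day only: a capture that crosses midnight needs extra care.
    ts = datetime.strptime(m.group("ts"), "%H:%M:%S.%f")
    return {"ts": ts, "src": m.group("src"), "dst": m.group("dst"), "kind": kind}

def host_of(endpoint):
    """Strip the trailing port from an 'a.b.c.d.port' endpoint."""
    return endpoint.rpartition(".")[0]

def rst_followed_by_syn(lines, window=timedelta(seconds=5)):
    """Pair each RST with the first SYN sent back to the resetting endpoint
    by the same host within `window` (the client port changes on reconnect)."""
    events = [e for e in map(parse_line, lines) if e and e["kind"]]
    pairs = []
    for i, rst in enumerate(events):
        if rst["kind"] != "RST":
            continue
        for later in events[i + 1:]:
            if later["ts"] - rst["ts"] > window:
                break
            if (later["kind"] == "SYN"
                    and later["dst"] == rst["src"]
                    and host_of(later["src"]) == host_of(rst["dst"])):
                pairs.append((rst, later))
                break
    return pairs

# Hypothetical capture: LS (10.0.0.9) resets, filebeat (10.0.0.5) reconnects.
SAMPLE = """\
02:10:00.100000 IP 10.0.0.5.1579 > 10.0.0.9.5044: Flags [P.], seq 1:101, ack 1, win 256, length 100
02:10:00.100500 IP 10.0.0.9.5044 > 10.0.0.5.1579: Flags [R], seq 1, win 0, length 0
02:10:01.200000 IP 10.0.0.5.1813 > 10.0.0.9.5044: Flags [S], seq 500, win 64240, length 0
02:10:01.200400 IP 10.0.0.9.5044 > 10.0.0.5.1813: Flags [S.], seq 900, ack 501, win 65535, length 0
""".splitlines()

for rst, syn in rst_followed_by_syn(SAMPLE):
    print("RST from", rst["src"], "then SYN from", syn["src"])
```

An RST that is never followed by a SYN, or a SYN with no preceding FIN/RST, would be worth a closer look as well.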
Sorry, bad wording: it means collecting the filebeat logs in elastic, i.e. using filebeat to read its own log and pass it to logstash and then to elastic.
I will try that test. On LS, after each RST there is a SYN.
What I don't get is why, considering that all machines are in the same network (some of them are VMs on the same host), the error only happens on one machine at a time, only during the night, and stops in the morning.
"what I don't get is why considering that all machines are in the same network"
No one is assuming that filebeat and LS are in the same network. This is exactly why we want to compare the TCP dumps from both endpoints. Assuming they are not in the same network, we want to compare the dumps to check whether something connecting the networks/machines is interfering with the TCP connection state. Suddenly getting an RST on an active connection could be a tell-tale sign that the TCP state machines of the two hosts are out of sync, e.g. one thinks the connection is closed and the other does not.
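As a rough illustration of that comparison, the sketch below counts SYN/FIN/RST packets per direction in each dump and reports the ones that appear on only one side. It assumes the text output of `tcpdump -n` and that both hosts see the same addresses and ports (no NAT in between); the function names and the sample addresses are hypothetical.

```python
import re
from collections import Counter

# Assumed fragment of a `tcpdump -n` line: "IP src > dst: Flags [R.], ..."
LINE_RE = re.compile(r"IP (\S+) > (\S+): Flags \[([^\]]+)\]")

def control_events(lines):
    """Count (src, dst, kind) for every packet carrying SYN, FIN or RST."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        src, dst, flags = m.groups()
        for letter, kind in (("S", "SYN"), ("F", "FIN"), ("R", "RST")):
            if letter in flags:
                counts[(src, dst, kind)] += 1
    return counts

def only_on_one_side(fb_lines, ls_lines):
    """Return (seen only in the filebeat dump, seen only in the logstash dump)."""
    fb = control_events(fb_lines)
    ls = control_events(ls_lines)
    # Counter subtraction keeps only the positive differences.
    return fb - ls, ls - fb

# Hypothetical dumps: filebeat receives an RST that logstash never sent,
# which would point at a device in between.
FB_DUMP = [
    "02:10:00.100000 IP 10.0.0.5.1579 > 10.0.0.9.5044: Flags [S], seq 1, win 64240, length 0",
    "02:15:00.000000 IP 10.0.0.9.5044 > 10.0.0.5.1579: Flags [R], seq 1, win 0, length 0",
]
LS_DUMP = [
    "02:10:00.100200 IP 10.0.0.5.1579 > 10.0.0.9.5044: Flags [S], seq 1, win 64240, length 0",
]

fb_only, ls_only = only_on_one_side(FB_DUMP, LS_DUMP)
print("only in filebeat dump:", dict(fb_only))
print("only in logstash dump:", dict(ls_only))
```

If both results are empty, the two endpoints agree on which control packets were exchanged, and the next thing to compare would be their timing.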