Why does Logstash not restart from JDBC I/O errors?


#1

For some reason lately I've been experiencing JDBC input I/O errors in Logstash's logs, but Logstash evidently doesn't do anything when it encounters these. It just hangs and the pipeline can't continue anymore. I have to manually check if it's still processing and when I see Elasticsearch isn't indexing anymore I go back to Logstash, see the I/O error log, restart it and it works for another few million docs or so.

Couldn't Logstash see these errors and restart the connection automatically when it experiences them?


(Guy Boertje) #2

What version of the JDBC input are you using? bin/logstash-plugin list --verbose | grep jdbc or similar.

This version 4.3.0 was refactored to open and close connections for each trip to the db (each scheduled run)


JDBC can't connect anymore for no apparent reason
#3

logstash-filter-jdbc_streaming (1.0.3)
logstash-input-jdbc (4.3.2)

I could be wrong but I feel like this I/O error doesn't break the connection or something. I think I remember seeing it was still connected to the database I'm pulling from, but the logs were just stopped on that I/O error and nothing else was happening. I had to stop Logstash to kill the connection.


(Guy Boertje) #4

Is there any other backtrace info in the logs?
Is there any indication in the Postgres logs that it restarted a little bit before the log message.
Is there a network proxy or load balancer between LS and the DB server?

Some prior art:
Caused by: java.io.EOFException
https://www.postgresql.org/message-id/001701c53a90$cf879cc0$6600a8c0@kermit
https://www.postgresql.org/message-id/42540D4D.2000504%40hogranch.com
Caused by: java.net.SocketException: Socket closed


https://confluence.atlassian.com/confkb/getting-a-psqlexception-error-an-io-error-occured-while-sending-to-the-backend-due-to-corrupt-postgres-database-162988249.html -> advice: check that the postgres db is not crashing (hw failure)

You could try to set jdbc_properties via the sequel_opts setting...

  sequel_opts => {
    "jdbc_properties" => {
      "socketTimeout" => 20 #seconds
      "tcpKeepAlive" => true
    }
  }

See https://jdbc.postgresql.org/documentation/94/connect.html


#5

Is there any other backtrace info in the logs?

I've included our Postgres logs below.

Is there any indication in the Postgres logs that it restarted a little bit before the log message.

I don't think so, but I'm not sure. The logs I've included are where the errors start.

Is there a network proxy or load balancer between LS and the DB server?

No

2018-01-08 16:17:30 UTC:removed[8341]:LOG: server process (PID 21535) was terminated by signal 11: Segmentation fault

2018-01-08 16:17:30 UTC:removed[8341]:LOG: terminating any other active server processes

2018-01-08 16:17:30 UTC:removed(48672):removed@removed:[10278]:warning: terminating connection because of crash of another server process

2018-01-08 16:17:30 UTC:removed(48672):removed@removed:[10278]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2018-01-08 16:17:30 UTC:removed(48672):removed@removed:[10278]:HINT: In a moment you should be able to reconnect to the database and repeat your command.

repeat above logs a bunch of times

2018-01-08 16:17:30 UTC:removed[8341]:LOG: archiver process (PID 26438) exited with exit code 1

2018-01-08 16:17:30 UTC:removed(53072):removed@removed:[21450]:warning: terminating connection because of crash of another server process

2018-01-08 16:17:30 UTC:removed(53072):removed@removed:[21450]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2018-01-08 16:17:30 UTC:removed(53072):removed@removed:[21450]:HINT: In a moment you should be able to reconnect to the database and repeat your command.

then repeat below log a bunch of times

2018-01-08 16:17:30 UTC:removed(53166):removed@removed:[21537]:FATAL: the database system is in recovery mode

Do you think that this log:

2018-01-08 16:17:30 UTC:removed(53072):removed@removed:[21450]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

could be because I often run sudo systemctl stop logstash and maybe the Postgres database couldn't stop the large query very quickly so the processed was killed in a bad way (or something)?


(Guy Boertje) #6

Well it does look like your Postgres db is segfaulting.

Sequel uses a connection pool under the hood so I guess the jdbc input's connection is a pool connection lease. I think it is these connections that are throwing the PGException indirectly.


#7

I'm not familiar with the concept of segfaulting. Is that something caused by a bug in Postrgesql, or AWS RDS, or is it because of some misconfiguration on our part? Or it normal to see this if you overload the database somehow?


(Guy Boertje) #8

An abnormal obscure bug of some kind in Postgres I think. How often does it happen?

I am not a DBA type expert.

Try to upgrade to a more recent version of PG and the driver too.

Other than that, you will need to put your Sherlock Holmes hat on and search the internet.


#9

okay thank you


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.