Hi,
Just wondering if anyone else has hit the issue I've seen recently with Persistent Queues enabled, where the pipeline sometimes gets into an unrecoverable state and throws the message "cannot write to a closed queue".
We are running Java 14 and Logstash v7.9.1 on Windows Server 2016.
These are virtual machines; the disks provided are NVMe.
We have enabled the PQ retry option queue.checkpoint.retry, but that only retries once, after a 500 ms delay... no extra attempts and no way to increase the delay, from what I read in the source code.
The underlying disk performance is fine, with very low latency of around 1 ms.
From time to time, though, we have observed that the Windows file system takes a few seconds to register a file change. For example, renaming a file is fast most of the time, then randomly takes 2 seconds; repeat it and it's instantaneous.
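To see whether the rename outliers correlate with the checkpoint failures, a small timing probe can be run on the same volume the queue lives on. This is a diagnostic sketch, not Logstash code; the class and method names are invented for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Diagnostic sketch (not part of Logstash): times ATOMIC_MOVE renames in a
// loop so that outlier latencies, like the random 2-second renames described
// above, show up against the normally instantaneous case.
public class RenameLatencyProbe {

    // Performs a single atomic rename and returns how long it took, in ms.
    public static long timedAtomicMoveMillis(Path src, Path dst) throws IOException {
        long start = System.nanoTime();
        Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        // Ideally point this at the queue's own volume rather than a temp dir.
        Path dir = Files.createTempDirectory("pq-probe");
        for (int i = 0; i < 20; i++) {
            Path tmp = Files.createTempFile(dir, "checkpoint", ".tmp");
            Path dst = dir.resolve("checkpoint.head");
            long ms = timedAtomicMoveMillis(tmp, dst);
            if (ms > 100) {
                System.out.println("slow rename #" + i + ": " + ms + " ms");
            }
            Files.deleteIfExists(dst);
        }
        System.out.println("done");
    }
}
```

Running this repeatedly during the periods when the queue errors appear should show whether renames on that volume really do stall.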
Looking for ideas, and hopefully I'm not insane and the only sufferer.
try {
    Files.move(tmpPath, dirPath.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
} catch (IOException ex) {
    if (retry) {
        try {
            logger.error("Retrying after exception writing checkpoint: " + ex);
            Thread.sleep(500);
            Files.move(tmpPath, dirPath.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
        } catch (Exception ex2) {
            logger.error("Aborting after second exception writing checkpoint: " + ex2);
            throw ex;
        }
    } else {
        logger.error("Error writing checkpoint: " + ex);
        throw ex;
    }
}
queue.checkpoint.retry
| When enabled, Logstash will retry once per attempted checkpoint write for any checkpoint writes that fail. Any subsequent errors are not retried. This is a workaround for failed checkpoint writes that have been seen only on filesystems with non-standard behavior such as SANs and is not recommended except in those specific circumstances.
For what it's worth, we had fewer issues running Java 8 and Logstash v7.7.0.