Hi,
Ever since migrating from an in-house Check Point integration to the official Check Point integration for Elastic Agent, we've been getting millions of version_conflict_engine_exception errors every day.
Previously we were writing directly to an Elasticsearch index and were able to update existing documents. This approach no longer works, since Elastic Agent uses data streams, which are (mostly) append-only.
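To illustrate what I mean, here is a minimal sketch (assuming an unsecured test cluster on localhost:9200 and a data stream already created by the integration's index template; the index names and _id are placeholders): a plain index accepts a second write to the same _id as an update, while a data stream only accepts op_type create, so the same pattern returns a 409.

```python
import requests

ES = "http://localhost:9200"
DOC_ID = "example-fingerprint"  # stand-in for the SHA-1 fingerprint used as _id
doc = {"@timestamp": "2024-06-09T12:00:00Z", "message": "initial log"}

# Classic index: a second write to the same _id simply bumps the version.
requests.put(f"{ES}/my-old-index/_doc/{DOC_ID}", json=doc)
r = requests.put(f"{ES}/my-old-index/_doc/{DOC_ID}", json={**doc, "message": "update"})
print(r.json()["_version"])  # -> 2

# Data stream: only creates are allowed, so the same pattern conflicts.
requests.put(f"{ES}/logs-checkpoint.firewall-default/_create/{DOC_ID}", json=doc)
r = requests.put(f"{ES}/logs-checkpoint.firewall-default/_create/{DOC_ID}",
                 json={**doc, "message": "update"})
print(r.status_code)  # -> 409, version_conflict_engine_exception
```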
The most relevant part of the error messages caused by the Elastic Agent Check Point integration is as follows:
```
{
  "create"=>{
    "status"=>409,
    "error"=>{
      "type"=>"version_conflict_engine_exception",
      "reason"=>"[<SHA-1 fingerprint>]: version conflict, document already exists (current version [1])",
      "index_uuid"=>"<indexUuid>",
      "shard"=>"0",
      "index"=>".ds-logs-checkpoint.firewall-default-2024.06.09-000036"
    }
  }
}
```
From what I can tell, this might stem from a misunderstanding of how the semi-unified mode in Check Point's Log Exporter works, how the fingerprint is configured, and which timestamps Check Point uses in its updates.
The Setup instructions for the Elastic Agent integration specifically mention this setting to avoid fingerprint collisions:

"In some instances firewall events may have the same Checkpoint loguid and arrive during the same timestamp resulting in a fingerprint collision. To avoid this enable semi-unified logging[1] in the Checkpoint dashboard."
Check Point's official Log Exporter documentation[1] seems to have a different explanation of this setting, however:

"Log Exporter's new semi-unified mode correlates all previous logs into one, so the latest log always shows the complete data."
My theory is that the Elastic Agent integration assumes that only a single log entry will be sent for every unique loguid once semi-unified mode is enabled, while in reality Check Point always sends updates, regardless of this setting. The difference is that each update contains the complete log message when semi-unified mode is enabled, whereas otherwise only the deltas are sent, leaving it to the receiving system to reconstruct the complete log entry from the original message plus all deltas.
Part of the reason for this confusion, assuming this theory is correct, might be that Check Point re-uses the syslog timestamp from the original message for the updates it sends, even hours later.
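To make the timestamp issue concrete, here is a small sketch of the collision I suspect is happening. The exact fields and separator that go into the integration's SHA-1 fingerprint are an assumption on my part; the point is only that hashing values Check Point repeats verbatim in its updates produces the same _id twice.

```python
import hashlib

def fingerprint(loguid: str, syslog_timestamp: str) -> str:
    # Assumed inputs: whatever the real integration hashes, the update
    # repeats these values verbatim, so the digest cannot change.
    return hashlib.sha1(f"{loguid}|{syslog_timestamp}".encode()).hexdigest()

original_id = fingerprint("0x66659c40,0x0,0x12d4a8c0,0xc0000000", "Jun  9 12:00:00")
update_id   = fingerprint("0x66659c40,0x0,0x12d4a8c0,0xc0000000", "Jun  9 12:00:00")
print(original_id == update_id)  # True -> the second create is rejected with a 409
```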
From a quick diff between a randomly selected initial raw syslog message and the corresponding updated raw syslog message (which was dropped due to the indexing conflict), all key/value pairs are identical except for the following keys, which contain updated numerical values (a rough reproduction of the diff is sketched after the list):
sequencenum
aggregated_log_count
connection_count
duration
last_hit_time
lastupdatetime
update_count
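For reference, this is roughly how I produced the diff (heavily simplified; the key="value" layout and field values below are made up, not the real Log Exporter payload):

```python
import re

# Two made-up payloads standing in for the original and updated syslog messages.
first_raw  = 'loguid="0x66659c40" duration="10" update_count="1" src="192.168.1.10"'
second_raw = 'loguid="0x66659c40" duration="95" update_count="2" src="192.168.1.10"'

def parse_kv(raw: str) -> dict:
    return dict(re.findall(r'(\w+)="([^"]*)"', raw))

first, second = parse_kv(first_raw), parse_kv(second_raw)
changed = sorted(k for k in first if first[k] != second.get(k))
print(changed)  # ['duration', 'update_count'] -- only the counters differ
```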
I'm not sure if this integration can ever be compatible with data streams without either ignoring all updates or forcing them to get a unique _id (and accepting that there will then be duplicates instead of conflict errors), but the way the integration works now creates a lot of noise.
Comparing conflict errors with the number of correctly ingested documents over the last 24 hours, almost 20% of the write attempts we saw resulted in conflicts, though this will obviously vary a lot with session lengths and network usage patterns.
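The only direction I can think of is the unique-_id idea mentioned above: include a field that changes with every update, such as sequencenum or lastupdatetime, in the fingerprint so that each update gets its own _id, trading the 409s for duplicate documents. This is purely hypothetical, not something I believe the integration supports today:

```python
import hashlib

def fingerprint(*parts: str) -> str:
    return hashlib.sha1("|".join(parts).encode()).hexdigest()

loguid, syslog_ts = "0x66659c40,0x0,0x12d4a8c0,0xc0000000", "Jun  9 12:00:00"
print(fingerprint(loguid, syslog_ts, "1"))  # original message (sequencenum=1)
print(fingerprint(loguid, syslog_ts, "2"))  # update (sequencenum=2): new _id, no conflict
```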
If anyone has any ideas for how to solve this, or is aware of any settings I've missed in either Check Point or Elastic Agent, I'd be immensely grateful for any suggestions.
Since I was unable to post while linking to Check Point's official site, here is the relevant documentation:
[1] https://sc1.checkpoint.com/documents/R81/WebAdminGuides/EN/CP_R81_LoggingAndMonitoring_AdminGuide/Topics-LMG/Log-Exporter-Appendix.htm?TocPath=Log%20Exporter%7C_____9