Check Point integration incompatible with data streams, resulting in version_conflict_engine_exception

Hi,

Ever since migrating from an in-house developed Check Point integration to the official Check Point integration for Elastic Agent, we've been getting millions of version_conflict_engine_exception errors every day.

Previously we were writing directly to an Elasticsearch index and were able to update existing documents. This approach no longer works, since Elastic Agent uses data streams which are (mostly) append-only.

The most relevant part of the error messages caused by the Elastic Agent Check Point integration is as follows:

{
	"create"=>{
		"status"=>409,
		"error"=>{
			"type"=>"version_conflict_engine_exception",
			"reason"=>"[<SHA 1 fingerprint>]: version conflict, document already exists (current version [1])",
			"index_uuid"=>"<indexUuid>",
			"shard"=>"0",
			"index"=>".ds-logs-checkpoint.firewall-default-2024.06.09-000036"
		}
	}
}

From what I can tell, this might stem from a misunderstanding of how semi-unified mode in Check Point's Log Exporter works, combined with how the fingerprint is configured and which timestamps Check Point re-uses in its updates.

The setup instructions for the Elastic Agent integration specifically mention this setting to avoid fingerprint collisions:

In some instances firewall events may have the same Checkpoint loguid and arrive during the same timestamp resulting in a fingerprint collision. To avoid this enable semi-unified logging[1] in the Checkpoint dashboard.

Check Point's official Log Exporter documentation[1] seems to have a different explanation of this setting, however:

Log Exporter's new semi-unified mode correlates all previous logs into one, so the latest log always shows the complete data.

My theory is that the Elastic Agent integration assumes that only a single log entry will be sent for every unique loguid once semi-unified mode is enabled, while Check Point will always send updates - regardless of this setting. The difference is that all updates will contain the complete log message when semi-unified mode is enabled, while only the deltas will be sent otherwise - leaving it to the receiving system to reconstruct the complete log entry from the original message plus all deltas.

Part of the reason for this confusion, assuming this theory is correct, might be that Check Point re-uses the syslog timestamp from the original message for the updates it sends, even hours later.

From a quick diff between a randomly selected initial raw syslog message and the corresponding updated raw syslog message (which was dropped due to the indexing conflict), all key/value pairs are identical except for the following keys, which contain updated numerical values:

  • sequencenum
  • aggregated_log_count
  • connection_count
  • duration
  • last_hit_time
  • lastupdatetime
  • update_count
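To illustrate the collision, here is a minimal Python sketch (with made-up field values) of how a fingerprint over only @timestamp, loguid, time and segment_time behaves. Since none of the keys above participate in the hash, the original message and its later update produce the same _id. This is a simplification - the actual fingerprint processor's encoding and separator may differ:

```python
import hashlib

# The fields the integration's fingerprint processor appears to hash.
FINGERPRINT_FIELDS = ["@timestamp", "checkpoint.loguid",
                      "checkpoint.time", "checkpoint.segment_time"]

def fingerprint(doc):
    """SHA-1 over the concatenated fingerprint fields (simplified sketch;
    the real processor's encoding/separator may differ)."""
    joined = "|".join(str(doc.get(f, "")) for f in FINGERPRINT_FIELDS)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

# Made-up original entry; Check Point re-uses the original syslog
# timestamp for updates, so these four fields are identical in both.
original = {
    "@timestamp": "2024-06-09T12:00:00Z",
    "checkpoint.loguid": "{0x1,0x2,0x3,0x4}",
    "checkpoint.time": "1717934400",
    "checkpoint.segment_time": "1717934400",
    "checkpoint.duration": "5",
}
# The update only changes keys that are NOT part of the fingerprint.
update = {**original, "checkpoint.duration": "3600",
          "checkpoint.update_count": "2"}

print(fingerprint(original) == fingerprint(update))  # True -> 409 on create
```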

I'm not sure if this integration can ever be compatible with data streams without either ignoring all updates or forcing them to get a unique _id (and accepting that there will now be duplicates instead of conflict errors), but the way the integration works now creates a lot of noise.
When comparing conflict errors with the number of correctly ingested documents over the last 24 hours, almost 20% of the write attempts we saw resulted in conflicts - but this will obviously vary a lot based on session lengths and network usage patterns.

If anyone has any ideas for how to solve this, or is aware of any settings I've missed in either Check Point or Elastic Agent, I'd be immensely grateful for any suggestions.

Since I was unable to post while linking to Check Point's official site, here is the relevant documentation:

[1] https://sc1.checkpoint.com/documents/R81/WebAdminGuides/EN/CP_R81_LoggingAndMonitoring_AdminGuide/Topics-LMG/Log-Exporter-Appendix.htm?TocPath=Log%20Exporter%7C_____9

I do not use the Checkpoint integration, but the logic that generates the fingerprint used as the _id is the following:

  # Some log events lack loguid and time, so to avoid potential
  # collisions hash the complete line in those rare cases.
  - fingerprint:
      if: ctx.checkpoint?.loguid == null && ctx.checkpoint?.time == null
      fields:
        - event.original
      target_field: "_id"
      ignore_missing: true
  - fingerprint:
      if: ctx._id == null
      fields:
        - '@timestamp'
        - checkpoint.loguid
        - checkpoint.time
        - checkpoint.segment_time
      target_field: "_id"
      ignore_missing: true

It seems that only the fields @timestamp, checkpoint.loguid, checkpoint.time and checkpoint.segment_time are being used to generate the fingerprint.

Maybe adding checkpoint.lastupdatetime to the list would solve this issue?
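For example (untested, and assuming the integration maps the raw lastupdatetime key to checkpoint.lastupdatetime), the second processor could become:

```yaml
  - fingerprint:
      if: ctx._id == null
      fields:
        - '@timestamp'
        - checkpoint.loguid
        - checkpoint.time
        - checkpoint.segment_time
        # Assumption: differs between updates, so each gets a unique _id
        - checkpoint.lastupdatetime
      target_field: "_id"
      ignore_missing: true
```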

I would suggest that you open a GitHub issue about it in the integrations repository: GitHub - elastic/integrations: Elastic Integrations

Hi Leandro,

Thank you for the impressively quick reply! Much appreciated.

I could definitely override the fingerprint in a custom pipeline to get rid of the error messages, something I did consider as a temporary workaround, but that doesn't really solve the underlying issue of data streams being incompatible with the apparent need to update existing documents.

If I'm not mistaken, forcing unique fingerprints by including checkpoint.lastupdatetime would make the errors disappear, but it would introduce various new issues instead:

  • There would now be duplicate documents of the same session, meaning that all statistics and dashboards would have misleading and inflated numbers, unless all queries make sure to only fetch the most recent document for every unique event.id (copied from loguid in the raw syslog message)
  • Any relevant detection rules would potentially trigger multiple alerts for the same session, for example if suspicious/malicious IP addresses appear in the logs
  • Disk usage and insert-related resource usage would increase by a potentially substantial amount (~20% in my cursory benchmark), despite only a fraction of the document being changed
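On the first point, every query or visualization would need deduplication built in, for example with Elasticsearch field collapsing on event.id, keeping only the copy with the newest update time (a sketch, assuming checkpoint.lastupdatetime is mapped as a sortable field):

```
GET logs-checkpoint.firewall-*/_search
{
  "collapse": { "field": "event.id" },
  "sort": [ { "checkpoint.lastupdatetime": { "order": "desc" } } ]
}
```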

Since firewall / VPN logs are usually among the more noisy logs in a lot of environments, this would lead to a fairly noticeable increase in cost and extra load on the ingestion nodes, possibly necessitating a bump to a more expensive tier for Elastic Cloud customers.

Since queries would have to be customized to remove duplicates anyway, it might actually make more sense to disable semi-unified mode in this scenario, and merge the original documents with the most recent delta-update instead, to at least minimise the impact on the amount of additional storage required.
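As a rough illustration of that merge step (done outside Elasticsearch, e.g. in a preprocessing script), reconstructing the complete entry from the original message plus the latest delta could be as simple as overlaying the changed keys. The field names below are made up:

```python
def merge_delta(original: dict, delta: dict) -> dict:
    """Overlay a delta update onto the original parsed log entry.
    Keys present in the delta win; everything else is kept from the
    original, mirroring what a receiving system has to do when
    semi-unified mode is disabled."""
    merged = dict(original)
    merged.update(delta)
    return merged

# Made-up key/value pairs for illustration.
original = {"loguid": "{0x1}", "duration": "5", "src": "10.0.0.1"}
delta = {"loguid": "{0x1}", "duration": "3600", "update_count": "2"}
print(merge_delta(original, delta))
# {'loguid': '{0x1}', 'duration': '3600', 'src': '10.0.0.1', 'update_count': '2'}
```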

All that to say, you're probably right - I suppose raising a GitHub issue for this might be necessary. I haven't been able to come up with any good way to deal with this issue without either writing updates directly to the Elasticsearch index, or accepting the drawbacks introduced by indexing duplicate documents.

As it stands right now, I'm leaning towards silently dropping the updates as the temporary workaround and accepting that the numbers will be off for any affected sessions.
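For anyone curious, that drop could be done in a custom ingest pipeline (e.g. logs-checkpoint.firewall@custom). This is only a sketch, assuming the raw update_count key is mapped to checkpoint.update_count and that it is zero or absent on the initial message - something I would verify before relying on it:

```yaml
  # Assumption: only delta/semi-unified updates carry update_count > 0.
  - drop:
      if: "ctx.checkpoint?.update_count != null && ctx.checkpoint.update_count != '0'"
```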

Thanks again for your input, though! I'm still very interested in any additional thoughts or suggestions you or others might have, especially since I very well might have made a number of hasty assumptions.

I think there is a little confusion here: data streams are not incompatible with updating existing documents. The goal of the fingerprint processor is exactly this, to create a unique custom _id to avoid duplicates, and it is used in multiple integrations.

The issue you had is related to trying to update a document with the same _id before the previous request was completed; this can happen, for example, if multiple documents in the same batch request result in the same fingerprint.

Adding lastupdatetime to the fingerprint processor would reduce those cases.

The main issue here is on the Checkpoint side, in how it generates the logs, and there is not much you can do on the Elastic side.

I also collect logs from multiple firewall devices and since they are noisy and can take a lot of space, you need to make some choices according to your requirements.

For example, I collect logs from Fortigate with Logstash, and I choose to log both the session open and the session close, so for the same session I will have at least two logs, plus extra log lines with delta information while the session is active.

For some devices I choose to collect only session close logs.

In your case I would disable this semi-unified mode and also adjust the fingerprint processor to use the last update time.