Check Point integration incompatible with data streams, resulting in version_conflict_engine_exception

Hi,

Ever since migrating from an in-house developed Check Point integration to the official Check Point integration for Elastic Agent, we've been getting millions of version_conflict_engine_exception errors every day.

Previously we were writing directly to an Elasticsearch index and were able to update existing documents. This approach no longer works, since Elastic Agent uses data streams which are (mostly) append-only.

The most relevant part of the error messages caused by the Elastic Agent Check Point integration is as follows:

{
	"create"=>{
		"status"=>409,
		"error"=>{
			"type"=>"version_conflict_engine_exception",
			"reason"=>"[<SHA 1 fingerprint>]: version conflict, document already exists (current version [1])",
			"index_uuid"=>"<indexUuid>",
			"shard"=>"0",
			"index"=>".ds-logs-checkpoint.firewall-default-2024.06.09-000036"
		}
	}
}

From what I can tell, this might be due to a misunderstanding of how semi-unified mode in Check Point's Log Exporter works, how the fingerprint is configured, and which timestamps Check Point reuses in its updates.

The Setup instructions for the Elastic Agent integration specifically mention this setting to avoid fingerprint collisions:

In some instances firewall events may have the same Checkpoint loguid and arrive during the same timestamp resulting in a fingerprint collision. To avoid this enable semi-unified logging[1] in the Checkpoint dashboard.

Check Point's official Log Exporter documentation[1] seems to have a different explanation of this setting, however:

Log Exporter's new semi-unified mode correlates all previous logs into one, so the latest log always shows the complete data.

My theory is that the Elastic Agent integration assumes that only a single log entry will be sent for every unique loguid once semi-unified mode is enabled, while Check Point will always send updates - regardless of this setting. The difference is that all updates will contain the complete log message when semi-unified mode is enabled, while only the deltas will be sent otherwise - leaving it to the receiving system to reconstruct the complete log entry from the original message plus all deltas.

Part of the reason for this confusion, assuming this theory is correct, might be that Check Point re-uses the syslog timestamp from the original message for the updates it sends, even hours later.

From a quick diff between a randomly selected initial raw syslog message and the corresponding updated raw syslog message (which was dropped due to the indexing conflict), all key/value pairs are identical except for the following keys, which contain updated numerical values:

  • sequencenum
  • aggregated_log_count
  • connection_count
  • duration
  • last_hit_time
  • lastupdatetime
  • update_count

I'm not sure if this integration can ever be compatible with data streams without either ignoring all updates or forcing them to get a unique _id (and accepting that there will now be duplicates instead of conflict errors), but the way the integration works now creates a lot of noise.
When comparing conflict errors with the number of correctly ingested documents over the last 24 hours, almost 20% of the write attempts we saw resulted in conflicts - but this will obviously vary a lot based on session lengths and network usage patterns.

If anyone has any ideas for how to solve this, or is aware of any settings I've missed in either Check Point or Elastic Agent, I'd be immensely grateful for any suggestions.

Since I was unable to post while linking to Check Point's official site, here is the relevant documentation:

[1] https://sc1.checkpoint.com/documents/R81/WebAdminGuides/EN/CP_R81_LoggingAndMonitoring_AdminGuide/Topics-LMG/Log-Exporter-Appendix.htm?TocPath=Log%20Exporter%7C_____9

I do not use the Checkpoint integration, but the logic that generates the fingerprint used as the _id is this one:

  # Some log events lack loguid and time, so to avoid potential
  # collisions hash the complete line in those rare cases.
  - fingerprint:
      if: ctx.checkpoint?.loguid == null && ctx.checkpoint?.time == null
      fields:
        - event.original
      target_field: "_id"
      ignore_missing: true
  - fingerprint:
      if: ctx._id == null
      fields:
        - '@timestamp'
        - checkpoint.loguid
        - checkpoint.time
        - checkpoint.segment_time
      target_field: "_id"
      ignore_missing: true

It seems that only the fields @timestamp, checkpoint.loguid, checkpoint.time and checkpoint.segment_time are being used to generate the fingerprint.

Maybe adding checkpoint.lastupdatetime to the list would solve this issue?
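
For illustration, the second fingerprint processor could then look something like this (just a sketch; whether checkpoint.lastupdatetime is parsed and present at that point in the pipeline is an assumption to verify):

  - fingerprint:
      if: ctx._id == null
      fields:
        - '@timestamp'
        - checkpoint.loguid
        - checkpoint.time
        - checkpoint.segment_time
        # assumed to be present in the parsed document
        - checkpoint.lastupdatetime
      target_field: "_id"
      ignore_missing: true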

I would suggest that you open a GitHub issue about it in the integrations repository: GitHub - elastic/integrations: Elastic Integrations

Hi Leandro,

Thank you for the impressively quick reply! Much appreciated.

I could definitely override the fingerprint in a custom pipeline to get rid of the error messages, something I did consider as a temporary workaround, but that doesn't really solve the underlying issue of data streams being incompatible with the apparent need to update existing documents.
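
For reference, the workaround I considered looks roughly like this (just a sketch: I'm assuming the custom pipeline hook for this data stream is named logs-checkpoint.firewall@custom, that it runs after the integration's own fingerprint processors, and that the parsed field really is checkpoint.lastupdatetime):

PUT _ingest/pipeline/logs-checkpoint.firewall@custom
{
  "processors": [
    {
      "fingerprint": {
        "description": "Recompute _id so updates no longer collide with the original document",
        "if": "ctx.checkpoint?.loguid != null || ctx.checkpoint?.time != null",
        "fields": [
          "@timestamp",
          "checkpoint.loguid",
          "checkpoint.time",
          "checkpoint.segment_time",
          "checkpoint.lastupdatetime"
        ],
        "target_field": "_id",
        "ignore_missing": true
      }
    }
  ]
}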

If I'm not mistaken, forcing unique fingerprints by including checkpoint.lastupdatetime would make the errors go away, but various new issues would be introduced instead:

  • There would now be duplicate documents for the same session, meaning that all statistics and dashboards would have misleading and inflated numbers, unless all queries make sure to only fetch the most recent document for every unique event.id (copied from loguid in the raw syslog message) - see the query sketch after this list
  • Any relevant detection rules would potentially trigger multiple alerts for the same session, for example if suspicious/malicious IP addresses appear in the logs
  • Disk usage and insert-related resource usage would increase by a potentially substantial amount (~20% in my cursory benchmark), despite only a fraction of the document being changed
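
To illustrate the query gymnastics from the first point above, fetching only the most recent document per event.id could be done with field collapsing, something like this (a sketch; it assumes checkpoint.lastupdatetime is mapped as a sortable field, and it only deduplicates hits, not aggregations):

GET logs-checkpoint.firewall-default/_search
{
  "collapse": { "field": "event.id" },
  "sort": [
    { "checkpoint.lastupdatetime": { "order": "desc", "unmapped_type": "date" } }
  ]
}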

Since firewall / VPN logs are usually among the noisier logs in a lot of environments, this would lead to a fairly noticeable increase in cost and extra load on the ingestion nodes, possibly necessitating a bump to a more expensive tier for Elastic Cloud customers.

Since queries would have to be customized to remove duplicates anyway, it might actually make more sense to disable semi-unified mode in this scenario, and merge the original documents with the most recent delta-update instead, to at least minimise the impact on the amount of additional storage required.
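
As far as I can tell, the only way to do that kind of merge inside Elasticsearch would be to target the backing index directly, which the data streams documentation does permit for occasional updates. A rough sketch, with placeholder _id and values (the exact field names under checkpoint.* are an assumption):

# 1. Find which backing index currently holds the original document
GET logs-checkpoint.firewall-default/_search
{
  "query": { "ids": { "values": [ "<fingerprint-id>" ] } },
  "_source": false
}

# 2. Merge the changed keys from the delta into that document
POST .ds-logs-checkpoint.firewall-default-2024.06.09-000036/_update/<fingerprint-id>
{
  "doc": {
    "checkpoint": {
      "aggregated_log_count": "<new value>",
      "lastupdatetime": "<new value>"
    }
  }
}

That obviously wouldn't scale to the volumes we're seeing, so it's more of a thought experiment than a realistic option.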

All that to say, you're probably right - I suppose raising a GitHub issue for this might be necessary. I haven't been able to come up with any good way to deal with this issue without either writing updates directly to the Elasticsearch index, or accepting the drawbacks introduced by indexing duplicate documents.

As it stands right now, I'm leaning towards silently dropping the updates as the temporary workaround and accepting that the numbers will be off for any affected sessions.

Thanks again for your input, though! I'm still very interested in any additional thoughts or suggestions you or others might have, especially since I very well might have made a number of hasty assumptions.

I think there is a little confusion here: data streams are not incompatible with updating existing documents. The goal of the fingerprint processor is exactly this, to create a unique custom _id to avoid duplicates, and it is used in multiple integrations.

The issue you had is related to trying to update a document with the same _id before the previous request was completed; this can happen, for example, if multiple documents in the same batch request result in the same fingerprint.

Adding lastupdatetime to the fingerprint processor would reduce those cases.

The main issue here is on the Checkpoint side, in how it generates the logs, and there is not much you can do on the Elastic side.

I also collect logs from multiple firewall devices, and since they are noisy and can take a lot of space, you need to make some choices according to your requirements.

For example, I collect logs from Fortigate with Logstash and choose to log both the session opening and the session close, so for the same session I will have at least two logs, plus extra log lines with delta information while the session is active.

For some devices I choose to collect only session close logs.

In your case I would disable this semi-unified mode and also adjust the fingerprint processor to use the last update time.

I've been experimenting a bit more, and it seems like the version_conflict_engine_exception warnings might be related to using Logstash as the output of Elastic Agent, and "proxying" all Elastic Agent requests through the Logstash server (with no further processing).

Unfortunately, using the Elasticsearch output in Elastic Agent doesn't solve the issue of being unable to update documents either, even though it does seem to get rid of the noise.

Being able to update existing documents when using data streams seems to contradict the documentation in Data streams | Elasticsearch Guide [8.14] | Elastic, and answers provided elsewhere on the forum.

I agree that the fingerprint processor used to work exactly like this when we were writing directly to an index (as opposed to a data stream), and I can confirm that I'm able to update existing documents if I send API requests to the underlying (active) index.

Any updates / changes to existing documents using the same _id (generated by the fingerprint processor) and written to the data stream (logs-checkpoint.firewall-default), however, are either silently ignored (if Elastic Agent uses the default Elasticsearch output) or create version_conflict_engine_exception warnings (if Elastic Agent uses the Logstash output).

I can also reproduce the issue if I send a manual Index API or Bulk API request (containing a single document) to the data stream.
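
For completeness, that manual reproduction looks roughly like this (the _id and document body are placeholders):

# op_type "create" against the data stream returns 409 when the _id
# already exists in the current write index
PUT logs-checkpoint.firewall-default/_create/<existing-fingerprint-id>
{ "@timestamp": "2024-06-09T12:00:00Z", "message": "<updated raw syslog line>" }

# a plain index op (which could have overwritten the document) is
# rejected outright, since data streams only accept op_type "create"
PUT logs-checkpoint.firewall-default/_doc/<existing-fingerprint-id>
{ "@timestamp": "2024-06-09T12:00:00Z", "message": "<updated raw syslog line>" }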

No, this is most likely an unrelated issue. The updated logs often arrive hours - or even days - after the original document was successfully inserted, and I have confirmed this by manually generating a single "original" syslog message, waiting until it has been indexed (and appears in Kibana), and then manually generating a corresponding "updated" syslog message. Manual tests like these in an isolated environment (with no other activity) trigger the same version_conflict_engine_exception message.

It does look like Elastic Agent wouldn't have generated these errors if it were writing directly to Elasticsearch, though. I'm not yet sure if Elastic Agent does something clever to avoid generating the error message in the first place, or if it silently ignores these errors instead of logging them.

Out of curiosity, I also tested doing a manual rollover of the data stream after indexing an original document, before generating an updated syslog message. In this case, the updated document was now successfully written to the new index, since the _id was now "unique" in the new index. (But, of course, at this point I had two copies of the document, the original document in the old index and the updated document in the currently active index.)

Yes, but then the updates would no longer end up generating the same _id, so they would be treated as separate documents containing a lot of duplicate data in addition to the values that were updated.
This would get rid of the errors, but would end up costing us a lot more (because of the increased disk usage), in addition to the other drawbacks such as having to manually merge the related documents or ignore all the older documents related to the same session.

I still suspect that all Elastic Agent integrations which depend on fingerprinting to create unique _ids are incompatible with log sources that are designed to trigger updates for existing documents, unless there are integrations that are able to write directly to the underlying index.

If the goal is simply to avoid duplicate copies and ignore any updates to existing documents, fingerprinting combined with data streams works just fine (except for the noisy conflicts when using Logstash), but if the goal is to always keep the most recent version of a document, this is not the behaviour I would expect.