Endpoint Security Data (Rollup?)

Hi,

Hopefully this is the right place to post; if not, please move/direct me accordingly. :slight_smile:

I've enabled Endpoint Security on some of our systems, running under a Fleet-managed Elastic Agent.

Initially, the VM I was testing with was occasionally getting hammered by IO after I set up endpoint. I partially attributed this to it still being partly a testing setup (running two Docker Elasticsearch instances on a single node) and to the VM disk being rotational.

Switched the VM over to SSD and haven't had an issue since.

However, the underlying issue is the massive ingest of data related to endpoint security, and SSD storage is of course more costly. (I am aware that in my current semi-test configuration I'm somewhat needlessly duplicating data; the second instance and its data will eventually be moved to a different VM.)

The insight and metrics it provides are great, and I'm hoping to make good use of them, ideally including some anomaly detection and alerts.

Our business is in an industry (financial) which places a high value on security and being able to audit previous actions is important as well.

We need to balance insight and auditability with reasonable storage requirements. I'm not 100% sure of all the tools available for doing this, specifically with regard to endpoint security. (System metrics and Docker metrics, also collected by the Fleet-managed Elastic Agent, are a bit heavy too, but endpoint seems to be much heavier.)

Speaking of system/Docker metrics, I'm also not sure whether some data is duplicated between them and endpoint. Process data, for example, seems to be pushed by both metrics and endpoint. I'm not sure if that's the case and, if so, whether it can reasonably be de-duplicated without affecting the dashboards.

In my search, rollups seemed like a potential solution. Filtering also came up, but then we're just losing data. Is there another way to lower the sample rate of the data at times, or equivalently to filter older data to reduce its fidelity/sample count? I'm wondering if some metrics (whether pushed by endpoint or Docker/system metrics) have a higher sample rate than is needed.

Rollups seem to do what I want, at least in that they can limit the size/fidelity of older or less significant data.

I went to try and create a rollup in Kibana for endpoint security and wasn't completely sure how to proceed.

First, other than 'metrics-endpoint.metadata_current_default', which is only 14 docs / 148 KB, the only indexes that seem to relate to endpoint are 'hidden'.

If I show hidden indexes, I see '.ds-logs-endpoint.events.file-default-2022.05.27-000001', '.ds-logs-endpoint.events.process-default-2022.05.27-000001' and '.ds-logs-endpoint.events.network-default-2022.05.27-000001' as the biggest indexes with the highest doc counts (also '.ds-metrics-system.process-default-2022.05.27-000001', but I believe that one is more related to system metrics?).

Initially I went to put in endpoint as an index pattern, but then I wasn't sure how this would affect my ability to view data in Kibana under Security. I'm also not 100% sure how, or if, I can tell it to roll up only older data.

The rollup docs say 'we’d like to rollup these documents into hourly summaries, which will allow us to generate reports and dashboards with any time interval one hour or greater.' That sounds pretty close to what I'd like to do, at least on older data. What I'm not clear on is whether this will affect existing dashboards, whether new ones must be created or adjusted to read the rollup index, and whether the same applies to the views under Security in Kibana.

Thanks for your help;

  • one running out of space admin!

Also wondering, with regard to this kind of thing: many indexes seem to have a date in their name, which seems to refer to the date they were created. I'm not entirely clear on what triggers the creation of a new one (I've done a lot of reconfiguring and restarting), whether just restarting does it, and how this might affect my goals.
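A minimal sketch of how those index names appear to break down, assuming the documented data stream convention `.ds-<data-stream>-<yyyy.MM.dd>-<generation>`. Under that convention the date is when the backing index was created, and a new backing index comes from a rollover (by lifecycle policy or manual request) rather than from a restart. The function name `parse_backing_index` is hypothetical and only illustrates the naming scheme:

```python
import re
from datetime import date

# Backing index convention: .ds-<data-stream>-<yyyy.MM.dd>-<six-digit generation>
BACKING_INDEX = re.compile(
    r"^\.ds-(?P<stream>.+)-(?P<date>\d{4}\.\d{2}\.\d{2})-(?P<gen>\d{6})$"
)


def parse_backing_index(name):
    """Split a backing index name into data stream, creation date and generation."""
    match = BACKING_INDEX.match(name)
    if match is None:
        raise ValueError(f"not a data stream backing index name: {name!r}")
    year, month, day = (int(part) for part in match.group("date").split("."))
    return {
        "data_stream": match.group("stream"),
        "created": date(year, month, day),
        "generation": int(match.group("gen")),
    }


if __name__ == "__main__":
    info = parse_backing_index(
        ".ds-logs-endpoint.events.process-default-2022.05.27-000001"
    )
    print(info["data_stream"], info["created"], info["generation"])
```

A generation that stays at 1 would suggest that no rollover has happened yet, whatever restarts took place in between.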

I also seem to be getting errors like this on my rules..? Not sure why..? It seems like rules are no longer executing properly..? Tried deleting and re-adding the endpoint security integration for my fleet servers policy at least, didn't seem to help?

An error occurred during rule execution: message: "search_phase_execution_exception: "

Also tried adding an (external) Windows machine. It enrolled fine but went 'unhealthy' during updates. I figured this might be an external IP issue, so I added 127.0.0.1 to both the Fleet Server hosts and the outputs (the Windows machine is tunneled to the ES server, so 9200/8220 on its localhost is the same as on the Elasticsearch server).

Hey @forbiddenera ,

So, you have a couple of options:

  • I wouldn't recommend rollups for endpoint data. That functionality is targeted at metrics (numbers, etc.). You might want to look into transforms, which allow you to periodically group and aggregate data into a very lightweight index, and you can potentially discard the original data.
  • You can use Logstash with Fleet as of 8.2 (in beta). This will allow you to aggregate at the Logstash level, and you can also selectively send some data to very cheap storage, such as S3.
  • We have an enterprise feature called searchable snapshots, which essentially allows you to store data on cloud storage (or another cheap storage medium) but still search it as if it were on disk (with an impact on search speed, but sometimes this is a great tradeoff for older data).

That error you highlighted is because your cluster is overloaded and cannot keep up with the search demand, so your rules are timing out.
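To make the transform option a little more concrete, here is a rough sketch of a request body for the create-transform API (`PUT _transform/<transform_id>`) that summarises process events per hour, host and process name. The index names, the destination index and the choice of `host.name` / `process.name` as grouping fields are assumptions for illustration, not something confirmed for this cluster. The code only builds and prints the JSON and sends nothing:

```python
import json


def build_transform_body(source_index, dest_index, interval="1h"):
    """Build a pivot transform body that buckets events by hour, host and process."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        # Continuous mode: check for new documents every 5 minutes.
        "frequency": "5m",
        "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
        "pivot": {
            "group_by": {
                "hour": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "fixed_interval": interval,
                    }
                },
                # Assumed ECS fields; adjust to whatever the events contain.
                "host": {"terms": {"field": "host.name"}},
                "process": {"terms": {"field": "process.name"}},
            },
            "aggregations": {
                "event_count": {"value_count": {"field": "@timestamp"}},
                "last_seen": {"max": {"field": "@timestamp"}},
            },
        },
    }


if __name__ == "__main__":
    body = build_transform_body(
        "logs-endpoint.events.process-*",  # assumed source pattern
        "endpoint-process-hourly",         # hypothetical destination index
    )
    print(json.dumps(body, indent=2))
```

The summary index keeps counts rather than the original events, so whether it is enough for audit purposes depends on which fields you need to retain.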

Happy to continue to discuss here - but you might find our community slack easier - ela.st/slack

James

I've joined the slack; same name if you want to continue w/me on there.

I'm not sure the reason for the error is what you mentioned; I don't see any other signs of ES being overloaded. VM load averages are within reason and IO isn't overloaded. Previously it worked fine with no errors for a few days under the same load.

All rules say last updated 4 hours ago, which is when I removed the rules and integration and re-added.

I also hadn't seen index lifecycle policies before making my post; they seem like they could be useful for my goals as well. Forgive my n00b, I haven't touched Elastic in years and there's a lot more here now.
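As a rough idea of what such a policy could look like, here is a sketch of a body for the ILM API (`PUT _ilm/policy/<policy_name>`) that rolls over the write index by age or size and deletes old indexes after a retention period. All the durations and sizes below are placeholder assumptions; real retention would have to follow whatever our audit requirements turn out to be. The code only builds and prints the JSON:

```python
import json


def build_ilm_policy(rollover_age="7d", rollover_size="30gb", delete_after="90d"):
    """Build an ILM policy body with a hot phase (rollover) and a delete phase."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Roll over when either condition is met.
                        "rollover": {
                            "max_age": rollover_age,
                            "max_primary_shard_size": rollover_size,
                        }
                    }
                },
                "delete": {
                    # Age is counted from rollover, not from index creation.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
```

A policy like this only controls when whole indexes are rolled over and dropped; it does not reduce the fidelity of the data inside them.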

None of the rules had updated in days until I removed all rules and reinstalled the 'first' endpoint security integration I had installed. Quite a few rules are erroring and getting different errors:

An error occurred during rule execution: message: "linux_rare_kernel_module_arguments missing"
An error occurred during rule execution: message: "linux_rare_metadata_process,v2_linux_rare_metadata_process missing"
An error occurred during rule execution: message: "linux_rare_user_compiler missing"
An error occurred during rule execution: message: "packetbeat_rare_urls missing" name: "Unusual Web Request"
An error occurred during rule execution: message: "high_count_network_denies missing" name: "Spike in Firewall Denies"
An error occurred during rule execution: message: "packetbeat_rare_user_agent missing" name: "Unusual Web User Agent"
An error occurred during rule execution: message: "linux_rare_kernel_module_arguments missing" name: "Anomalous Kernel Module Activity"
An error occurred during rule execution: message: "windows_rare_metadata_process,v2_windows_rare_metadata_process missing" name: "Unusual Windows Process Calling the Metadata Service"
An error occurred during rule execution: message: "linux_anomalous_user_name_ecs,v2_linux_anomalous_user_name_ecs missing" name: "Unusual Linux Username"
An error occurred during rule execution: message: "linux_anomalous_network_port_activity_ecs,v2_linux_anomalous_network_port_activity_ecs missing" name: "Unusual Linux Network Port Activity"
An error occurred during rule execution: message: "windows_anomalous_process_creation,v2_windows_anomalous_process_creation missing"

There are currently no Windows agents either.
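The names in the "... missing" messages look like machine learning job IDs that the rules expect to exist (that is an assumption based on the naming, not something the errors state outright). To get a de-duplicated list to check against, a small sketch that pulls the IDs out of the pasted error text; the function name `missing_job_ids` is hypothetical:

```python
import re

# Matches: message: "<id>[,<id>...] missing"
MISSING = re.compile(r'message: "(?P<jobs>[\w,]+) missing"')


def missing_job_ids(log_text):
    """Return the unique IDs reported as missing, in order of first appearance."""
    ids = []
    for match in MISSING.finditer(log_text):
        for job_id in match.group("jobs").split(","):
            if job_id not in ids:
                ids.append(job_id)
    return ids


if __name__ == "__main__":
    sample = (
        'An error occurred during rule execution: message: "linux_rare_user_compiler missing"\n'
        'An error occurred during rule execution: message: '
        '"linux_rare_metadata_process,v2_linux_rare_metadata_process missing"\n'
        'An error occurred during rule execution: message: "search_phase_execution_exception: "\n'
    )
    for job_id in missing_job_ids(sample):
        print(job_id)
```

Messages that don't follow the "... missing" shape, such as the search_phase_execution_exception one, are skipped.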
