Kibana 7.14.2 Event Filter drop fails to drop events

Kibana 7.14.2 is no longer dropping events that are on the Event Filters drop list. This is creating massive amounts of unneeded data: 30+ million events per hour.

In Kibana, go to the Elastic Security, Event filters section.
Add Event Filter.
Enter a name.
In the Field drop-down, select event.code.
Enter a value of 4690 ("Windows Handle Manipulation").
Hit save.

Wait a minute or two. In logs-*, search for event.code 4690; it will be present.
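The search in the last step can also be written as a query body. This is a minimal sketch, assuming event.code is mapped as a keyword (as it is in ECS); build_event_code_query is a hypothetical helper name and the five-minute window is an arbitrary choice.

```python
import json


def build_event_code_query(event_code: str, window: str = "now-5m") -> dict:
    """Build a search body matching recent documents with the given event.code.

    Hypothetical helper for illustration; the result would be sent to
    logs-*/_search.
    """
    return {
        "size": 0,                 # only the hit count is needed
        "track_total_hits": True,  # count every match instead of capping the total
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event.code": event_code}},
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_event_code_query("4690"), indent=2))
```

If the filter were being applied, the total hit count for this body should stay at zero after the filter is saved.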

In 7.14.1 these were dropped accordingly. Updating to 7.14.2 results in all filters no longer being applied. Deleting and re-adding them results in no change.

@PublicName

Can you confirm that the Endpoint policy responses are successful in the Endpoints list? You can access this by navigating to "Security > Endpoints" and clicking on the Status pill under the "Policy Status" section in the Endpoints table.

If you see failures in the Policy response, can you see if they have to do with "User Artifacts"?

As an example - here's a success message in the Events section in the Policy response in Endpoints list

Also, did you update your Endpoint to 7.14.2 or Kibana only?

Further, if you are using Trusted Applications, are these entries still working as expected? Event Filters and Trusted Applications are delivered to Endpoints along the same path so it could help debug further.

I have success across the board on all agents. I even restarted one of them just for giggles...

I have not updated the agents as of yet. As I'll keep saying, please do not make agents dependent on sub-version :-). There needs to be wiggle room in which versions can talk to each other. Larger deployments can't do things instantly. Updating is still an unreliable process, and it normally takes me a week to finish, as some agents will fail for various reasons, so I now avoid doing it in the first few days after the Elastic/Kibana upgrade.

Trusted apps on this dev cluster are empty, so I'm afraid I don't have insight on that part.

Is it possible to have Elastic drop the events as well, not just the agent? Normally it's done at the Beats level or in Logstash, but Fleet is a little more sledgehammer, less scalpel on deployments. Network bandwidth is not a concern for most enterprise environments; if it is, you should have built better. At a glance, dropping only at the agent seems like a rather fun way of doing a denial of service on a cluster, as it's crazy easy to force the agent to send millions of events. Do that to 5 or 6 machines and, unless your cluster is massively oversized, it won't survive. Dropping before write on the Elastic side wouldn't remove the CPU starvation issue with a huge ingest of events, but it would prevent running out of space and missing critical events.

From what it looks like this is just my setup.
Endpoint is set as

Normally I disable Network: the amount of data passed around is quite impressive, but it causes 99% of all captured data to be network events.

This error from logs-* may shed some light for you, but it makes me scratch my head.

Event Code: 4568
error.message: Cannot invoke "java.util.Map.size()" because "m" is null

The handle to an object was closed.

Subject :
Security ID: REVOKED
Account Name: REVOKED$
Account Domain: REVOKED
Logon ID: 0x3E7

Object:
Object Server: Security
Handle ID: 0x3c4

Process Information:
Process ID: 0x2e08
Process Name: C:\Program Files\Elastic\Agent\data\elastic-agent-703d58\install\metricbeat-7.14.1-windows-x86_64\metricbeat.exe

Winlog.task: Registry

Server 2016 with the 9-2021 patches applied. It does not appear on the ones that haven't been updated this week and are still on 8-2021. This has not been tested; it is just an observation.

The Registry task has my attention, as we have auditing enabled on registry keys directly. That happens to be where the very key that the 2 failing services use would live: Metricbeat and "censored".

7.14.2 does not resolve the issue. After 30 minutes it came back.

Auto update failed on 3 of 33 agents. I wouldn't call this safe: when you get to clusters running 400+ agents, that is a pretty high failure rate to correct manually, and during that time millions of events are crushing the cluster. I haven't looked into the failures yet, but it's common to have failures.

Each machine that failed the update is running at 100% CPU utilization (Filebeat + Endpoint).
I attempted to unenroll the agents. The agents are removed from Fleet, but Endpoint and Agent are not removed from the device. Filebeat fails to shut down, which then fails the rest of the uninstall.

@PublicName thank you for the detailed info.

I'll focus on the Event Filters in this post and pull in others who are more knowledgeable on the Agent upgrade failures.

First, a few responses:

I have not updated the agents as of yet. As I'll keep saying, please do not make agents dependent on sub-version :-).

Apologies on this, I should have been more clear. It's certainly not required to upgrade Agents every time, I was asking for debugging purposes.

Is it possible to have Elastic drop the events as well, not just the agent?

As far as I know there is no functionality like this currently. Are you imagining some type of background process that deletes documents already written to ES based on a set of filters?

Regarding Event Filters - judging from the successful Policy Responses on your first post, it looks like they are successfully downloading your Event Filter list. Note that these Event Filters only apply to Endpoint events. Extending Event Filter functionality to the other subprocesses shipped by the Agent is something that we've discussed, but it's not currently implemented.

Can you confirm that the new Events coming in that should be filtered out come from Endpoints? In addition, they must be classified as event.kind: event to be picked up by filters. You can do this by confirming that the documents you see in logs-* contain agent.type: endpoint and event.kind: event.

There should be two sections in the doc:

  "agent": {
    "id": XXXXX,
    "type": "endpoint",
    "version": "7.14.1"
  },
...
  "event": {
   ...
    "kind": "event",
    "module": "endpoint",
   ...
  },
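The same two-field check can be applied in code to a document exported from logs-*. This is a minimal sketch, assuming the document has been loaded as a dictionary; is_event_filter_candidate is a hypothetical name, not part of any Elastic API, and the sample documents are made up for illustration.

```python
def is_event_filter_candidate(doc: dict) -> bool:
    """Return True if the document came from Endpoint and is a plain event.

    Only documents with agent.type "endpoint" and event.kind "event"
    are expected to be picked up by Event Filters.
    """
    agent_type = doc.get("agent", {}).get("type")
    event_kind = doc.get("event", {}).get("kind")
    return agent_type == "endpoint" and event_kind == "event"


# Illustrative documents only; the field values are assumptions.
endpoint_doc = {"agent": {"type": "endpoint"}, "event": {"kind": "event", "code": "4690"}}
other_doc = {"agent": {"type": "filebeat"}, "event": {"kind": "event", "code": "4690"}}

print(is_event_filter_candidate(endpoint_doc))  # True
print(is_event_filter_candidate(other_doc))     # False
```

A document that returns False here would not be expected to match an Event Filter, whatever its event.code.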

Alternatively, you can do this through the UI by going to "Security > Hosts > Events" page and adding a filter to the filter bar like this: agent.type : "endpoint" and event.kind : "event" and event.code : "4690".

Similar to this:

If you confirm that these documents are coming from the Endpoint and are Events, then it's possible the artifacts are not being created correctly. To check, you could look at the policy YAML of the Agent Policy you are using. If you open it up in the UI, you should see a section for the eventfilters list.

Similar to this:

Let me know what you find and I can check on a workaround.

That's fine if you're OK with keeping the events coming into Elastic :-) For the security event log side, here are the event codes I most commonly drop, either with an event filter or from the legacy Beats: 5156, 5157, 5158, 4658, 4656, 4690, 5152, 5447, 5154, 4663, 4703. The bulk of these are Windows Firewall events. Why? Because they are absolutely useless in my environment. 90% of it is broadcast traffic, and we already have network monitoring in place that watches the rest.
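To check in one pass whether any of these codes are still arriving from Endpoint, the filter-bar query suggested earlier in the thread can be generated from a list. This is a minimal sketch; build_kql_filter is a hypothetical helper, and the output is plain KQL text to paste into the filter bar.

```python
def build_kql_filter(event_codes, agent_type="endpoint"):
    """Build a KQL string matching Endpoint events with any of the given codes."""
    # Remove duplicates while keeping the original order.
    unique_codes = list(dict.fromkeys(str(code) for code in event_codes))
    if not unique_codes:
        raise ValueError("at least one event code is required")
    codes = " or ".join(f'"{code}"' for code in unique_codes)
    return (
        f'agent.type : "{agent_type}" and event.kind : "event" '
        f"and event.code : ({codes})"
    )


print(build_kql_filter([5156, 4690, 5156]))
# agent.type : "endpoint" and event.kind : "event" and event.code : ("5156" or "4690")
```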

From the Event list on the policy.
endpoint-eventfilterlist-windows-v1:
  relative_url: >-
    /api/fleet/artifacts/endpoint-eventfilterlist-windows-v1/605031d962e0fafff715fecb1f1a4919d9ea480ebc531030947c8b706675a868
  compression_algorithm: zlib
  decoded_size: 1234
  decoded_sha256: 605031d962e0fafff715fecb1f1a4919d9ea480ebc531030947c8b706675a868
  encryption_algorithm: none
  encoded_sha256: 2f7c5f9e9f1ae327e6c646da09e3dc1524db87c80713adc388016e605ad46b8a
  encoded_size: 147
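The size and hash fields in this entry make it possible to sanity-check a downloaded artifact. This is a minimal sketch, assuming the encoded artifact is a standard zlib stream and that the hashes are hex-encoded SHA-256 digests; verify_artifact is a hypothetical helper, and the demo data is made up rather than taken from a real artifact.

```python
import hashlib
import zlib


def verify_artifact(encoded: bytes, meta: dict) -> bytes:
    """Check an encoded artifact against its policy entry and return the decoded bytes."""
    if len(encoded) != meta["encoded_size"]:
        raise ValueError("encoded_size mismatch")
    if hashlib.sha256(encoded).hexdigest() != meta["encoded_sha256"]:
        raise ValueError("encoded_sha256 mismatch")

    decoded = zlib.decompress(encoded)  # compression_algorithm: zlib

    if len(decoded) != meta["decoded_size"]:
        raise ValueError("decoded_size mismatch")
    if hashlib.sha256(decoded).hexdigest() != meta["decoded_sha256"]:
        raise ValueError("decoded_sha256 mismatch")
    return decoded


# Self-contained demo with placeholder content (not a real artifact).
sample = b'{"entries": []}'
packed = zlib.compress(sample)
entry = {
    "encoded_size": len(packed),
    "encoded_sha256": hashlib.sha256(packed).hexdigest(),
    "decoded_size": len(sample),
    "decoded_sha256": hashlib.sha256(sample).hexdigest(),
}
print(verify_artifact(packed, entry) == sample)  # True
```

If all four checks pass, the artifact on disk matches what the policy advertises, which points the investigation away from artifact delivery.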

Here is the snip from event filters.

Totally my bad, I didn't say that too clearly. It's more from the backend perspective. Minor versions should not be subjected to changes in transforms or other pipeline events if it's going to cause breaks like what happened here, considering this only happened after 7.14.2+ (confirmed on 7.15 as well). 7.14.0 and .1 were dropping events just fine.

This is only happening with the System integration and Security logs set to enabled on the policy. It's the only one causing the bug. Disabling the Security logs obviously drops the events, as those event IDs are in the Windows Security logs. It's a dirty workaround, but I'll go back to Winlogbeat to capture the rest of the events.

The 7.15 update went smoothly on all agents, minus the ones behind load balancers, but that's expected at this point. A quick enable of the Security log and the same error occurred.

As an example of what I mean when I say this can get bad really quickly: this is from a dev machine with 33 agents using Fleet for almost everything. Winlogbeat is still in use on critical servers.

Now, I know that count is inaccurate, sometimes millions below or above what it should be. This is the first time I've seen it above 200 million in a 24-hour window.

@Kevin_Logan

This might be a bit of a problem; I'm not sure if the two share data between the pipelines. I checked for the pipeline ID on two 7.15 on-prem clusters and it was not created.

message: "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xc04d47571bfff47c, ext:4841921073001, loc:(*time.Location)(0x5ff8220)}, Meta:{"raw_index":"metrics-system.core-default"}, Fields:{"agent":{"ephemeral_id":"81cfbfaa-4961-4520-859b-1edc12ffbef1","hostname":"REMOVED","id":"d71c2f08-d8c7-4fcc-92fe-667b17cba812","name":"REMOVED","type":"metricbeat","version":"7.15.0"},"data_stream":{"dataset":"system.core","namespace":"default","type":"metrics"},"ecs":{"version":"1.11.0"},"elastic_agent":{"id":"d71c2f08-d8c7-4fcc-92fe-667b17cba812","snapshot":false,"version":"7.15.0"},"event":{"dataset":"system.core","module":"system"},"host":{"architecture":"x86_64","hostname":"REMOVED","id":"0033295e-8e7b-4d45-992e-6433d23c164b","ip":["x.x.x.x"],"mac":["00:00:00:00:00:00"],"name":"REMOVED","os":{"build":"17763.2183","family":"windows","kernel":"10.0.17763.2183 (WinBuild.160101.0800)","name":"Windows Server 2019 Standard","platform":"windows","type":"windows","version":"10.0"}},"metricset":{"name":"core","period":10000},"service":{"type":"system"},"system":{"core":{"id":0,"idle":{"pct":0.982800},"system":{"pct":0.012500},"total":{"pct":0.017200},"user":{"pct":0.004700}}}}, Private:interface {}(nil), TimeSeries:true}, Flags:0x0, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400): {"type":"illegal_argument_exception","reason":"pipeline with id [metrics-system.core-1.1.2] does not exist"}, dropping event!"
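The missing pipeline ID is buried at the end of that message. This is a minimal sketch for pulling it out and building the path to look it up with GET /_ingest/pipeline/<id>; both function names are hypothetical, and the sample string is only the tail of the message above.

```python
import re
from typing import Optional

# Matches the "reason" text Elasticsearch returned in the error above.
_MISSING_PIPELINE = re.compile(r"pipeline with id \[([^\]]+)\] does not exist")


def missing_pipeline_id(message: str) -> Optional[str]:
    """Return the pipeline ID named in a 'does not exist' error, or None."""
    match = _MISSING_PIPELINE.search(message)
    return match.group(1) if match else None


def pipeline_lookup_path(pipeline_id: str) -> str:
    """Path for a GET request that checks whether the ingest pipeline exists."""
    return f"/_ingest/pipeline/{pipeline_id}"


reason = (
    '{"type":"illegal_argument_exception",'
    '"reason":"pipeline with id [metrics-system.core-1.1.2] does not exist"}'
)
pipeline_id = missing_pipeline_id(reason)
print(pipeline_id)                        # metrics-system.core-1.1.2
print(pipeline_lookup_path(pipeline_id))  # /_ingest/pipeline/metrics-system.core-1.1.2
```

A 404 response for that path would confirm that the pipeline was never created on the cluster.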

@PublicName apologies for the late reply.

I'm taking a closer look at the configuration that you have.

This is only happening with the System integration and Security logs set to enabled on the policy. It's the only one causing the bug. Disabling the Security logs obviously drops the events, as those event IDs are in the Windows Security logs. It's a dirty workaround, but I'll go back to Winlogbeat to capture the rest of the events.

If you are only using the System integration and Security logs, then the Event Filters feature will not filter out any of these events, because it only affects events sent by the Endpoint integration.

As you said, it's just the Security logs that's causing this problem. To ensure that we're talking about the same thing, can you confirm that the below configuration in System integration is where you see the bug?

I mostly work with the Endpoint integration, so I'll raise this with others who are more familiar with the System integration.

Confirmed for Fleet agents.
Collect events from the Windows event log, Security.

Full list of integrations I use extensively: Elastic Endpoint + System + Windows.
Endpoint because, well, it's good. System to capture what standalone Metricbeat and Winlogbeat do. I know Endpoint and System still have some duplicate data, but Endpoint isn't the most reliable for some fields. The Windows integration is there to pull PowerShell logs. The only thing missing is a way to pull the Windows Terminal Services logs.

Agreed, this shouldn't have an effect on legacy Beats modules, as they have their own processors and go to different indices, but as is it's not affecting Endpoint integrations.

I only bring this up as I'm eager to kill off the legacy Beats in favor of mostly centrally managed integrations. Enabling the System integration while using Endpoint ended with the wonderful Java error "error.message: Cannot invoke "java.util.Map.size()" because "m" is null" and nothing being dropped, forcing me to disable the Security captures in System. With that single portion disabled, no more cluster-crushing good times.

@PublicName What version of the System integration do you have installed? On the Settings page for the integration you can see the version (as shown in this post: Elastic Endpoint - Filebeat - Java Error - #5 by andrewkroh).

@andrewkroh
For me it's system version 1.1.2 with no upgrades available. Elastic and Kibana 7.15.0.

Identical error as Elastic Endpoint - Filebeat - Java Error - #7 by andrewkroh
Cannot invoke "java.util.Map.size()" because "m" is null

This happens only when the Security option is enabled for the logs, at least from what I've seen. I'm able to trigger the alerts at will by enabling Security logs. I did a quick check on the client side for Winlogbeat to see if the Java file was present, like on the normal legacy Beats. No luck, as it looks embedded. I was going to swap in the file from a working version and overwrite it to see, but that failed.

We're tracking the issue in [system] error.message for java.util.Map.size() because "m" is null · Issue #1789 · elastic/integrations · GitHub.
