Elastic Endpoint 8.1 - File locking issues

I recently updated to 8.1, even though it's only been out for a day...

Endpoint is locking certain files, which stalls applications and causes huge problems! Applications crash and corrupt database files. The only workaround has been adding dozens of trusted applications, which defeats the purpose of having the agent.

I want to bring this to attention as it's only happening on Windows Server 2008 R2 and Windows Server 2012 R2 devices. While these are legacy and no longer supported, if anyone else runs into the issue you'll need to add the bypass. Not ideal, but it will get you running again.

The locked files in question are .DAT and .LOG files. The .LOG files are binary logs, not text log files.

Edit Add:
YaraLib.cpp:94 Yara rule (Windows_Trojan_CobaltStrike_8519072e) compile warning: $a4 is slowing down scanning"
Yara rule (NgramRule_64) compile warning: $hex_string is slowing down scanning","process":{"pid":6848,"thread":{"id":7012}
Yara rule (NgramRule_193) compile warning: $hex_string is slowing down scanning","process":{"pid":6848,"thread":{"id":7012}
Yara rule (NgramRule_240) compile warning: $hex_string is slowing down scanning","process":{"pid":6848,"thread":{"id":7012}

These are the rules that appear to be causing problems. The same warnings were found on several servers.
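
For anyone who wants to pull the same rule names out of their own Endpoint logs, here is a minimal Python sketch. It assumes the warning text has the same shape as the fragments above (`Yara rule (<name>) compile warning: <string> is slowing down scanning`). The function name and the regular expression are made up for illustration and are not part of any Elastic tooling.

```python
import re
from collections import Counter

# Pattern assumed from the log fragments quoted above; other Endpoint
# versions may word the warning differently.
WARNING_RE = re.compile(
    r"Yara rule \((?P<rule>[^)]+)\) compile warning: "
    r"(?P<string>\$\w+) is slowing down scanning"
)


def slow_yara_rules(lines):
    """Count (rule name, string identifier) pairs flagged in the given lines."""
    counts = Counter()
    for line in lines:
        for match in WARNING_RE.finditer(line):
            counts[(match.group("rule"), match.group("string"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'Yara rule (NgramRule_64) compile warning: $hex_string is slowing down scanning"',
        'Yara rule (NgramRule_193) compile warning: $hex_string is slowing down scanning"',
    ]
    for (rule, string), seen in slow_yara_rules(sample).most_common():
        print(f"{rule} {string}: {seen}")
```

Feeding it every line of a copied log file gives a quick tally of which rules show up most often across servers.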

Hi @PublicName. I'm sorry to hear that you're encountering issues. What version did you update from?

Would you be okay capturing a Process Monitor log of the issue occurring so we can take a closer look? If so, I'll DM you a secure upload link.

7.17 to 8, then 8 to 8.1. Some of the devices were new installs of 8.1, as the Fleet update option fails when the device is behind a load balancer.

Of course, I can try to grab the log for you. It happens at random times, so it's difficult to capture. Most of the devices have 1k+ connections running at a time, so the log is massive spam until it happens. Dev is only lightly hit, so it wasn't happening there at all. I have one of the application logs that shows the locks pretty clearly, but it is missing the process info, so it's not directly helpful to you. Two devices required more than a dozen trusted application entries to get started again.

Are you adding Trusted Applications or Exceptions? The former is better suited to performance issues and conflicts.

Trusted applications. Some other AV products call them exceptions or exclusions, so in my head they're one and the same.

Exceptions in Elastic, from what I see, exclude a particular known and accepted event from being flagged by the SIEM rules. For example, DNS traffic to the internet would be expected from a domain controller but not from a workstation, so you add the servers as an exception and don't get hammered to death with false alerts.

I'll edit that to say trusted or it will get confusing quick...

@PublicName We believe we've identified the issue - a bugfix introduced in 7.17.1, 8.0.1, and 8.1.0 to address file sharing violations came with a race condition that can sometimes manifest as a handle leak. These leaked handles can prevent deletion of files via the FILE_FLAG_DELETE_ON_CLOSE flag.
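
To illustrate why a leaked handle matters here: the Windows `CreateFile` documentation says a file opened with FILE_FLAG_DELETE_ON_CLOSE is deleted only after all of its handles are closed. The toy Python model below just shows that bookkeeping. It is not Endpoint's code and not a real Windows API; the class, method, and file names are hypothetical, and sharing modes are ignored.

```python
class DeleteOnCloseFile:
    """Toy model of a file that is removed once its last handle is closed."""

    def __init__(self, name):
        self.name = name
        self.open_handles = set()
        self.delete_pending = False
        self.exists = True
        self._next_id = 0

    def open(self, delete_on_close=False):
        handle = self._next_id
        self._next_id += 1
        self.open_handles.add(handle)
        # The flag marks the file for deletion once every handle is gone.
        self.delete_pending = self.delete_pending or delete_on_close
        return handle

    def close(self, handle):
        self.open_handles.discard(handle)
        if self.delete_pending and not self.open_handles:
            self.exists = False


temp = DeleteOnCloseFile("scratch.dat")  # hypothetical file name
app = temp.open(delete_on_close=True)    # the application's own handle
other = temp.open()                      # a second handle held by another process
temp.close(app)
print(temp.exists)  # True: the second handle is still open
temp.close(other)
print(temp.exists)  # False: the last handle is closed, so the file goes away
```

If the second handle is never closed, the file stays on disk for as long as the process holding it keeps running.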

We have a fix in testing now that we're hoping will make it into 8.1.1 which is currently due for release next week. Thank you for your help and patience identifying this issue.

This is why you guys are awesome! That was a quick turnaround. Hoping it makes the cut for 8.1.1.

Hi @PublicName,

I have good news - 8.1.1 is now released!

Please let us know if you have any other issues, and thank you for providing valuable feedback.

Regards,
Gabriel

Perfect, updating now. I'll let you know if I still have the locks.

So far so good. It's only back on a few machines. I did notice a few things with 8.1.1:

  1. Server 2008, 2012, and 2012 R2 all end up in an Unhealthy state, even after a fresh reinstall. Updating from 8.1 to 8.1.1 requires a restart, and even then it's 50/50 whether it works. Not a huge issue as they are technically end-of-life OSes. On 2016+ I did not run into this issue.

  2. The dreaded impossible-to-delete file issue in the Endpoint folder came back on a few machines, which makes it impossible to update them. This happened back in 7.10 as well, on completely random machines for no apparent reason.

What you end up seeing is that even after granting Administrators full control on NTFS (recursively, or even one by one) and changing the owner to Administrators or yourself, you cannot delete the files. Even in safe mode they are locked. It's almost like they are immutable, which I didn't think was possible on Windows. On Linux you just clear the attribute with chattr -i and you can delete.

This has only happened on 2 machines, which makes it confusing. These machines have no internet access either, and only a handful of people know they exist.

  3. Not related to Agent or Endpoint, but Kibana is slow af with 8.1.1. No idea if any changes were made (too lazy to look), but the SIEM tab drops to Chrome asking to please wait...

  4. Endpoint still uses 50% CPU on a dual-core proc, which is common on smaller servers. This is way higher than comparable services like VMware Defense, for example, which never exceeded 15% and offers the same visibility; it just lacks decent search or layout like Elastic.

@PublicName how did you find the rules which were causing the issues? We too are experiencing this issue, but we are running an older version, 7.16.2. If you could let me know how you logged the rules, it would help us find out what's causing this as well.

I'm glad 8.1.1 is working better for you but of course dismayed there are still issues. The issue we fixed in 8.1.1 was introduced in 7.17.1/8.0.1/8.1.0. We'll keep digging to try to find what's causing the problem with older releases like 7.16. Could either of you share some details to help us narrow down the problem?

  • It seems the problem occurs on Linux and Windows?
  • Does the problem persist when Endpoint Security is removed?
    • Endpoint doesn't run in safe mode, so my guess is it does.
  • Can you give any indication on which file paths are affected? Are the files in the temp directory? Application files? On a remote share?
  • Are there any features in Endpoint's Policy that when disabled cause the problem to stop occurring?

For Endpoint logs you can check in:
"C:\Program Files\Elastic\Endpoint\state\log"

To read them you will need to either copy the Endpoint folder to another location, launch Notepad as an admin from an elevated command prompt, or connect through the admin shares. Then you can open the files as normal.
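
A minimal Python sketch of the copy approach, assuming it is run from an elevated prompt on the affected server. The log path is the one quoted above; the function name and the scratch folder prefix are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path

# Log location quoted above; adjust if Endpoint is installed elsewhere.
ENDPOINT_LOG_DIR = Path(r"C:\Program Files\Elastic\Endpoint\state\log")


def copy_logs(source, dest_root=None):
    """Copy a log directory to a scratch folder and return the copied files."""
    dest_root = Path(dest_root or tempfile.mkdtemp(prefix="endpoint-logs-"))
    dest = dest_root / "log"
    # copytree refuses to overwrite, so the originals are never touched.
    shutil.copytree(source, dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())


if __name__ == "__main__":
    if ENDPOINT_LOG_DIR.is_dir():
        for path in copy_logs(ENDPOINT_LOG_DIR):
            print(path)
    else:
        print(f"{ENDPOINT_LOG_DIR} not found; is Endpoint installed here?")
```

The copies can then be opened in any editor or searched for the compile warning text.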

What oddball things are you seeing? 7.16.2 was remarkably stable for me.

@ferullo

  • For us it occurred on Windows with agent 7.16.2.

  • We couldn't test unfortunately because it was causing huge production issues and the client had to remove all the agents.

  • The files that were getting locked were on drives other than C:, such as D:, E:, and F:. The files that got locked up were .DAT files.

  • I am trying to get some non-production servers and run the agent on them again, but the issue was so random that I don't know if it will happen again on the non-production servers.

I had the WINS service on a Windows Server 2016 machine being locked up by this using Agent 7.17.1.
In the support ticket I had with Elastic, I was told 7.17.2 should fix this as well, but the changelog doesn't mention file locks. Is it fixed there, or do we have to do a major upgrade to 8.1.1?

The file locking fix is in 8.1.1.

Be warned: 8.1.1 has other problems that have started to show back up. It has been sucking up all machine resources like CPU and memory, which forced me to keep it off and stay with our fallback AV setup. It only seems to happen on machines with a high number of network connections.

And holy smokes, WINS! That's a service that I thought had disappeared from the world.

Hi @slash24 The issue that was introduced in 7.17.1, 8.0.1, and 8.1.0 was fixed in 8.1.1 and 7.17.2. The release notes for both include a comment "Fixes an Endpoint Security integration bug that prevented benign Windows files from being deleted under certain circumstances" which refers to that fix.
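
As a quick way to check a fleet inventory against the versions named in this thread, here is a small Python sketch. The version lists come only from the posts above and are not an official compatibility matrix. The helper names are hypothetical, and mapping 8.0.1 to 8.1.1 is an assumption, since the thread names no separate 8.0.x fix.

```python
# Releases the thread says introduced the handle leak regression.
AFFECTED = {"7.17.1", "8.0.1", "8.1.0"}

# Fix releases named in the thread. The 8.0.1 entry is an assumption:
# only 7.17.2 and 8.1.1 are mentioned as carrying the fix.
UPGRADE_TO = {"7.17.1": "7.17.2", "8.0.1": "8.1.1", "8.1.0": "8.1.1"}


def has_known_regression(version):
    """True if this version is one the thread lists as affected."""
    return version.strip() in AFFECTED


def suggested_upgrade(version):
    """Return the fix release for an affected version, else None."""
    return UPGRADE_TO.get(version.strip())


if __name__ == "__main__":
    for installed in ["7.16.2", "7.17.1", "8.1.0", "8.1.1"]:
        print(installed, has_known_regression(installed), suggested_upgrade(installed))
```

Versions outside the list, such as 7.16.2, simply come back as not affected by this particular regression. That says nothing about other causes of file locks.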

We have not seen internally or found a way to reproduce the issue described by @ame123 in 7.16. If we can fix it, we'd fix it in 7.17 and 8.x. If anyone reading this is seeing the issue with 7.16 and is able to debug it with us please reach out, we'll DM with you to gather the data we need.

@PublicName what version was previously working well for you? Can you describe the type of load on the machine(s) that have CPU and memory issues so we can try to replicate it?

EDIT: I need to read better. You did describe the load.

Thanks for the heads-up regarding 8.x.

We certainly don't utilize WINS, but the corresponding service was running, and its being unable to operate caused the VSS writers to halt, which in turn caused backups to fail. It's all or nothing when VSS is involved.

The file lock issue has not been resolved in 8.1.1; I just had it happen again.