So the cluster has been running for about 2 months with no issues or changes. Then recently, within the past 5 days:
Bulk Indexing of signals failed:
reason: "No mapping found for [@timestamp] in order to sort on"
type: "query_shard_exception"
name: "RPC (Remote Procedure Call) from the Internet"
id: "d47c9279-89e9-489e-b626-639f12103f0b"
rule id: "143cb236-0956-4f42-a706-814bcaa0cf5a"
signals index: ".siem-signals-default"
On about 60+ rules.
I haven't had the time to go back and figure out why this happened. Does anyone have a pointer? It looks easy to fix since it's just a missing mapping, but I'm not sure what the safest course of action would be at this point.
The indexes they are looking at probably have a mapping issue where they no longer have a mapping for @timestamp; that's where you want to start. Take one of the rules it is failing on, look at the mapping for the indexes it's using, and make sure each of those indexes has a valid mapping for @timestamp.
Correct. And to make it even worse, even a manual override to event.ingested fails as well. This is 60+ rules that I have tried, all with the same failure. Odd that it works on the dev cluster, while the prod one, which is never modified until after testing has been completed, is the one having the issues.
The ".siem-signals-default" is 100% default configuration.
Please keep in mind none of these indexes have ever been modified by me. The pre-made hidden indexes are never touched; I leave those to you wonderful devs to tinker with.
It has been confirmed as an issue on GitHub with 7.10.
After duplicating several of them it's even stranger. Some will work with event.ingested and some won't, so it's clear that the index name exists.
So we know: are you on 7.10, 7.10.1, or 7.10.2 (the latest)?
I think the message it is outputting is a bit misleading. Although it points out that it is writing to the:
signals index: ".siem-signals-default"
That's not really where the query is likely having issues, or at least it shouldn't be. That message is giving us additional information so we know where it is trying to write data. For the source indexes of one of the failing rules, I would check the mappings to see if for some reason someone has added an additional index without a mapping for @timestamp.
For example, if a rule that is failing has a source index of filebeat-* you can run this query:
GET filebeat-*/_mapping/field/@timestamp
In your Dev Tools, then see whether each of the indexes has a mapping for @timestamp.
If one doesn't, you will see something like this in your output:
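(The index names below are just placeholders.) An index that still has the field returns its date mapping, while an index that is missing it comes back with an empty mappings object:

{
  "filebeat-7.10.0-2021.01.10-000001": {
    "mappings": {
      "@timestamp": {
        "full_name": "@timestamp",
        "mapping": {
          "@timestamp": {
            "type": "date"
          }
        }
      }
    }
  },
  "filebeat-7.6.0-2021.01.10": {
    "mappings": {}
  }
}

Any index showing "mappings": {} for @timestamp is the one the sort is tripping over.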
Had 1 laptop that came back up about 2 weeks ago, just as this started, still running the winlogbeat 7.6 agent since it's a remote machine.
So now the scary part: this is very fragile if all it took was an old agent to take out a large part of the security platform. Any chance of having the failed ones ignored and noted instead of taking out an entire detection rule set? Seems like it would be a very fun way to go undetected in a network for a long time.
That's great news if you have found the problem. Hopefully you just added the needed mapping and are good now?
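If not, a minimal sketch of what that could look like in Dev Tools (the index name here is only a placeholder for whichever index came back without the field):

PUT winlogbeat-7.6.0-2021.01.10/_mapping
{
  "properties": {
    "@timestamp": {
      "type": "date"
    }
  }
}

That only adds the mapping going forward for documents that actually contain the field; documents already indexed without an @timestamp value won't get one retroactively.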
I pointed several people to this post and will hopefully hear back soon. To let you know, for the upcoming release we do have several different types of fixes for how we handle failures, as you're suggesting:
Those two are mostly so that we can help migrate people to using both the @timestamp and event.ingested timestamps together, so we can sort on both and use one as a fallback for the other. It looks like that second PR's intent is to show only partial error messages and still continue if one index is missing a timestamp.
I'm double checking but I think we will be able to avoid these issues in the upcoming release for everyone.
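In the meantime, the general shape of that fallback can be sketched with a plain search sort. This is not the detection engine's actual code, just an illustration of sorting on both fields while telling Elasticsearch how to treat indexes where one of them is unmapped:

GET filebeat-*/_search
{
  "sort": [
    { "@timestamp": { "order": "desc", "unmapped_type": "date" } },
    { "event.ingested": { "order": "desc", "unmapped_type": "date" } }
  ]
}

The unmapped_type option is what keeps an index that is missing the field from throwing the query_shard_exception shown in the original error message.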