I have seen several people ask this question, but there has been no complete solution delivered. Several people have said "just do this" or "just do that" without telling us what "this" or "that" are.
So the requirement is:
receive messages (syslogs, and various beats inputs) into Logstash
pull the messages apart to understand the content
send all messages to Elastic for storage and searching using ECS
send a selected group of messages (based on what we saw when we pulled them apart) to a SIEM tool in syslog format, BUT it has to be in the original syslog format without additional Logstash fields in front of it, so that the SIEM tool can understand the content.
In one sentence - everything goes to Elastic for later searching, and a subset of messages goes to the SIEM if they are interesting security events.
Has anyone got a pattern for this? It seems like a fairly common requirement.
Thanks
Ross
The reason that you don't see any "complete solution" is that while it sounds so simple, it is actually really, really complicated because of all of the different variations of logs that might be observed. To explain how to do it, and I mean "do it at large scale, high throughput, highly available and in realtime", would literally mean writing a book... a thick 400- or 500-page book.
To arrive at what you are asking is a journey. You start with one source and work to get it doing what you want. Then you add another source, and you are at the first hurdle... how to "detect" which kind of message it is and handle either format as needed. And then the vendor updates their software and the message format changes slightly, so you have to add in a bunch of logic to handle various "what if" scenarios. Then comes the third source, and things start to get unwieldy, so you need to break it up... perhaps into multiple pipelines, maybe even multiple Logstash instances... so maybe you need Kafka to interconnect all of those instances. And the journey continues...
You could easily spend at least a year working to achieve what you so simply requested in 4 bullet points. Or you can hire a consultant who can do it for you in a fraction of that time. But make sure that they can really deliver what they promise (because they already have that year+ of learning curve behind them). Unfortunately there aren't many of them out there. Either they know the Elastic Stack and other needed technologies, but don't really understand the use case (this is IMO what you see in many of Elastic's own modules). Or they know the use case inside and out, but don't know enough about the Elastic Stack to make it a reality.
Either way it is a challenge... you either put in the work, or find the right help and spend some money.
Sorry. I know that isn't really the answer for which you hoped. However, I wanted to be honest with you about why you never see the answers you want.
Thanks for that.
We understand the magnitude of what we are looking at, and at the moment we are just trying to get one log source going to prove that the data pipeline we are proposing will deliver on our business need. I guess I was just venting that there seem to be some people on the forums who know some of the answers, but they know the product so well that their answers are very high level ("you just do blah", but "blah" is not documented anywhere) and don't really help those of us who are still learning the capabilities of the solution.
This deck was intended to be presented by me while walking through each step on a live system (using the sample data and pipeline config which are also in the repo), but you may find it helpful.
Hi Rob, thanks for that link. We are making some progress. We have decided to set the Logstash input as raw UDP on port 514, and the first thing we do is duplicate the raw packet into the metadata. We then output to two pipelines: one does our advanced processing (pulling apart the logs, converting to ECS, etc.) before sending to Elastic, and the second uses the "spoof" output processor to send the original UDP message (that we copied into the metadata) to our existing SIEM as a raw syslog message, faking the source IP etc. so that the SIEM can recognise and process the log data properly.
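In case it helps anyone else, the overall shape is roughly the pipeline-to-pipeline "distributor" arrangement. A simplified sketch of what I mean (pipeline IDs, hosts and addresses are placeholders, and the SIEM output is shown with the stock syslog output as a stand-in for the spoofing output we actually use):

```
# pipelines.yml (simplified sketch, not our production config)
- pipeline.id: intake
  config.string: |
    input { udp { port => 514 } }
    filter {
      # keep an untouched copy of the raw packet; @metadata is never indexed
      # (if @metadata does not survive the pipeline hop in your version, use a normal field instead)
      mutate { add_field => { "[@metadata][original_message]" => "%{message}" } }
    }
    output {
      pipeline { send_to => ["ecs_enrichment", "siem_forward"] }
    }

- pipeline.id: ecs_enrichment
  config.string: |
    input { pipeline { address => "ecs_enrichment" } }
    # ... grok/dissect, ECS field mapping, etc. ...
    output { elasticsearch { hosts => ["https://elastic.example.internal:9200"] } }

- pipeline.id: siem_forward
  config.string: |
    input { pipeline { address => "siem_forward" } }
    filter {
      # restore the original syslog payload so no Logstash fields are prepended
      mutate { replace => { "message" => "%{[@metadata][original_message]}" } }
    }
    output {
      # placeholder only: the stock syslog output cannot fake the source address
      syslog {
        host     => "siem.example.internal"
        port     => 514
        protocol => "udp"
        facility => "local0"
        severity => "informational"
      }
    }
```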
Seems to be working well, but we now have the immense task of parsing the logs from a number of network devices into ECS.
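For what it's worth, the pattern we are settling on for each device type is: grok (or dissect) the vendor message, then put the captured values straight into ECS field names. A made-up example for a firewall-style "deny" line (the pattern and field names are purely illustrative, not from a real device):

```
filter {
  # hypothetical input like: "DENY TCP 10.1.2.3:51514 -> 203.0.113.9:443"
  grok {
    match => {
      "message" => "%{WORD:[event][action]} %{WORD:[network][transport]} %{IP:[source][ip]}:%{NUMBER:[source][port]:int} -> %{IP:[destination][ip]}:%{NUMBER:[destination][port]:int}"
    }
    tag_on_failure => ["_grok_no_match"]
  }
  mutate {
    # ECS uses lowercase values for fields like network.transport
    lowercase => [ "[event][action]", "[network][transport]" ]
  }
}
```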
Thanks again for the help
Ross