Checking log parsing using the Simulate API

Disclaimer: As with the OP of the linked thread, I too am learning but the Simulate API is a little confusing to me so apologies for my noobiness. Thank you in advance for your insights and feedback.

In this thread ES 7.15 FileBeats Sophos XG module not separating data into variables, @stephenb tested the parsing of a log message using the Simulate API. When they did this, the message used was not the actual raw log message but rather a message that was already in ES. Perhaps I am mistaken but I thought part of the functionality of Filebeat was to parse log messages so I have these questions:

  • Why was the message in ES used instead of the original raw log?
  • Can raw log messages also be used with the Simulate API to test the parsing and ingestion of raw logs from a given source?
  • When @stephenb said "... I took the sample message you provided and put in a %{SYSLOG5424PRI} the <44>...", what is meant here? When I look at my Sophos ingest pipeline, I do see the %{SYSLOG5424PRI} but not the <44> (example below), so I'm not sure what their statement refers to or why it is relevant:
"grok": {
      "field": "message",
      "patterns": [

Hi @aleding

Great questions, let's see if I can help.

With Filebeat and Elasticsearch there are options for parsing in Filebeat on the originating host and / or in Elasticsearch with an ingest pipeline.

There are pros and cons to both but often the goal is to minimize the processing on the source host.

In the example above, the parsing occurs in Elasticsearch: the raw message is harvested and sent TO Elasticsearch, and then the ingest pipeline that runs IN Elasticsearch performs the parsing.

  • Why was the message in ES used instead of the original raw log?

When filebeat harvests the raw message it must send it to Elasticsearch in JSON format, so the contents of the raw log line end up in the message field of the JSON sent to Elasticsearch.

The message field is essentially the raw message.

The <44> is an example of the Syslog Priority field from a sample message. It will not be in the pipeline; the <44> gets parsed by the %{SYSLOG5424PRI} in the grok statement.
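
To make the <44> part concrete, here is a rough Python sketch. It is not the grok engine: the regex is only my approximation of what %{SYSLOG5424PRI} matches, and the sample line is a made-up, shortened one. The facility/severity split uses the syslog convention that priority = facility * 8 + severity.

```python
import re

# Approximation of the grok pattern %{SYSLOG5424PRI}: "<", digits, ">".
# This regex is an assumption for illustration, not the grok source.
SYSLOG_PRI = re.compile(r"^<(?P<pri>\d+)>(?P<rest>.*)$", re.DOTALL)


def split_priority(message):
    """Return (priority, facility, severity, rest), or None without a <PRI> prefix."""
    match = SYSLOG_PRI.match(message)
    if match is None:
        return None
    pri = int(match.group("pri"))
    # Syslog encodes the priority as facility * 8 + severity.
    return pri, pri // 8, pri % 8, match.group("rest")


# Hypothetical shortened sample line; only the <44> prefix matters here.
sample = '<44>device="SFW" log_type="Firewall"'
print(split_priority(sample))
```

A line without the <NN> prefix returns None here, which is roughly the situation where a grok pattern starting with %{SYSLOG5424PRI} would not match.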

What the _simulate API allows is testing of an ingest pipeline with sample data... great for building and debugging.
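
As a minimal sketch of what such a test looks like, the Python below only builds the JSON body you would POST to _ingest/pipeline/_simulate (it does not talk to a cluster). The inline pipeline with a single grok processor is a simplified stand-in, not the real module pipeline, and the sample line is hypothetical.

```python
import json


def build_simulate_request(raw_line):
    """Build a request body for POST /_ingest/pipeline/_simulate (inline pipeline)."""
    return {
        "pipeline": {
            "processors": [
                {
                    "grok": {
                        "field": "message",
                        # Simplified pattern for illustration; the real module
                        # pipeline has more patterns and more processors.
                        "patterns": ["%{SYSLOG5424PRI}%{GREEDYDATA:log.original}"],
                    }
                }
            ]
        },
        # Each test document carries the raw log line in _source.message,
        # the same field Filebeat uses for it.
        "docs": [{"_source": {"message": raw_line}}],
    }


# Hypothetical shortened sample line.
body = build_simulate_request('<44>device="SFW" log_type="Firewall"')
print(json.dumps(body, indent=2))
```

To test an existing pipeline instead of an inline one, the same "docs" array can be sent to _ingest/pipeline/<pipeline id>/_simulate without the "pipeline" section.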

Hope that helps!

Hi @stephenb - So thanks to you, I think this just clicked for me...Let me restate to make sure I really have this...

  1. When the message is harvested, it comes into the Sophos pipeline which has a bunch of processors one of which being (at the very top) a Grok that basically grabs SYSLOG5424PRI and the rest of the message via GREEDYDATA:log.original.

  2. Then, the message is further parsed via the remaining processors but in the case of the Sophos module, we had some issues so essentially those later processors really don't do anything.

  3. However, in order to test, we need to provide the Simulate API with an input that is the same as the result of the last successful parse - i.e. the Grok mentioned above. This is why we need to prepend the raw log message with what the Grok already did which is basically, the %{SYSLOG5424PRI} <44>.

Did I get this right?

Yes that is correct


Hmmm not sure what "Simulate API with an input that is the same as the result of the last successful parse" means

In general you have 2 choices if the data does not match the ingest pipeline: a) update the pipeline or b) update the data (usually not easy since it is coming from a packaged app).

Why don't you share a sample message so I can take a look... and perhaps show what options there are.

Hi @stephenb - again, really appreciate your time with me here...

Here is a typical firewall log message from my Sophos XG device that is sent to the Sophos XG Filebeat module:

Apr 15 12:35:43 fw_test device="SFW" date=2022-04-15 time=12:35:44 timezone="PDT" device_name="SFVH" device_id=C010017U4EHYW6K log_id=010101600001 log_type="Firewall" log_component="Firewall Rule" log_subtype="Allowed" status="Allow" priority=Information duration=61 fw_rule_id=16 nat_rule_id=3 policy_type=2 user_name="win10_wifi" user_gp="CL_ALL-1" iap=15 ips_policy_id=8 appfilter_policy_id=0 application="" application_risk=0 application_technology="" application_category="" vlan_id="" ether_type=Unknown (0x0000) bridge_name="" bridge_display_name="" in_interface="Port1" in_display_interface="Port1" out_interface="Port3" out_display_interface="Port3" src_mac=ba:da:ce:01:01:01 dst_mac=ba:da:ce:02:02:02 src_ip= src_country_code=R1 dst_ip= dst_country_code=USA protocol="TCP" src_port=51811 dst_port=443 sent_pkts=40  recv_pkts=31 sent_bytes=10419 recv_bytes=9384 tran_src_ip= tran_src_port=0 tran_dst_ip= tran_dst_port=0 srczonetype="LAN" srczone="LAN" dstzonetype="WAN" dstzone="WAN" dir_disp="" connevent="Stop" connid="3292025856" vconnid="" hb_health="No Heartbeat" message="" appresolvedby="Signature" app_is_cloud=0

My understanding was that the Simulate pipeline API simulates the same function that is done by Filebeat. I also thought Filebeat basically took the log message received from the Sophos FW, parsed the data, and then sent that parsed version in JSON format up to ES. With this in mind, it made sense to me that I should use the raw message (included above) as the input into the Simulate pipeline API. I figured anything else would result in more parsing issues.

What is confusing to me is that If I am trying to test if Filebeat is parsing correctly, why would I use a document already submitted to ES, that has already been parsed (albeit not as desired), rather than a raw log message?

I know there is something I'm missing here but just not sure what that is...

Yes, give me a sample document in Elasticsearch. Go to Discover, pull the JSON of one of the entries, and post it here. Redact any IP addresses etc.

No, Simulate does not simulate filebeat. It simulates the ingest pipeline that lives in Elasticsearch and is associated with the filebeat sophos module... that pipeline gets called when filebeat sends a document (log line in JSON) to Elasticsearch and the document is parse throu

Filebeat is not pre-parsing the data as far as I know...

The data flow looks like this, depending on the input you defined in the sophos.yml file. Please share that file.

If reading from a file
sophos text log line -> filebeat reads log file -> filebeat sends json message with message field as raw log line -> elasticsearch -> ingest pipeline to parse data -> document written in elasticsearch

or syslog udp / tcp
syslog -> filebeat reads syslog -> filebeat sends json message with message field as raw log line -> elasticsearch -> ingest pipeline to parse data -> document written in elasticsearch
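
A tiny Python sketch of the "json message with message field as raw log line" step, heavily simplified: real Filebeat events carry many more fields, and the sample line is a shortened, hypothetical one. The point is only that the log line arrives unparsed.

```python
import json


def to_event(raw_line):
    """Wrap a raw log line in a JSON-style event (heavily simplified)."""
    # Real Filebeat events carry many more fields (timestamp, agent info, ...);
    # only "message" matters for this illustration.
    return {"message": raw_line.rstrip("\n")}


# Hypothetical shortened log line.
raw = 'Apr 15 12:35:43 fw_test device="SFW" log_type="Firewall"\n'
event = to_event(raw)
print(json.dumps(event))
```

Nothing like device or log_type exists as its own field at this stage; those only appear after the ingest pipeline has run in Elasticsearch.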

Please share your sophos.yml and a sample json data from Discover in kibana

Because I want to see it! :slight_smile: so I can help... Yes, it has been partially parsed, but I want to see the message field, because the parsing does not happen at filebeat, it happens in Elasticsearch...

Help me ... help you... if you want

Oh yes - me want - and I will show you the kwan (shameful Jerry Maguire ref)...

But first, need to clarify a couple things:

  1. I don't have a sophos.yml but I do have several files that make up the Sophos XG module (i.e. pipeline.yml, firewall.yml, wifi.yml, event.yml, etc. - 12 in all)

  2. As to the JSON doc - it is hugely long - you sure you want me to post that here?

What documentation are you following? We have a built-in module for sophos. Are you following some blog or something else??

Are you using Logstash in the middle? What is your architecture and what are you actually doing? Now I'm totally confused.
You should be using our built-in sophos module.

This will be the quickest path to success.

You should be using this quick start guide, but with the sophos module.

Ahhh - ok...

So first, I've been using only the Elastic docs. However, my implementation uses Salt to handle all config changes and as such, I have Filebeat module configs bundled into a single YAML file. I have included the Sophos module portion below:

- module: sophos
  enabled: true
  var.input: udp
  var.syslog_port: 9514
  var.default_host_name: fw_test

All of my modules are the ones that come with Filebeat - nothing built by me. So the manifest.yml and the *.yml files mentioned above are all the same as what come with Filebeat.

And yes, Logstash is in the mix but I have checked my pipelines and nothing appears to be touching the Sophos traffic so I'm fairly confident that whatever is coming out of Filebeat should be going right into ES. I will double-check this again now...

The potential issue is if you use logstash and you don't use / configure it correctly, it will not call the correct sophos ingest pipeline defined by the filebeat module so the data will not be parsed.

So I am not sure if you are using the filebeat sophos module or trying to parse in logstash... those are 2 completely different approaches...

For the sophos module there is no wifi.yml etc., so I have no clue what all this is:

(i.e. pipeline.yml , firewall.yml , wifi.yml , event.yml , etc. - 12 in all)

except pipeline.yml... that is a logstash...

So it sounds like you have a complex logstash setup... not sure if you built it or "downloaded it". You do not need logstash to parse the sophos logs...

I always suggest going directly from filebeat to Elasticsearch first using the module and making sure it all works. Once it does, introduce logstash into the mix. Then logstash can just be a pass-through and you can use the built-in filebeat module.

That's my suggestion... do the quickstart directly to Elasticsearch and get it working, then introduce logstash... otherwise there are too many variables.

Also, if you did not run filebeat setup after enabling the module, it will not work either.

If you are using logstash with a filebeat module You need to read this

Plus if your logstash does not have an output that looks something like this... it will never work.

output {
  if [@metadata][pipeline] {
    elasticsearch {
      hosts => ""
      manage_template => false
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
      pipeline => "%{[@metadata][pipeline]}"
      user => "elastic"
      password => "secret"
    }
  } else {
    elasticsearch {
      hosts => ""
      manage_template => false
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
      user => "elastic"
      password => "secret"
    }
  }
}
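
The if/else in that output is the important part, so here is the same decision as a small Python sketch. This is only an illustration of the logic, not how logstash works internally; the function name, the metadata values, and the pipeline name are all hypothetical, and the date suffix of the index is left out.

```python
def elasticsearch_options(metadata):
    """Mimic the if/else above: pass 'pipeline' along only when the event has one."""
    options = {
        "manage_template": False,
        # Simplified index name; the date suffix is left out of this sketch.
        "index": f"{metadata['beat']}-{metadata['version']}",
    }
    if "pipeline" in metadata:
        # Without this setting Elasticsearch never runs the module's ingest
        # pipeline, so the message field stays unparsed.
        options["pipeline"] = metadata["pipeline"]
    return options


# Hypothetical metadata values for illustration.
with_pipeline = elasticsearch_options(
    {"beat": "filebeat", "version": "7.15.0", "pipeline": "example-sophos-pipeline"}
)
without_pipeline = elasticsearch_options({"beat": "filebeat", "version": "7.15.0"})
print(with_pipeline)
print(without_pipeline)
```

If the output never forwards the pipeline name, documents are still indexed, just without being parsed.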

If you are using some 3rd party / other "logstash stuff" package etc...etc... I probably can not help much.

You will need to post ALL your configs, filebeat, logstash etc.. etc.. otherwise no one will be able to help.

You said you were new... I am trying to tell you the easiest way... which is

  • Using the filebeat module
  • Set up directly to Elasticsearch and get it working
  • Then if you have to use logstash ... configure filebeat and logstash to use logstash
    as a passthrough

That's great information @stephenb - thank you so much. Looks like there's a bit more for me to research so I'll go continue my digging and return with any updates and or questions.

Thanks again - very much appreciated...


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.