Struggling with filebeat and elasticsearch ingest pipeline for processing custom log file that has weird date/time format

I'm struggling with using an Elasticsearch ingest pipeline for custom parsing of a unique flat-file log format, and I'm running into problems. I'm not sure if there's a mix of issues here.

Here's the scenario...

The format uses multi-line messages, and the date/time format is problematic:
YYYY.MM.dd HH:mm:ss.SSS

I'm trying to use the log date/time as the timestamp of the document in the index.

Symptoms of the problem:

  1. Partial ingest of the file (it seems to stop after empty lines). It's unclear whether this is a bug with Filebeat tailing on Windows Server, or an issue with the setup of the pipeline and/or the Filebeat config. For example, only around 40 records from one log file were ingested into Elasticsearch, even though the actual file on the filesystem has over 10,000 lines.
  2. Errors in the Filebeat log...
    {"type":"illegal_argument_exception","reason":"Provided Grok expressions do not match field value: [2020.08.04 21:56:09.478 (cc-123): REDACTED\n2020.08.04 21:56:09.478 (cc-123): REDACTED"}

Here are the settings in my filebeat.yml config...

filebeat.inputs:
- type: log
  enabled: true
  tail_files: true
  paths:
  - 'X:/Logs/obj.*.log'
  tags: ["obj"]
  multiline.pattern: '^$'
  multiline.negate: true
  multiline.match: after
  index: "filebeat-obj-%{+yyyy.MM}"
  pipeline: clicklog_redate

Sample data (note: the blank CRLF-only line after the WARNING lines is intentional and comes from the actual log source):

:

2020.08.05 00:32:03.075 (main95): ERROR  *Redacted*
2020.08.05 00:32:03.075 (main95): ERROR: *Redacted*HttpHandlerEXCEPTION: Exception
                ================================
                Type: System.ArgumentException   
                *redacted*
                ================================
2020.08.05 00:32:07.357 (cc-154): WARNING: *Redacted*
2020.08.05 00:32:07.357 (cc-154): WARNING: *Redacted*.ToJsonString because no overloads accept 0 arguments.

2020.08.05 00:32:11.503 (cc-178): [6b213e25-8ef8-488d-ad48-0b22c432032e] *Redacted*: Handling GetDocumentInfoTx for document "94AB257DF1EF824B9109A7BBA0B5BAD0"
2020.08.05 00:32:12.596 (cc-191): [dc214da0-d52c-4e33-b79c-61ef331c32cf] *Redacted*

:

Here's my clicklog_redate ingest pipeline definition:

[
  {
    "grok": {
      "field": "message",
      "patterns": [
        "%{WOMTIMESTAMP:logtime}%{SPACE}\\(%{WORD:process}\\)\\:%{GREEDYDATA:message}"
      ],
      "pattern_definitions": {
        "WOMTIMESTAMP": "%{YEAR}%{NOTSPACE}%{MONTHNUM2}%{NOTSPACE}%{MONTHDAY}%{SPACE}%{TIME}%{NOTSPACE}%{INT}"
      }
    },
    "date": {
      "field": "logtime",
      "target_field": "@timestamp",
      "formats": [
        "YYYY.MM.dd HH:mm:ss.SSS",
        "ISO8601"
      ]
    }
  }
]

Any insights from experienced Elasticsearch pipeline admins would be much appreciated! I could also use pointers on basic debugging: how can I get information about the pipeline processing itself, other than looking at the Filebeat log on one end and the (partial) output within Elasticsearch on the other?

I can't see any issue with your ingest pipeline on the timestamp front:

PUT _ingest/pipeline/discuss-test
{
  "description": "https://discuss.elastic.co/t/struggling-with-filebeat-and-elasticsearch-ingest-pipeline-for-processing-custom-log-file-that-has-weird-date-time-format/243941",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{WOMTIMESTAMP:logtime}%{SPACE}\\(%{WORD:process}\\)\\:%{GREEDYDATA:message}"
        ],
        "pattern_definitions": {
          "WOMTIMESTAMP": "%{YEAR}%{NOTSPACE}%{MONTHNUM2}%{NOTSPACE}%{MONTHDAY}%{SPACE}%{TIME}%{NOTSPACE}%{INT}"
        }
      },
      "date": {
        "field": "logtime",
        "target_field": "@timestamp",
        "formats": [
          "YYYY.MM.dd HH:mm:ss.SSS",
          "ISO8601"
        ]
      }
    }
  ]
}

PUT discuss-test/_doc/1?pipeline=discuss-test
{
  "message": "2020.08.05 00:32:03.075 (main95): ERROR  *Redacted*"
}

And then a GET discuss-test/_doc/1 gives me:

{
  "_index" : "discuss-test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "process" : "main95",
    "@timestamp" : "2020-08-05T00:32:03.075Z",
    "message" : " ERROR  *Redacted*",
    "logtime" : "2020.08.05 00:32:03.075"
  }
}

This Filebeat config seemed to work for me:

filebeat.inputs:
- type: log
  enabled: true
  paths:
  - '/path/to/*.log'
  multiline:
    type: pattern
    pattern: '^[0-9]{4}.[0-9]{2}.[0-9]{2}'
    negate: true
    match: after

output.console:
  pretty: true

I think the ^$ was a little too broad in its capture: with negate: true and match: after, every non-blank line gets appended to the previous event, so everything between blank lines is merged into a single event. That's also why your grok error shows two timestamped records concatenated into one message field.
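Untested against your exact setup, but merging that multiline block into your original input would look something like this, keeping your paths, tags, index, and pipeline settings. I've escaped the dots in the pattern so they only match literal periods, and dropped tail_files (which, if I read the docs right, starts reading new files at the end and can skip existing content):

filebeat.inputs:
- type: log
  enabled: true
  paths:
  - 'X:/Logs/obj.*.log'
  tags: ["obj"]
  multiline:
    type: pattern
    pattern: '^[0-9]{4}\.[0-9]{2}\.[0-9]{2}'  # a new record starts with a YYYY.MM.dd timestamp
    negate: true
    match: after
  index: "filebeat-obj-%{+yyyy.MM}"
  pipeline: clicklog_redate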

Thanks for the response, Mark, and for validating the config of my pipeline setup.

I noticed that your test data only tries a single line, but the files contain multi-line entries (some with empty lines).

I'm trying the one thing that appears different in your config:

  multiline:
    type: pattern
    pattern: '^[0-9]{4}.[0-9]{2}.[0-9]{2}'
    negate: true
    match: after

I'm still seeing only partial ingest of the file.

There are many errors of this sort in the Filebeat log (this is an actual example from the Filebeat log):

{"type":"illegal_argument_exception","reason":"Provided Grok expressions do not match field value: [2020.08.19 11:04:21.032 (cc-11): [bcac3929-a3a9-4916-80bb-5196b83d1952] WopiController.RunWopiRequestInClickTx: Handling GetDocumentInfoTx for document \"D7AB650CDB3C11EA6484523796565000\"]"}

The related line from the file being ingested:
2020.08.19 11:04:21.032 (cc-11): [bcac3929-a3a9-4916-80bb-5196b83d1952] WopiController.RunWopiRequestInClickTx: Handling GetDocumentInfoTx for document "D7AB650CDB3C11EA6484523796565000"

Hi All,

I'm new to the platform, so forgive me if I'm commenting on an irrelevant post. I'm trying to build a custom Filebeat module, but the docs seem a little unspecific to me:
https://www.elastic.co/guide/en/beats/devguide/current/filebeat-modules-devguide.html

I've gotten as far as creating the module and the fileset, and writing out a log/ingest/pipeline.json.

However, when I try to run make create-fields I keep hitting stumbling blocks, e.g.:

mage generate:fields
Error: cannot read pipeline: invalid character ']' looking for beginning of value
make: *** [create-fields] Error 1

Any further guidance would be greatly appreciated. 

Please post your own topic for this question.

@warkolm I used the handy built-in console in Dev Tools to simulate ingesting data into the clicklog_redate pipeline, which has been failing with "illegal_argument_exception" (Grok expression not matching the field value).

POST _ingest/pipeline/clicklog_redate/_simulate
{
  "docs":[
    {
      "_source":{
        "message":"2020.08.19 11:04:21.000 (cc-11): [bcac3929-a3a9-4916-80bb-5196b83d1952]"
      }
    }
    ]
} 
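The response comes back with the same error attached to the document, something like this (abridged):

{
  "docs": [
    {
      "error": {
        "type": "illegal_argument_exception",
        "reason": "Provided Grok expressions do not match field value: [2020.08.19 11:04:21.000 (cc-11): [bcac3929-a3a9-4916-80bb-5196b83d1952]]"
      }
    }
  ]
}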

I found that the "cc-11" values, which I'm capturing as the "process" field with \(%{WORD:process}\)\:, are thrown off by the - character. WORD is defined as \b\w+\b, and \w doesn't include the dash, so the match terminates there.

So, I'm not very grok-savvy, but what would be the solution here? Sometimes there's a - in the process name, sometimes not.

I found that I can use the USERNAME grok pattern instead of WORD. USERNAME matches [a-zA-Z0-9._-]+, so it treats the dash as part of the same token (which it is in this case). So...

SOLUTION: I changed the redate pipeline pattern from this:

"patterns": ["%{WOMTIMESTAMP:logtime}%{SPACE}\\(%{WORD:process}\\)\\:%{GREEDYDATA:message}"]

...to this...

"patterns": ["%{WOMTIMESTAMP:logtime}%{SPACE}\\(%{USERNAME:process}\\)\\:%{GREEDYDATA:message}"]

And it correctly parses the process name, with or without dashes.
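For anyone who wants to verify both variants in one go, something like this in the Dev Tools console should show one dashed and one dash-free process name parsing (messages taken from the samples above):

POST _ingest/pipeline/clicklog_redate/_simulate
{
  "docs": [
    { "_source": { "message": "2020.08.05 00:32:03.075 (main95): ERROR  *Redacted*" } },
    { "_source": { "message": "2020.08.19 11:04:21.032 (cc-11): [bcac3929-a3a9-4916-80bb-5196b83d1952] *Redacted*" } }
  ]
}

One caveat worth checking in the date processor: with the Java-time formats used in Elasticsearch 7+, uppercase YYYY means week-based year, so yyyy.MM.dd HH:mm:ss.SSS is the safer spelling for a calendar year.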

Related note... I really like the _simulate endpoint for testing pipelines!
_ingest/pipeline/clicklog_redate/_simulate
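It also accepts a verbose flag, which reports the output of each processor separately - handy for telling a grok failure apart from a date failure:

POST _ingest/pipeline/clicklog_redate/_simulate?verbose=true
{
  "docs": [
    { "_source": { "message": "2020.08.19 11:04:21.032 (cc-11): *Redacted*" } }
  ]
}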
