Ingesting multiline fields with the Grok processor

Hi

I've cloned the PostgreSQL integration's ingest pipelines and changed their grok patterns to fit our custom PostgreSQL log format.

Almost everything gets parsed correctly with the following grok pattern:

%{POSTGRESQL_REMOTE_HOST:postgresql.remote_host} %{POSTGRESQL_USER:user.name} %{POSTGRESQL_DB_NAME:postgresql.log.database} %{WORD}\[%{NUMBER:process.pid:long}\]: \[%{BASE16FLOAT:postgresql.log.session_line_number:long}\-%{INT}\]%{SPACE}%{WORD:log.level}:  (?:%{POSTGRESQL_ERROR:postgresql.log.sql_state_code}|%{SPACE})(duration: %{NUMBER:temp.duration:float} ms  %{POSTGRESQL_QUERY_STEP}: %{ULTRAGREEDYDATA:postgresql.log.query}|: %{ULTRAGREEDYDATA:tmp.message}|%{ULTRAGREEDYDATA:tmp.message})

However, the greedy-data capture into "postgresql.log.query" is cut off as soon as a newline character is encountered. Here's an example log entry, which contains literal tabs and newlines:

2023-05-09 08:02:03 CEST [local] postgres postgres postgres[375010]: [9-1]   LOG:  duration: 0.182 ms  statement: SELECT * FROM adis
        WHERE navn='Adis Pinjic'
        ;

When I expand the indexed document in the Discover tab and select the JSON output, the message field contains the following:

 "message": "2023-05-09 08:02:03 CEST [local] postgres postgres postgres[375010]: [9-1]   LOG:  duration: 0.182 ms  statement: SELECT * FROM adis\n\tWHERE navn='Adis Pinjic'\n\t;",

Here you can see that postgresql.log.query gets cut off (in the same document):

    "postgresql": {
      "log": {
        "database": "postgres",
        "query": "SELECT * FROM adis",
        "session_line_number": 9,
        "query_step": "statement",
        "timestamp": "2023-05-09 08:02:03 CEST"
      },
      "remote_host": "[local]"
    },

The custom "ULTRAGREEDYDATA" pattern is just a desperate attempt to allow any kind of newline and tab characters. It looks like this:

         "pattern_definitions": {
            "ULTRAGREEDYDATA": """(.|\t|\n|\r|\s)*""", 

The "Custom Logs" integration I use to harvest the postgresql log file has the following multiline configuration in the integration's "Custom configuration":

multiline:
  type: pattern
  pattern: '^\d{4}-\d{2}-\d{2}'
  negate: true
  match: after
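One thing worth ruling out (I'm not certain this applies to your setup): if the integration is using the filestream input rather than the older log input, a top-level `multiline` block is silently ignored, and the settings have to be nested under a `parsers` block instead. A hedged sketch of what that would look like in the Custom configuration field:

```yaml
parsers:
  - multiline:
      type: pattern
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after
```

If the agent never joins the lines, each continuation line reaches the pipeline as its own event, which would also explain a truncated `postgresql.log.query`.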

Have any of you solved a similar problem in your own clusters, or do you have any pointers for how to debug this further? Our whole stack (ES, Kibana, Logstash and the Elastic Agents) is running 8.6.2.

I've tried recreating the issue in the Grok Debugger, but there the grok pattern has no problem parsing the \n and \t text from the message field ( SELECT * FROM adis\n\tWHERE navn='Adis Pinjic'\n\t;). I suspect I don't quite understand how the Elastic Agent actually feeds the multiline data to Elasticsearch.
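To take the Grok Debugger's idealized input out of the equation, you could run the actual installed pipeline against a document whose message contains real embedded newlines, using the simulate API in Dev Tools (the pipeline ID below is a placeholder; pick yours from GET _ingest/pipeline):

```
POST _ingest/pipeline/logs-postgresql.log-custom/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "2023-05-09 08:02:03 CEST [local] postgres postgres postgres[375010]: [9-1]   LOG:  duration: 0.182 ms  statement: SELECT * FROM adis\n\tWHERE navn='Adis Pinjic'\n\t;"
      }
    }
  ]
}
```

If `postgresql.log.query` comes back complete here, the pipeline is fine and the truncation is happening on the agent side before ingest; if it comes back truncated, the problem is in the pipeline after all.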
