Ingest Pipeline KV Processor

Hi Team,

I am using the Custom Logs integration to receive data via Elastic Agent (ELK version 8.13.4). The data looks something like this:

"message": " URL : https://www.google.com\n Action Type : Response Received\n RequestDateTime : 9/24/2024 18:06:15\n ResponseDateTime : 9/24/2024 18:06:15\n ErrorCode : \n Message : \n Stage : Response Came From Web Request\n ErrLog Generated_on : 2024/09/24 06:06:15:718\n\n TrxRef.No : 111111111\n InputXML : XXXXXXXXXXX\n ResponseReceivedData : AN is mandatory0001\n---------------------------------END LOG---------------------------------------------------"

I am trying to split the fields on "\n" (newline) and the values on ":". The following kv filter is being used, but it is not working.

kv {
  source => "message"
  field_split => "\n"
  value_split => ":"
  trim_key => " "
  trim_value => " "
}

It appears that the field split does not work correctly. As a workaround, I tried replacing \n with another character such as # and splitting on that, with no success. I also tried escaping the newline character as "\n" and "\\n", but the data is still not split as expected.


Since the field split does not happen correctly, the value split puts all the data under the first key.
The same kv filter works fine when tested in Logstash. Can someone please help me?

The configuration you shared is for the Logstash kv filter, not for the Elasticsearch ingest pipeline kv processor.

Can you share the processor you are using in the ingest pipeline?

Hi @leandrojmp ,
Please check the screenshot.
I am using the kv processor with field_split "\n" and value_split ":".

PUT _ingest/pipeline/y
{
  "processors": [
    {
      "kv": {
        "field": "message",
        "field_split": "\\\n",
        "value_split": ":",
        "trim_key": " ",
        "trim_value": " "
      }
    }
  ]
}

Can you share what your document looks like in Kibana?

You are telling the kv processor to work on the message field, but in the screenshot you shared it seems that your message is in a field named URL.

Please share what your message looks like so we can confirm which field is the correct one.

Hi @leandrojmp ,
Sorry for the confusion.
The message field is as:

"message": " URL : https://www.google.com\n Action Type : Response Received\n RequestDateTime : 9/24/2024 18:06:15\n ResponseDateTime : 9/24/2024 18:06:15\n ErrorCode : \n Message : \n Stage : Response Came From Web Request\n ErrLog Generated_on : 2024/09/24 06:06:15:718\n\n TrxRef.No : 111111111\n InputXML : XXXXXXXXXXX\n ResponseReceivedData : AN is mandatory0001 \n---------------------------------END LOG ---------------------------------------------------"

The screenshot shows the output after the ingest pipeline has processed the data. As you can see, all the data is being added to the URL key, because the split on the newline did not happen. Based on the configuration, the kv processor should first split the message field on the newline character and then split each part into a key-value pair on the : character.

Hi @Ankita_Pachauri

You have a couple of things going on....

Look at this....

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "kv": {
          "field": "message",
          "field_split": "\\\n",
          "value_split": ":",
          "trim_key":  " ",
          "trim_value": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source" : {
        "message": " URL : https://www.google.com\n Action Type : Response Received\n RequestDateTime : 9/24/2024 18:06:15\n ResponseDateTime : 9/24/2024 18:06:15\n ErrorCode : \n Message : \n Stage : Response Came From Web Request\n ErrLog Generated_on : 2024/09/24 06:06:15:718\n\n TrxRef.No : 111111111\n InputXML : XXXXXXXXXXX\n ResponseReceivedData : AN is mandatory0001\n---------------------------------END LOG---------------------------------------------------"
      }
    },
    {
      "_source" : {
        "message": " URL : https://www.google.com\n Action Type : Response Received\n RequestDateTime : 9/24/2024 18:06:15\n ResponseDateTime : 9/24/2024 18:06:15\n ErrorCode : \n Message : \n Stage : Response Came From Web Request\n ErrLog Generated_on : 2024/09/24 06:06:15:718\n TrxRef.No : 111111111\n InputXML : XXXXXXXXXXX\n ResponseReceivedData : AN is mandatory0001\n"
      }
    }
  ]
}

And the result...

{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_argument_exception",
            "reason": "field [message] does not contain value_split [:]"
          }
        ],
        "type": "illegal_argument_exception",
        "reason": "field [message] does not contain value_split [:]"
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "ResponseDateTime": "9/24/2024 18:06:15",
          "Message": "",
          "InputXML": "XXXXXXXXXXX",
          "ErrLog Generated_on": "2024/09/24 06:06:15:718",
          "message": """ URL : https://www.google.com
 Action Type : Response Received
 RequestDateTime : 9/24/2024 18:06:15
 ResponseDateTime : 9/24/2024 18:06:15
 ErrorCode : 
 Message : 
 Stage : Response Came From Web Request
 ErrLog Generated_on : 2024/09/24 06:06:15:718
 TrxRef.No : 111111111
 InputXML : XXXXXXXXXXX
 ResponseReceivedData : AN is mandatory0001
""",
          "TrxRef": {
            "No": "111111111"
          },
          "ResponseReceivedData": "AN is mandatory0001",
          "Action Type": "Response Received",
          "URL": "https://www.google.com",
          "RequestDateTime": "9/24/2024 18:06:15",
          "Stage": "Response Came From Web Request",
          "ErrorCode": ""
        },
        "_ingest": {
          "timestamp": "2024-09-27T16:24:23.001941814Z"
        }
      }
    }
  ]
}

So a couple of things....

First, the text at the end:

---------------------------------END LOG---------------------------------------------------

does not contain a :, so it cannot be split and the processor fails.

Also, in the middle of the message:

ErrLog Generated_on : 2024/09/24 06:06:15:718\n\n
has two consecutive \n characters, which leaves an empty segment with no :, so that also fails. I manually removed the extra \n in the second document of the example above.
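
To make the two failure modes easier to see outside Elasticsearch, here is a rough Python sketch of a kv-style split. It is only an approximation for illustration, not the processor's actual code: the function name kv_split is made up, dotted keys such as TrxRef.No are left flat instead of being nested, and dropping trailing empty segments is an assumption inferred from the second test document above, which ends in \n and still parses.

```python
def kv_split(message, field_split="\n", value_split=":", trim=" "):
    """Toy approximation of a kv-style split; NOT the Elasticsearch code."""
    parts = message.split(field_split)
    # Assumption: trailing empty segments are ignored, since the second
    # test document above ends in "\n" and still parses.
    while parts and parts[-1] == "":
        parts.pop()

    fields = {}
    for part in parts:
        if value_split not in part:
            # Same kind of failure that _simulate reported above.
            raise ValueError(
                f"segment {part!r} does not contain value_split [{value_split}]"
            )
        key, value = part.split(value_split, 1)  # split on the first ":" only
        fields[key.strip(trim)] = value.strip(trim)
    return fields


# Double newline, footer without ":", and a clean message with a trailing newline.
for sample in ("A : 1\n\n B : 2", "A : 1\n---END LOG---", "A : 1\n B : 2\n"):
    try:
        print(kv_split(sample))
    except ValueError as err:
        print("fails:", err)
```

The first two samples fail for the reasons described above (an empty segment and a footer line with no :), while the third one parses.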

You will need to clean that up with some processing ahead of time...

Not pretty but this works...

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\\n\\\n",
          "replacement": "\\\n"
        }
      },
      {
        "gsub": {
          "field": "message",
          "pattern": "---------------------------------END LOG---------------------------------------------------",
          "replacement": ""
        }
      },
      {
        "kv": {
          "field": "message",
          "field_split": "\\\n",
          "value_split": ":",
          "trim_key":  " ",
          "trim_value": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source" : {
        "message": " URL : https://www.google.com\n Action Type : Response Received\n RequestDateTime : 9/24/2024 18:06:15\n ResponseDateTime : 9/24/2024 18:06:15\n ErrorCode : \n Message : \n Stage : Response Came From Web Request\n ErrLog Generated_on : 2024/09/24 06:06:15:718\n\n TrxRef.No : 111111111\n InputXML : XXXXXXXXXXX\n ResponseReceivedData : AN is mandatory0001\n---------------------------------END LOG---------------------------------------------------"
      }
    }
  ]
}

Result

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "ResponseDateTime": "9/24/2024 18:06:15",
          "Message": "",
          "InputXML": "XXXXXXXXXXX",
          "ErrLog Generated_on": "2024/09/24 06:06:15:718",
          "message": """ URL : https://www.google.com
 Action Type : Response Received
 RequestDateTime : 9/24/2024 18:06:15
 ResponseDateTime : 9/24/2024 18:06:15
 ErrorCode : 
 Message : 
 Stage : Response Came From Web Request
 ErrLog Generated_on : 2024/09/24 06:06:15:718
 TrxRef.No : 111111111
 InputXML : XXXXXXXXXXX
 ResponseReceivedData : AN is mandatory0001
""",
          "TrxRef": {
            "No": "111111111"
          },
          "ResponseReceivedData": "AN is mandatory0001",
          "Action Type": "Response Received",
          "URL": "https://www.google.com",
          "RequestDateTime": "9/24/2024 18:06:15",
          "Stage": "Response Came From Web Request",
          "ErrorCode": ""
        },
        "_ingest": {
          "timestamp": "2024-09-27T16:31:03.335552899Z"
        }
      }
    }
  ]
}
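
If you want to try the clean-up logic offline before touching the pipeline, the same idea (collapse repeated newlines, drop the END LOG footer, then split) can be sketched in Python. This is a minimal sketch under assumptions, not what Elasticsearch runs: the names END_MARKER, clean and kv_split are made up, and the footer regex is deliberately looser than the literal string in the gsub above, because the second copy of the message posted earlier has a space after END LOG that the literal pattern would presumably not match.

```python
import re

# Hypothetical, looser pattern than the literal string used in the gsub above:
# it also tolerates spaces around "END LOG".
END_MARKER = re.compile(r"-+\s*END LOG\s*-+")


def clean(message):
    message = re.sub(r"\n{2,}", "\n", message)  # collapse runs of newlines
    return END_MARKER.sub("", message)          # drop the footer line


def kv_split(message):
    fields = {}
    for part in clean(message).split("\n"):
        if not part.strip():
            continue  # assumption: skip empty segments left after cleanup
        # Raises ValueError if a segment still has no ":" in it.
        key, value = part.split(":", 1)
        # strip() removes any surrounding whitespace, not only spaces.
        fields[key.strip()] = value.strip()
    return fields


sample = (
    " URL : https://www.google.com\n"
    " ErrLog Generated_on : 2024/09/24 06:06:15:718\n\n"
    " TrxRef.No : 111111111\n"
    " ResponseReceivedData : AN is mandatory0001 \n"
    "---------------------------------END LOG ------------------"
)
print(kv_split(sample))
```

Unlike the real processor output above, this sketch keeps TrxRef.No as a flat key rather than expanding it into a nested object.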

Hi @stephenb,
Many many thanks for your support. Really appreciate your prompt response. The issue got resolved.

One more small question: I checked the documentation for 8.13.4 and found that there is no XML processor. As you can see from our data, it also contains some XML. Is there any way to parse it?

Hi @Ankita_Pachauri

Unfortunately, there is no XML ingest processor.

Logstash has an XML filter; perhaps you can look at that.

@leandrojmp and @stephenb.
Many thanks for your support.