XML through ingest attachment only showing values in attachment.content, not full tree


(Anders Ekenstierna) #1

Hi.

We're starting to use Elastic for logging in our integration platform. When we send documents through the ingest-attachment pipeline, the attachment.content field contains only the element values rather than the entire XML tree. Is there a way to solve this, or is that simply how the attachment processor parses XML? When the payload is JSON, the entire tree is present, not just the values. Since we are only the intermediary, we have no control over what is sent in the payload.

Processor definition:

    {
      "description": "Extract payload information",
      "processors": [
        {
          "attachment": {
            "field": "payload",
            "target_field": "attachment"
          }
        }
      ]
    }

Message to log:

    {
    	"timestamp": "2018-02-22T09:57:00.654+01:00",
    	"brokername": "TESTNODE",
    	"executiongroup": "default",
    	"appname": "TEST_APPLICATION",
    	"flowname": "TEST_APPLICATION_SendToElastic",
    	"loglevel": "INFO",
    	"logtext": "testing log functionality",
    	"metadata": "test",
    	"exceptionList": "",
    	"payload": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCjxDcmVhdGVSZXF1ZXN0IHhtbG5zPSJodHRwOi8vdGVzdC5vcmcvVGVzdC9DcmVhdGVSZXF1ZXN0LzEuMCI+DQogICAgPEhlYWRpbmc+aGVhZGluZ1N0cmluZzwvSGVhZGluZz4NCiAgICA8RGVzY3JpcHRpb24+ZGVzY3JpcHRpb25TdHJpbmc8L0Rlc2NyaXB0aW9uPg0KICAgIDxQcmlvcml0eT5wcmlvcml0eVN0cmluZzwvUHJpb3JpdHk+DQogICAgPFN0YXR1cz5zdGF0dXNTdHJpbmc8L1N0YXR1cz4NCjwvQ3JlYXRlUmVxdWVzdD4="
    }

Resulting JSON in Elastic:

    {
      "_index": "pipeline-test",
      "_type": "logs",
      "_id": "nim7vGEBEcr_2DAurAfm",
      "_version": 1,
      "_score": null,
      "_source": {
        "metadata": "test",
        "brokername": "TESTNODE",
        "exceptionList": "",
        "executiongroup": "default",
        "appname": "TEST_APPLICATION",
        "attachment": {
          "content_type": "application/xml",
          "language": "no",
          "content": "headingString\n     descriptionString\n     priorityString\n     statusString",
          "content_length": 83
        },
        "payload": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCjxDcmVhdGVSZXF1ZXN0IHhtbG5zPSJodHRwOi8vdGVzdC5vcmcvVGVzdC9DcmVhdGVSZXF1ZXN0LzEuMCI+DQogICAgPEhlYWRpbmc+aGVhZGluZ1N0cmluZzwvSGVhZGluZz4NCiAgICA8RGVzY3JpcHRpb24+ZGVzY3JpcHRpb25TdHJpbmc8L0Rlc2NyaXB0aW9uPg0KICAgIDxQcmlvcml0eT5wcmlvcml0eVN0cmluZzwvUHJpb3JpdHk+DQogICAgPFN0YXR1cz5zdGF0dXNTdHJpbmc8L1N0YXR1cz4NCjwvQ3JlYXRlUmVxdWVzdD4=",
        "loglevel": "INFO",
        "flowname": "TEST_APPLICATION_SendToElastic",
        "logtext": "testing log functionality",
        "timestamp": "2018-02-22T09:57:00.654+01:00"
      },
      "fields": {
        "timestamp": [
          "2018-02-22T08:57:00.654Z"
        ]
      },
      "sort": [
        1519289820654
      ]
    }

As you can see above the resulting JSON only has values in the content:
"content": "headingString\n descriptionString\n priorityString\n statusString"

I tried setting content_type to "text/plain" with the set processor to avoid the parsing, but that didn't help.


(David Pilato) #2

Please format your code, logs or configuration files using the </> icon, as explained in this guide, rather than the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

As for your question: this is the way this plugin works. It is meant to index whatever binary files you have (doc, pdf, xml, txt...) and does text extraction and metadata extraction, so for an XML file it keeps the text content and drops the markup.

If your documents are all XML files then you can use something like Logstash to parse them and generate JSON documents.
FSCrawler might help as well. See https://github.com/dadoonet/fscrawler#indexing-xml-docs
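
As a rough illustration of the "parse the XML and generate JSON" idea, here is a minimal Python sketch that could run on the client side before the document is sent to Elasticsearch. It is not what Logstash or FSCrawler do internally. It assumes the payload is base64-encoded UTF-8 text, as in the example message, and the field names `payload_raw` and `payload_xml` as well as the helper functions are hypothetical names chosen for this example. Non-XML payloads are left with only the raw text.

```python
import base64
import xml.etree.ElementTree as ET

# Shortened stand-in for the XML payload from the original post.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\r\n'
    '<CreateRequest xmlns="http://test.org/Test/CreateRequest/1.0">\r\n'
    '    <Heading>headingString</Heading>\r\n'
    '    <Description>descriptionString</Description>\r\n'
    '</CreateRequest>'
)


def strip_namespace(tag):
    """Turn '{http://ns}Heading' into 'Heading'."""
    return tag.split('}', 1)[-1]


def element_to_dict(elem):
    """Convert an element into nested dicts; leaf elements become strings.

    Attributes and mixed content are ignored in this sketch.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        key = strip_namespace(child.tag)
        value = element_to_dict(child)
        if key in result:
            # Repeated sibling tags are collected into a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result


def enrich_log_event(event):
    """Return a copy of the event with the decoded payload added.

    'payload_raw' and 'payload_xml' are hypothetical field names.
    """
    raw = base64.b64decode(event["payload"])
    enriched = dict(event)
    # Assumption: the payload is text; undecodable bytes are replaced.
    enriched["payload_raw"] = raw.decode("utf-8", errors="replace")
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        return enriched  # not XML: keep only the raw text
    enriched["payload_xml"] = {strip_namespace(root.tag): element_to_dict(root)}
    return enriched


event = {
    "loglevel": "INFO",
    "payload": base64.b64encode(SAMPLE_XML.encode("utf-8")).decode("ascii"),
}
doc = enrich_log_event(event)
```

With this approach `payload_raw` keeps the full tree as a string, while `payload_xml` exposes the elements as structured fields. Because the payloads are not under your control, indexing `payload_xml` as-is may create many different field mappings, so storing only the raw string may be the safer choice.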


(Anders Ekenstierna) #3

Thanks for the swift reply. I did in fact try to use the </> icon, but it didn't do the trick for me. Markdown style worked better, though.

