XML through ingest attachment only showing values in attachment.content, not full tree

ekenstiernan · February 22, 2018, 9:07am

Hi.

We're in the process of starting to use Elastic for logging in our integration platform. When we send documents to Elastic through the ingest-attachment pipeline, the attachment.content field only shows the values of the elements instead of the entire XML tree. Is there any way of solving this, or is that just the way the attachment processor parses XML? When we send JSON in the payload, the entire tree is present, not just the values. Since we are just the intermediary here, we are not in control of what is sent in the payload.

Processor definition:

    {
      "description": "Extract payload information",
      "processors": [
        {
          "attachment": {
            "field": "payload",
            "target_field": "attachment"
          }
        }
      ]
    }

Message to log:

    {
    	"timestamp": "2018-02-22T09:57:00.654+01:00",
    	"brokername": "TESTNODE",
    	"executiongroup": "default",
    	"appname": "TEST_APPLICATION",
    	"flowname": "TEST_APPLICATION_SendToElastic",
    	"loglevel": "INFO",
    	"logtext": "testing log functionality",
    	"metadata": "test",
    	"exceptionList": "",
    	"payload": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCjxDcmVhdGVSZXF1ZXN0IHhtbG5zPSJodHRwOi8vdGVzdC5vcmcvVGVzdC9DcmVhdGVSZXF1ZXN0LzEuMCI+DQogICAgPEhlYWRpbmc+aGVhZGluZ1N0cmluZzwvSGVhZGluZz4NCiAgICA8RGVzY3JpcHRpb24+ZGVzY3JpcHRpb25TdHJpbmc8L0Rlc2NyaXB0aW9uPg0KICAgIDxQcmlvcml0eT5wcmlvcml0eVN0cmluZzwvUHJpb3JpdHk+DQogICAgPFN0YXR1cz5zdGF0dXNTdHJpbmc8L1N0YXR1cz4NCjwvQ3JlYXRlUmVxdWVzdD4="
    }

Resulting JSON in Elastic:

    {
      "_index": "pipeline-test",
      "_type": "logs",
      "_id": "nim7vGEBEcr_2DAurAfm",
      "_version": 1,
      "_score": null,
      "_source": {
        "metadata": "test",
        "brokername": "TESTNODE",
        "exceptionList": "",
        "executiongroup": "default",
        "appname": "TEST_APPLICATION",
        "attachment": {
          "content_type": "application/xml",
          "language": "no",
          "content": "headingString\n     descriptionString\n     priorityString\n     statusString",
          "content_length": 83
        },
        "payload": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCjxDcmVhdGVSZXF1ZXN0IHhtbG5zPSJodHRwOi8vdGVzdC5vcmcvVGVzdC9DcmVhdGVSZXF1ZXN0LzEuMCI+DQogICAgPEhlYWRpbmc+aGVhZGluZ1N0cmluZzwvSGVhZGluZz4NCiAgICA8RGVzY3JpcHRpb24+ZGVzY3JpcHRpb25TdHJpbmc8L0Rlc2NyaXB0aW9uPg0KICAgIDxQcmlvcml0eT5wcmlvcml0eVN0cmluZzwvUHJpb3JpdHk+DQogICAgPFN0YXR1cz5zdGF0dXNTdHJpbmc8L1N0YXR1cz4NCjwvQ3JlYXRlUmVxdWVzdD4=",
        "loglevel": "INFO",
        "flowname": "TEST_APPLICATION_SendToElastic",
        "logtext": "testing log functionality",
        "timestamp": "2018-02-22T09:57:00.654+01:00"
      },
      "fields": {
        "timestamp": [
          "2018-02-22T08:57:00.654Z"
        ]
      },
      "sort": [
        1519289820654
      ]
    }

As you can see above, the resulting JSON only has the element values in attachment.content:

    "content": "headingString\n     descriptionString\n     priorityString\n     statusString"

I tried setting the content_type to "text/plain" with the set processor to avoid the parsing, but that didn't help.

dadoonet · February 22, 2018, 9:22am

Please format your code, logs or configuration files using the </> icon, as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

About your question, well, this is the way this plugin works. It is meant to index whatever binary files you have: doc, pdf, xml, txt... It does text extraction and metadata extraction.

If your documents are all XML files then you can use something like Logstash to parse them and generate JSON documents.
FSCrawler might help as well. See https://github.com/dadoonet/fscrawler#indexing-xml-docs
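
If neither Logstash nor FSCrawler fits, another option is to decode the payload on the client side before indexing, so the raw XML is kept next to a parsed version. The following is only a rough sketch of that idea, not something the attachment plugin does for you. The function names and the `payload_xml` / `payload_parsed` field names are made up for illustration, the payload is assumed to be base64-encoded UTF-8 XML like in the example above, and repeated child elements with the same tag would overwrite each other in this simple version.

```python
import base64
import json
import xml.etree.ElementTree as ET


def strip_ns(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag name."""
    return tag.split("}", 1)[-1]


def element_to_dict(elem):
    """Turn an element into nested dicts; leaf elements become their text.

    Simplification: attributes are ignored and repeated sibling tags
    overwrite each other.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    return {strip_ns(child.tag): element_to_dict(child) for child in children}


def payload_to_fields(payload_b64):
    """Decode a base64 payload and return raw and parsed XML fields.

    'payload_xml' and 'payload_parsed' are hypothetical field names.
    """
    raw = base64.b64decode(payload_b64)
    root = ET.fromstring(raw)  # bytes input, so the XML declaration is honored
    return {
        "payload_xml": raw.decode("utf-8"),  # assumes UTF-8 content
        "payload_parsed": {strip_ns(root.tag): element_to_dict(root)},
    }


# Sample built from the XML in the question, encoded the same way as "payload".
sample_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\r\n'
    '<CreateRequest xmlns="http://test.org/Test/CreateRequest/1.0">\r\n'
    "    <Heading>headingString</Heading>\r\n"
    "    <Description>descriptionString</Description>\r\n"
    "    <Priority>priorityString</Priority>\r\n"
    "    <Status>statusString</Status>\r\n"
    "</CreateRequest>"
)
sample_payload = base64.b64encode(sample_xml.encode("utf-8")).decode("ascii")

fields = payload_to_fields(sample_payload)
print(json.dumps(fields["payload_parsed"], indent=2))
```

The resulting fields could then be merged into the log document before it is sent to Elasticsearch, in place of (or in addition to) the attachment pipeline.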

ekenstiernan · February 22, 2018, 11:29am

Thanks for the swift reply. I did in fact try to use the </> icon, but it didn't do the trick for me. Markdown style worked better, though.

system · March 22, 2018, 11:29am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
