Xml filter plugin - creating nested field out of null object: Can't get text on a END_OBJECT

FYI, here is the error seen:

"status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [entry.AppId.raw]", "caused_by"=>{"type"=>"illegal_state_exception", "reason"=>"Can't get text on a END_OBJECT at 1:1416"}}}}, :level=>:warn}

I have been tracking this issue down for a while and I believe I have finally found the root cause. I have an XML document in which some fields are sometimes null and sometimes populated. When a field is populated, the xml filter parses it correctly into its own field, for example:

calling xml filter:

xml { target => entry source => message force_array => false }

example of xml:

<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<AppId>ConsumeAll</AppId>

will result in the following:

"entry": {
  "AppId": "ConsumeAll"
}

This is exactly how it should be and matches the mapping set for the index; if all records were like this, I assume I would not have any issues. However, if the XML element is empty, we hit problems. Instead of becoming a null or an empty string, it is created as an empty object, as if it were the parent of a nested field:

example of xml:

<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<AppId/>

will result in the following:

"entry": {
  "AppId":  {
  
  }
}

At this point indexing fails because Elasticsearch cannot match the empty object to the template. I have tried removing the field with:

mutate{
  remove_field => [ "[entry][AppID]" ]
}

but the filter does not remove the field, I assume because it is a nested field even though it has nothing nested. Any help is greatly appreciated, as I have been beating my head against the wall on this one.

I have tried deleting every field that gets pulled in blank, such as:

"entry": {
  "AppId":  {
  }
}

by doing the following:

mutate{
  remove_field => [ "[entry][AppId]" ]
}

yet it does not get removed. Perhaps this is because AppId is itself now the parent of a nested field, and thus not getting removed? Any ideas how I could remove it? Also, it would be nice if I could find a way to remove it only when it is an empty nested field, and leave it when it is in fact populated.

Are there any conditions I could use to determine when entry.AppId is an empty nested field, and is there a mutate function that will delete it?
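
To make the goal concrete, here is a minimal sketch in plain Ruby of the behavior being asked for: drop a key only when its value is an empty hash, and keep it when it is populated. The helper name `prune_empty_hashes` is hypothetical and is not part of Logstash or the xml filter; it only models the parsed `entry` structure as an ordinary Ruby Hash.

```ruby
# Hypothetical helper (not a Logstash API): recursively drop keys whose
# value is an empty Hash, e.g. "AppId" => {}.
# Strings and other non-Hash values are returned unchanged.
# A Hash that becomes empty after its children are pruned is dropped too.
def prune_empty_hashes(value)
  return value unless value.is_a?(Hash)

  value.each_with_object({}) do |(key, child), result|
    cleaned = prune_empty_hashes(child)
    next if cleaned.is_a?(Hash) && cleaned.empty?
    result[key] = cleaned
  end
end

# Sample data shaped like the parsed event above.
entry = { "AppId" => {}, "Subject" => "Start Test Logging" }
puts prune_empty_hashes(entry).inspect
```

Logic along these lines could presumably be adapted to a ruby filter, but how the event is read and written there depends on the Logstash version, so treat that part as an assumption to verify.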

Hi!

What exact Logstash version are you using? In addition, can you please share the complete LS configuration?

Thanks!

--Gabriel

We are using logstash-2.3.2. The configs are really long and are a compilation of over a dozen files. The config pertaining to this call is isolated to the xml piece, i.e.:

if "app" in [tags]{
    xml {
       target => entry
       source => message
       force_array => false
    }
    date{
       match => [ "[entry][Timestamp]", "YYYY-MM-dd HH:mm:ss.SSS Z", "ISO8601" ]
    }
}

Also, to expand and add the input and output pieces, the resulting config should be as simple as:

input{
 beats{
   port => 5044
 }
}
filter{
if "app" in [tags]{
    xml {
       target => entry
       source => message
       force_array => false
    }
    date{
       match => [ "[entry][Timestamp]", "YYYY-MM-dd HH:mm:ss.SSS Z", "ISO8601" ]
    }
}
}
output{
 if "app" in [tags]{
   elasticsearch {
     hosts => ["es01", "es02"]
     workers => 16
     index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
   }
 }
}

Hi!

So I think that I might be missing something. I am testing the following, which should be similar to what you are doing:

input{

	stdin{}
}
filter{
   
   xml {
       target => entry
       source => message
       force_array => false
    }
    date{
       match => [ "[entry][Timestamp]", "YYYY-MM-dd HH:mm:ss.SSS Z", "ISO8601" ]
    }
}

output{
	stdout{codec=>rubydebug}
}

The result of this is the following:

<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<AppId>ConsumeAll</AppId>

{
       "message" => "<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>\\n<AppId>ConsumeAll</AppId>",
      "@version" => "1",
    "@timestamp" => "2016-08-03T17:44:28.578Z",
          "host" => "Gabriels-MacBook-Pro.local",
         "entry" => "ConsumeAll"
}


<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<AppId/>

{
       "message" => "<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>\\n<AppId/>",
      "@version" => "1",
    "@timestamp" => "2016-08-03T17:44:35.114Z",
          "host" => "Gabriels-MacBook-Pro.local",
         "entry" => {}
}

By the way, if I add the rename and remove_field mutate filters shown after this output, the field inside entry is removed:

{
       "message" => "<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>\\n<AppId/>",
      "@version" => "1",
    "@timestamp" => "2016-08-03T18:00:05.140Z",
          "host" => "Gabriels-MacBook-Pro.local",
         "entry" => {}
}

I just tried to force my document to contain that field with the following:

   xml {
       target => entry
       source => message
       force_array => false
    }

    mutate {
    	rename => ['entry', '[entry][AppID]']
    }

	mutate { 
		remove_field => [ "[entry][AppID]" ] 
	}

So it's working on my side; however, I am sure that we are missing a single thing that is causing this.

Thanks!

--Gabriel

I did not provide the full XML since I was hopeful it would be a known issue and I was just missing something. Try this XML pulled from the filebeat output:

"@timestamp": "2016-08-03T18:02:11.219Z",
  "beat": {
    "hostname": "app01",
    "name": "app01"
  },
  "count": 1,
  "input_type": "log",
  "message": "2016-08-03 13:54:32,084 INFO  app.Test.Logstash [Job0] - \u003c?xml version=\"1.0\" encoding=\"UTF-8\"?\u003e\n\u003cns0:AppLogTest xmlns:ns0=\"http://log.internal.pri/log/Namespaces/Interface.xsd\"\u003e\u003cns0:Name\u003eProc/Comm/framework.proc\u003c/ns0:Name\u003e\u003cns0:ApplId/\u003e\u003cns0:Origin/\u003e\u003cns0:Subject\u003eStart Test Logging\u003c/ns0:Subject\u003e\u003cns0:conId/\u003e\u003cns0:OriginName/\u003e\u003cns0:Class\u003elogManagement\u003c/ns0:Class\u003e\u003cns0:Timestamp\u003e2016-08-03T13:54:32.083-04:00\u003c/ns0:Timestamp\u003e\u003c/ns0:AppLogTest\u003e \t ",
  "offset": 236901,
  "source": "/app/log/test.log",
  "tags": [
    "app",
    "test",
    "log",
    "app01"
  ],
  "type": "logtest"

For the CSV filter I see the option "skip_empty_columns". I was hoping there was something similar in the xml filter, but I have not found anything to that effect.

So it looks like the xml is the following:

<?xml version="1.0" encoding="UTF-8"?>
<ns0:AppLogTest xmlns:ns0=xxxxxxxxxx>
   <ns0:Name>xxxxxxxxxx</ns0:Name>
   <ns0:ApplId />
   <ns0:Origin />
   <ns0:Subject>Start Test Logging</ns0:Subject>
   <ns0:conId />
   <ns0:OriginName />
   <ns0:Class>logManagement</ns0:Class>
   <ns0:Timestamp>2016-08-03T13:54:32.083-04:00</ns0:Timestamp>
</ns0:AppLogTest>

That will generate JSON with a field named ApplId (note the second "l" and the lowercase "d"). The remove_field reference needs the complete field name with exactly that spelling and case. The following should work in this case:

	mutate { 
		remove_field => [ "[entry][ApplId]" ] 
	}
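
Why the exact spelling matters can be illustrated with a small plain-Ruby sketch. It assumes the parsed `entry` behaves like an ordinary Ruby Hash with case-sensitive string keys; `remove_entry_field` is a hypothetical helper used only for illustration, not the implementation of mutate's remove_field.

```ruby
# Hypothetical helper: delete a key from the parsed entry hash.
# Returns true when the key existed and was removed, false otherwise.
def remove_entry_field(entry, name)
  return false unless entry.key?(name)

  entry.delete(name)
  true
end

entry = { "ApplId" => {}, "Subject" => "Start Test Logging" }

remove_entry_field(entry, "AppID")   # no key with this spelling: nothing removed
remove_entry_field(entry, "ApplId")  # exact match: the key is removed

puts entry.keys.inspect  # prints ["Subject"]
```

A removal that names a key which does not exist simply does nothing, which matches the symptom of the field "not being removed".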

If you add the correct field name, then the document generated is the following:

{
       "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?><ns0:AppLogTest xmlns:ns0=\"http://log.internal.pri/log/Namespaces/Interface.xsd\"><ns0:Name>Proc/Comm/framework.proc</ns0:Name><ns0:ApplId/><ns0:Origin/><ns0:Subject>Start Test Logging</ns0:Subject><ns0:conId/><ns0:OriginName/><ns0:Class>logManagement</ns0:Class><ns0:Timestamp>2016-08-03T13:54:32.083-04:00</ns0:Timestamp></ns0:AppLogTest>",
      "@version" => "1",
    "@timestamp" => "2016-08-03T18:36:19.066Z",
          "host" => "Gabriels-MacBook-Pro.local",
         "entry" => {
         "xmlns:ns0" => "http://log.internal.pri/log/Namespaces/Interface.xsd",
              "Name" => "Proc/Comm/framework.proc",
            "Origin" => {},
           "Subject" => "Start Test Logging",
             "conId" => {},
        "OriginName" => {},
             "Class" => "logManagement",
         "Timestamp" => "2016-08-03T13:54:32.083-04:00"
    }
}

Please let me know if this is the actual issue.

Thanks!

--Gabriel

OK, so it looks like I can remove the field completely and that resolves the issue. I guess I'm still curious how/why it gets pulled in as a nested field instead of an empty string.

Finally, my last question is how would I remove that field only when it is empty? Since it isn't an empty string field but an empty nested field, the following hasn't worked.
This one fails to fulfill the condition:

if "" in [entry][ApplId]{
	mutate { 
		remove_field => [ "[entry][ApplId]" ] 
	}
}

This one fires every time:

if [entry][ApplId][]{
    mutate { 
        remove_field => [ "[entry][ApplId]" ] 
    }
}

So is there a way I can write a condition that removes the field only when it is empty, rather than all or nothing? When there is information present in ApplId I would prefer to keep that field.

Does this work for you?

if !([entry][ApplId] =~ /.+/) {
	mutate { 
		remove_field => [ "[entry][ApplId]" ] 
	}
}

This will check if the field has any value. If it doesn't, it'll prune that field.
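
The intent of that condition can be sketched in plain Ruby as follows. This is only a model of the logic, under the assumption that a non-string value such as an empty hash never matches the regular expression; the helper names `has_text?` and `prune_if_blank` are hypothetical and do not exist in Logstash.

```ruby
# Hypothetical helper: true only for a String with at least one character
# matched by /.+/. Hashes, nil, and empty strings count as "no value".
def has_text?(value)
  value.is_a?(String) && value.match?(/.+/)
end

# Hypothetical helper: remove the named key unless it holds real text.
# Mutates and returns the given hash.
def prune_if_blank(entry, name)
  entry.delete(name) unless has_text?(entry[name])
  entry
end

empty     = prune_if_blank({ "ApplId" => {}, "Class" => "logManagement" }, "ApplId")
populated = prune_if_blank({ "ApplId" => "ConsumeAll" }, "ApplId")

puts empty.key?("ApplId")      # prints false
puts populated.key?("ApplId")  # prints true
```

With this logic the field is removed when it is an empty object or an empty string, and kept when it carries a value such as "ConsumeAll".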

Please try out this and let us know if it works.

Thanks!

--Gabriel