Combining a varying number of XML child nodes into one Elasticsearch field?


#1

I've searched the forum and tried various processes, but I can't get to a solution. Any help would be really appreciated.

I have XML like this:

			<applicants>
			<applicant sequence="00" app-type="applicant-inventor" designation="us-only">
				<addressbook>
					<last-name>Goldkind</last-name>
					<first-name>Tina</first-name>
					<address>
						<city>St. James</city>
						<state>NY</state>
						<country>US</country>
					</address>
				</addressbook>
				<nationality>
					<country>US</country>
				</nationality>
				<residence>
					<country>US</country>
				</residence>
			</applicant>
			<applicant sequence="00" app-type="applicant-inventor" designation="us-only">
				<addressbook>
					<last-name>Smith</last-name>
					<first-name>John</first-name>
					<address>
						<city>St. James</city>
						<state>NY</state>
						<country>US</country>
					</address>
				</addressbook>
				<nationality>
					<country>US</country>
				</nationality>
				<residence>
					<country>US</country>
				</residence>
			</applicant>
		</applicants>

where the number of nodes varies.

I need to put the values for last-name and first-name into one field in this format:

Goldkind,Tina Smith,John

From what I've read, I can do this with XPath 2.0, but the module Logstash uses to process XML only supports XPath 1.0.

Is there a way I can accomplish this within Logstash? Maybe some Ruby code?
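To make the target format concrete, here is a minimal sketch of the join in plain Ruby. It uses REXML purely for illustration (this is an assumption, not necessarily the parser Logstash uses), and the helper name join_applicant_names and the trimmed sample are hypothetical.

```ruby
require "rexml/document"

# Sample trimmed from the XML above; only the name nodes matter here.
SAMPLE_XML = <<~XML
  <applicants>
    <applicant sequence="00">
      <addressbook>
        <last-name>Goldkind</last-name>
        <first-name>Tina</first-name>
      </addressbook>
    </applicant>
    <applicant sequence="00">
      <addressbook>
        <last-name>Smith</last-name>
        <first-name>John</first-name>
      </addressbook>
    </applicant>
  </applicants>
XML

# Returns the stripped text of a named child element, or "" if it is missing.
def child_text(parent, name)
  element = parent.elements[name]
  element ? element.text.to_s.strip : ""
end

# join_applicant_names is a hypothetical helper, not a Logstash API.
# It builds one "Last,First" pair per <addressbook> and joins the pairs with spaces,
# so it works for any number of applicants.
def join_applicant_names(xml)
  doc = REXML::Document.new(xml)
  pairs = REXML::XPath.match(doc, "//applicant/addressbook").map do |book|
    "#{child_text(book, 'last-name')},#{child_text(book, 'first-name')}"
  end
  pairs.join(" ")
end

puts join_applicant_names(SAMPLE_XML)
```

Pairing the names per addressbook node, rather than collecting all last names and all first names separately, keeps each last name next to its own first name even if one of the two is missing.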

Thank you,


(Walker) #2

I need to put the values for and into one field in this format:

What??

If you are trying to combine the values of separate fields into one field, you can use the common option add_field:

filter {
  mutate {
    id => "New field creation"
    add_field => {
      "NewFieldName" => "%{last-name} %{first-name}"
    }
  }
}

#3

Thanks @wwalker. I've been up all night, so I might not have been as clear as I should have been.

I'm taking data from XML first-name and last-name nodes, and the number of them varies per document. In one document there might be only one of each; in another there could be 12.

So, I'm not trying to combine existing fields. I'm trying to take ALL of the first-name and ALL of the last-name nodes from the XML doc and add those to a field.

Does that make sense?

Thanks,


(Walker) #4

Can you paste your current config? I'm assuming that each <applicant> is on its own line in your document, or that you are using the multiline codec. Logstash creates a new event for each line of input, so, as far as I know, combining them all is not possible without modifying the structure of the XML, and that would open a totally different can of worms.
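The line-versus-event distinction can be sketched in plain Ruby. This is only a rough model, assuming a multiline codec set up so that lines which do not match a start pattern are appended to the previous event; group_lines is a hypothetical stand-in, not Logstash code.

```ruby
# Six input lines that make up two XML documents.
LINES = [
  '<?xml version="1.0" encoding="UTF-8"?>',
  "<applicants>",
  "<applicant/>",
  "</applicants>",
  '<?xml version="1.0" encoding="UTF-8"?>',
  "<applicants/>"
].freeze

# group_lines is a hypothetical model of "negate => true, what => previous":
# a line matching start_pattern opens a new event, any other line is
# appended to the event before it.
def group_lines(lines, start_pattern)
  events = []
  lines.each do |line|
    if line.match?(start_pattern) || events.empty?
      events << [line]
    else
      events.last << line
    end
  end
  events.map { |chunk| chunk.join("\n") }
end

events = group_lines(LINES, /<\?xml version=/)
puts events.length
```

Under that assumption each whole document ends up in a single event, so every applicant node of a document would be available to the same filter run.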


#5

Current config, using multiline:

input {

file {
	path => [
		"/opt/uspto/*.xml"
		]
	start_position => "beginning"
	# used for testing
	sincedb_path => "/dev/null"
	# set this sincedb path when not testing
	#sincedb_path => "/opt/logstash/tmp/sincedb"
	exclude => "*.gz"
	type => "xml"
	codec => multiline {
		 #pattern => "<wo-ocr-published-application"
		 pattern => "<\?xml version=\"1.0\" encoding=\"UTF-8\"\?>"
		 negate => "true"
		 what => "previous"
         max_lines => 3000
		}
}

}
filter {

if "multiline" in [tags] {
    mutate {
		gsub => [
		  # replace <p> with a blank
		  "message", "<p\s+id=\"\S+\"\s+num=\"\S+\">", "",
		  # replace </p> with a new line 
		  "message", "</p>", "\n",
		  # replace <claim-text> with a blank 
		  "message", "<claim-text>", "",
		  # replace </claim-text> with a new line
		  "message", "</claim-text>", "\n"
		]
	  }
      
	mutate {
		# add some new fields which we will populate with parsed data in the replace section
		add_field => {
            "country" => ""
            "docnumber" => ""
            "kind" => ""
            "date" => ""
            "title" => ""
            "abstract" => ""
            "applicants" => ""
		  }
		# I believe this would just create a field like 'claims.content' => "claims.content"
		# Need to pull the data out of one field and create a new field with the actual content
		#replace => [ "[xmldata][claims][0][claim][0][content]", "%{[xmldata][claims][0][claim][0][content]}" ]
		#replace => [ "[xmldata][country]", "%{[xmldata][country]}" ]
	}
	
	grok {
		patterns_dir => ["/etc/logstash/patterns"]
		# identify the content between <claims lang=""> and </claims>
		match => [ "message", "%{WIPOCLAIMS:claims_data}" ]
		# identify the two-character value inside the quotes of <claims lang="">
		match => [ "message", "%{WIPOCLAIMSLANG:claims_language}" ]
    }
	
	xml {
		source => "message"
		#store_xml => false
		target => "xmldata"
		xpath => [
		"/us-patent-application/us-bibliographic-data-application/publication-reference/document-id/country/text()", "country",
		"/us-patent-application/us-bibliographic-data-application/publication-reference/document-id/doc-number/text()", "docnumber",
        "/us-patent-application/us-bibliographic-data-application/publication-reference/document-id/kind/text()", "kind",
        "/us-patent-application/us-bibliographic-data-application/publication-reference/document-id/date/text()", "date",
        "/us-patent-application/us-bibliographic-data-application/invention-title/text()", "title",
        "/us-patent-application/abstract/text()", "abstract"
        
        #"concat(//applicant/addressbook/last-name/text(),',',//applicant/addressbook/first-name/text())", "applicants" # this works, but creates multiple records
        #"string-join(//applicant/addressbook/(concat(last-name/text(), ',', first-name/text())), ' ')", "applicants"  # only in XPath 2.0
        #"/us-patent-application//us-bibliographic-data-application/applicant/addressbook/last-name/text()", "applicants" # only gets one value. Need multiple
		]
	}

}

}

output {

elasticsearch {
	codec => json
	hosts => "removed:443"
	index => "uspto"
}
stdout {
	codec => rubydebug
}

}

Is it possible for me to iterate over the XML nodes in Ruby, possibly creating an array and then place that into a field?

I'm open to any ideas which might get the job done and be reliable.
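One possible shape for such an iteration, as a plain-Ruby sketch only. It assumes the parsed "xmldata" field is a nested structure of hashes and arrays in which every element is wrapped in an array; the sample hash, its nesting, and the helper names collect_addressbooks and applicants_field are hypothetical and not taken from real Logstash output.

```ruby
# Hypothetical shape of the parsed data; the real nesting may differ.
XMLDATA = {
  "applicants" => [{
    "applicant" => [
      { "addressbook" => [{ "last-name" => ["Goldkind"], "first-name" => ["Tina"] }] },
      { "addressbook" => [{ "last-name" => ["Smith"], "first-name" => ["John"] }] }
    ]
  }]
}.freeze

# Walks nested hashes and arrays and returns every hash stored under an
# "addressbook" key, at any depth.
def collect_addressbooks(node, found = [])
  case node
  when Hash
    node.each do |key, value|
      if key == "addressbook"
        books = value.is_a?(Array) ? value : [value]
        found.concat(books.select { |book| book.is_a?(Hash) })
      else
        collect_addressbooks(value, found)
      end
    end
  when Array
    node.each { |child| collect_addressbooks(child, found) }
  end
  found
end

# Builds the "Last,First Last,First" string from whatever addressbooks exist.
def applicants_field(xmldata)
  collect_addressbooks(xmldata).map do |book|
    last = Array(book["last-name"]).first.to_s
    first = Array(book["first-name"]).first.to_s
    "#{last},#{first}"
  end.join(" ")
end

puts applicants_field(XMLDATA)
```

Inside a ruby filter, the hash would presumably come from event.get("xmldata") and the result would be stored with event.set("applicants", ...). The recursive search avoids hard-coding the path to the applicant nodes, which the sample XML does not show in full.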

Thanks,


(Walker) #6

Unfortunately, ruby scripting is well outside my realm of knowledge.

Unrelated, but your pipeline seems overly complex. You shouldn't have to create new fields before the xpath step; at least I didn't need to in my implementation of the XML filter using xpath. Additionally, you can use alternation (the | operator) in your gsub regexes to combine patterns:

mutate {
  id => "XML Tag Replacement"
  gsub => [
    #Replace <p> or <claim> with a blank
    "message", "(<claim>|<p>)", "",
    #Replace </p> or </claim> with new line
    "message", "(</claim>|</p>)", "\n"
  ]
}
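The effect of the two alternation rules can be checked outside Logstash with Ruby's own String#gsub. This is only a sketch: strip_tags is a hypothetical helper, and the sample string is made up.

```ruby
# strip_tags is a hypothetical helper mirroring the two gsub rules above.
def strip_tags(text)
  # Opening tags are removed entirely.
  without_open = text.gsub(/<claim>|<p>/, "")
  # Closing tags are replaced with a newline.
  without_open.gsub(%r{</claim>|</p>}, "\n")
end

puts strip_tags("<p>First paragraph</p><claim>A claim</claim>")
```

Note that these patterns only match bare tags; a <p> tag carrying id and num attributes would still need the longer pattern from the original config.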

#7

Thanks. I didn't think I needed to create new fields before using xpath, but it wasn't creating them.

I also agree that my gsub is not efficient. I usually start with plain code and then, once it's working as expected, go back and refine and consolidate.

But, thanks for the notes! I'll use them.

Anyone else have any ideas about how to get the data I'm looking for? I've seen similar posts - but nothing exactly like what I'm trying to do.

