How do I parse a TMX file (XML file for translation data) in Logstash?

I am using TMX files (an XML format for translation data) as the input to Logstash, which indexes the data into Elasticsearch.

A sample TMX file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="ModernMT - modernmt.eu" creationtoolversion="1.0" datatype="plaintext" o-tmf="ModernMT" segtype="sentence" adminlang="en-us" srclang="en-GB"/>
  <body>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20121019T114713Z">
      <tuv xml:lang="en-GB">
        <seg>The purpose of the standard is to establish and define the requirements for the provision of quality services by translation service providers.</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>L'obiettivo dello standard è stabilire e definire i requisiti affinché i fornitori di servizi di traduzione garantiscano servizi di qualità.</seg>
      </tuv>
    </tu>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z">
      <tuv xml:lang="en-GB">
        <seg>With 1,800 experienced and qualified resources translating regularly into over 200 language combinations, you can count on us for high quality professional translation services.</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>Abbiamo 1.800 professionisti esperti e qualificati che traducono regolarmente in oltre 200 combinazioni linguistiche; perciò, se cercate la qualità, potete contare su di noi.</seg>
      </tuv>
    </tu>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z">
      <tuv xml:lang="en-GB">
        <seg>Access our section of useful links</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>Da qui potrete accedere a una sezione che propone link a siti che possono essere di vostro interesse</seg>
      </tuv>
    </tu>

I need to treat each <tu> block as one event and use the two <tuv> blocks inside it as data fields: the <seg> text of the first <tuv> should be indexed in Elasticsearch as the source-language field, and the <seg> text of the second <tuv> as the target-language field.

A TMX document can contain more than 10,000 <tuv> blocks.

I am having trouble with the xml filter. My current configuration looks like this:

input {
    file {
        path => "/en-gb_pt-pt/81384/81384.xml"
        start_position => "beginning"
        codec => multiline {
            pattern => "<tu>"
            negate => "true"
            what => "previous"
        }
    }
}

filter {
    xml {
        source => "message"
        target => "xml_content"
        xpath => [ "//seg", "seg" ]
    }
}

output {
    stdout {
        #codec => json
        codec => rubydebug
    }
}
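
One possible cause of the trouble is that the multiline pattern "<tu>" never matches, because every opening <tu ...> tag in the sample carries attributes, so the whole file may end up in a single event. The following is only a sketch of how the input and filter sections might be adjusted, not a verified solution. It assumes the first <tuv> is always the source and the second always the target, and that the installed multiline codec supports auto_flush_interval.

```
input {
    file {
        path => "/en-gb_pt-pt/81384/81384.xml"
        start_position => "beginning"
        codec => multiline {
            # Start a new event at "<tu " or "<tu>"; "[ >]" keeps "<tuv" from matching
            pattern => "<tu[ >]"
            negate => true
            what => "previous"
            # Assumption: flush the last buffered <tu> after 2 seconds of inactivity
            auto_flush_interval => 2
        }
    }
}

filter {
    # The first event holds only the XML declaration, <tmx>, <header> and <body>
    if [message] !~ /<tu[ >]/ {
        drop { }
    }

    # The last event also contains the closing tags of the enclosing document
    mutate {
        gsub => [ "message", "</body>|</tmx>", "" ]
    }

    xml {
        source => "message"
        store_xml => false
        # Positional XPath: tuv[1] is the source language, tuv[2] the target
        xpath => {
            "/tu/tuv[1]/seg/text()" => "source"
            "/tu/tuv[2]/seg/text()" => "target"
        }
    }

    # XPath results are stored as arrays; keep the single matched string
    mutate {
        replace => {
            "source" => "%{[source][0]}"
            "target" => "%{[target][0]}"
        }
    }
}
```

With the rubydebug output, each event should then carry one source string and one target string per <tu>. If a <tu> ever contains more than two <tuv> blocks, or lists them in a different order, the positional XPath would need to be replaced with a selection by language attribute.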

Here is part of my index template:

"segment": {
    "_parent": {
        "type": "tm"
    },
    "_routing": {
        "required": "true"
    },
    "properties": {
        "@timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
        },
        "@version": {
            "type": "string"
        },
        "source": {
            "type": "string",
            "store": "true",
            "fields": {
                "length": {
                    "type": "token_count",
                    "analyzer": "standard"
                }
            }
        },
        "target": {
            "type": "string",
            "store": "true",
            "fields": {
                "length": {
                    "type": "token_count",
                    "analyzer": "standard"
                }
            }
        }
    }
}
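
Because this mapping declares a parent type and requires routing, the output stage has to supply a parent ID for every segment. The following is a rough sketch of what the elasticsearch output might look like once events carry source and target fields. The host, the index name "translations", and the field [@metadata][tm_id] are all hypothetical placeholders; the parent ID would have to be set earlier in the pipeline.

```
output {
    elasticsearch {
        hosts => ["localhost:9200"]        # assumed local node
        index => "translations"            # hypothetical index name
        document_type => "segment"         # mapping type from the template
        # Hypothetical field holding the ID of the parent "tm" document
        parent => "%{[@metadata][tm_id]}"
        # Route each child to the same shard as its parent
        routing => "%{[@metadata][tm_id]}"
    }
}
```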
