Indexing an XML file into Elasticsearch

Hi, I need to index an XML file into Elasticsearch.
My XML file looks like this:

<Talk Speaker = "Alastair Parvin" Title= " Architecture for the people by the people" >
	<Segment id ="1" >
		<Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>
		<Original_text lang="en"></Original_text>
		<Translation lang="ar"></Translation>
		<Translation lang="fr"></Translation>
	</Segment>
</Talk>
</MulTed>

I need help, please.

You need to transform it into a JSON document first.

You can use Logstash for that, or do it yourself, depending on what the real source of this content is.
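
If you go the "do it yourself" route, the conversion could look roughly like the Python sketch below. It is only an illustration: the `<MulTed>` root element, the sample text values, the function name `segments_to_docs` and the JSON field names are all assumptions, not something your data or Elasticsearch requires.

```python
import json
import xml.etree.ElementTree as ET

# Sample modelled on the XML in the question. The <MulTed> root element and
# the text values are assumptions added so that the sample is complete.
SAMPLE = """<MulTed>
<Talk Speaker="Alastair Parvin" Title=" Architecture for the people by the people">
  <Segment id="1">
    <Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>
    <Original_text lang="en">hello</Original_text>
    <Translation lang="ar">marhaban</Translation>
    <Translation lang="fr">bonjour</Translation>
  </Segment>
</Talk>
</MulTed>"""


def segments_to_docs(xml_text):
    """Return one dict per <Segment>; each dict can be serialized as JSON."""
    root = ET.fromstring(xml_text)
    docs = []
    for talk in root.iter("Talk"):
        for seg in talk.iter("Segment"):
            docs.append({
                # Field names below are hypothetical; pick whatever suits your index.
                "speaker": talk.get("Speaker"),
                "title": (talk.get("Title") or "").strip(),
                "segment_id": seg.get("id"),
                "time_slot": seg.findtext("Time-slot"),
                "original_text": seg.findtext("Original_text") or "",
                "translations": {
                    t.get("lang"): (t.text or "")
                    for t in seg.findall("Translation")
                },
            })
    return docs


if __name__ == "__main__":
    for doc in segments_to_docs(SAMPLE):
        print(json.dumps(doc))
```

Each printed line is one JSON document per segment, which you could then send to Elasticsearch with whatever client you prefer.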


OK, I will try Logstash, thanks.

I tried Logstash, but I get an error:

input {
    file {
        path => "C:/Users/Dev/Desktop/file1.xml"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        type => "xml"
        codec => multiline {
            pattern => "^<\?Multed .*\>"
            negate => "true"
            what => "previous"
        }
    }
}
filter {
    xml {
        source => "message"
        target => "Multed"
        xpath => [
            "/Multed/Talk/Segment/@id", "id",
            "/Multed/Talk/Segment/Original_text/text()", "original_text"
        ]
    }
    mutate {
        remove_field => [ "message" ]
        add_field => ["IDIndexed", "%{id}"]
        add_field => ["Original_text", "%{original_text}"]
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "indexXml"
    }
    stdout {
        codec => rubydebug
    }
}

The error is in %{id} and %{original_text}.

Please don't post images of text, as they are hard to read and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.

I moved your question to #logstash

Ok
{
    "type" => "xml",
    "IDIndexed" => "%{id}",
    "@timestamp" => 2018-06-04T18:16:49.466Z,
    "host" => "Dev-PC",
    "path" => "C:/Users/Dev/Desktop/file1.xml",
    "@version" => "1",
    "Original_text" => "%{original_text}",
    "tags" => [
        [0] "multiline",
        [1] "multiline_codec_max_lines_reached",
        [2] "_xmlparsefailure"
    ]
}

The multiline codec is incorrectly configured. Which line from the XML file is ^<\?Multed .*\> supposed to match?

`^<\?Multed .*\>` is meant to match the root of the document:
        <Multed>
        <Talk Speaker = "Alastair Parvin" Title= " Architecture for the people by the people" >
        	<Segment id ="1" >
        		<Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>
        		<Original_text lang="en"></Original_text>
        		<Translation lang="ar"></Translation>
        		<Translation lang="fr"></Translation>
        	</Segment>
        </Talk>
        </MulTed>

The regular expression <\?Multed .*\> does not match any of the lines in your example document.
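
You can check this outside Logstash by testing the pattern against each line. The sketch below uses Python's `re` module as a stand-in for the regex engine Logstash uses, so treat it as an approximation. The line list is abbreviated from your example, the `<MulTed>` spelling of the opening tag is an assumption, and `matching_lines` is just a helper name made up for this post.

```python
import re

# Lines abbreviated from the example document. The spelling of the opening
# root tag (<MulTed>) is an assumption; adjust it to match the real file.
LINES = [
    "<MulTed>",
    '<Talk Speaker = "Alastair Parvin" Title= " Architecture for the people by the people" >',
    '\t<Segment id ="1" >',
    "\t</Segment>",
    "</Talk>",
    "</MulTed>",
]

# The pattern from the posted config: it requires a literal "<?Multed "
# at the start of a line, which never occurs in the document.
ORIGINAL = re.compile(r"^<\?Multed .*\>")

# A simpler pattern that only looks for the opening root tag.
ANCHORED = re.compile(r"^<MulTed>")
UNANCHORED = re.compile(r"<MulTed>")


def matching_lines(pattern, lines):
    """Return the lines in which the pattern is found."""
    return [line for line in lines if pattern.search(line)]


if __name__ == "__main__":
    print("original:  ", matching_lines(ORIGINAL, LINES))
    print("anchored:  ", matching_lines(ANCHORED, LINES))
    print("unanchored:", matching_lines(UNANCHORED, ["    <MulTed>"]))
```

With `negate => "true"` and `what => "previous"`, every line that does not match is appended to the previous event, so a pattern that matches nothing folds the whole file into one ever-growing event.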

Which regular expression is correct?

You could try

        codec => multiline {
            pattern =>  "^<MulTed>"
            negate => "true"
            what => "previous"
            auto_flush_interval => 2
        }

I get the same problem.

If the XML is indented then get rid of the start-of-line anchor and use

pattern =>  "<MulTed>"

I still get the same error values:

        input {
            file {
                path => "C:/Users/Dev/Desktop/file1.xml"
                start_position => "beginning"
                sincedb_path => "/dev/null"
                type => "xml"
                codec => multiline {
                    pattern => "<MulTed>"
                    negate => "true"
                    what => "previous"
                    auto_flush_interval => 2
                }
            }
        }
        filter {
            xml {
                source => "Talk"
                target => "MulTed"
                xpath => [
                    "MulTed/Talk/Segment/@id", "id",
                    "MulTed/Talk/Segment/Original_text/text()", "original_text"
                ]
            }
            mutate {
                remove_field => [ "message" ]
                add_field => ["IDIndexed", "%{id}"]
                add_field => ["Original_text", "%{original_text}"]
            }
        }
        output {
            elasticsearch {
                hosts => ["localhost:9200"]
                index => "senind"
            }
            stdout {
                codec => rubydebug
            }
        }

{
    "tags" => [
        [0] "multiline"
    ],
    "Original_text" => "%{original_text}",
    "@version" => "1",
    "path" => "C:/Users/Dev/Desktop/file1.xml",
    "type" => "xml",
    "host" => "Dev-PC",
    "@timestamp" => 2018-06-05T17:54:47.115Z,
    "IDIndexed" => "%{id}"
}

There is no error there. The multiline codec worked.

But the values of %{id} and %{original_text} are not inserted.

That's because your xpath expressions are wrong. They refer to Multed, but the XML has MulTed. Or perhaps the other way around. Either way, it is case sensitive. Also, Original_text/text() is empty.

Note also that xpath always returns arrays, so you might want to

if [id] { mutate { replace => { "id" => "%{[id][0]}" } } }
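
To illustrate both points, here is a small Python sketch. It uses `xml.etree.ElementTree`, which only supports a subset of XPath and is not what the Logstash xml filter uses internally, so it is an analogy rather than a reproduction. The document, the `holder` wrapper element and the helper names are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with the root spelled "MulTed" and two segments.
DOC = """<MulTed>
  <Talk Speaker="Alastair Parvin">
    <Segment id="1"><Original_text lang="en"></Original_text></Segment>
    <Segment id="2"><Original_text lang="en">some text</Original_text></Segment>
  </Talk>
</MulTed>"""


def segment_ids(root_name):
    """Look up Segment ids under a root element with the given spelling."""
    # Wrap the parsed document so the root name is part of the search path.
    holder = ET.Element("holder")
    holder.append(ET.fromstring(DOC))
    return [seg.get("id") for seg in holder.findall(f"{root_name}/Talk/Segment")]


def first_or_none(values):
    """Path lookups return lists; take the first item if there is one."""
    return values[0] if values else None


if __name__ == "__main__":
    print(segment_ids("MulTed"))                 # exact case: both ids found
    print(segment_ids("Multed"))                 # wrong case: nothing found
    print(first_or_none(segment_ids("MulTed")))  # unwrap the list
```

The wrong spelling does not raise an error, it simply finds nothing, which is why the `%{id}` reference stays unsubstituted instead of producing a visible failure.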

OK, thanks, I will try it.

I get the same error. Thanks.

OK, so comment out 'remove_field => [ "message" ]' and show us what an event looks like, either using stdout { codec => rubydebug }, or copy and paste from the JSON event in Kibana.