Elasticsearch xml file


(Hana Ne) #1

Hi, I need to index an XML file into Elasticsearch.
My XML file looks like this:

<Talk Speaker = "Alastair Parvin" Title= " Architecture for the people by the people" >
	<Segment id ="1" >
		<Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>
		<Original_text lang="en"></Original_text>
		<Translation lang="ar"></Translation>
		<Translation lang="fr"></Translation>
	</Segment>
</Talk>
</MulTed>

I need help, please.


(David Pilato) #2

You need to transform it into a JSON document first.

You can use Logstash if needed, or do it yourself, depending on the real source of this content.
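
For the do-it-yourself route, the conversion could look roughly like the minimal Python sketch below. It assumes the root element is consistently named MulTed (the sample as posted lacks its opening root tag). The function name segments_to_docs and the output field names are made up for illustration, and sending the resulting JSON to Elasticsearch is left out.

```python
import json
import xml.etree.ElementTree as ET

# Well-formed variant of the posted sample; the <MulTed> opening tag is an assumption.
SAMPLE = """<MulTed>
<Talk Speaker="Alastair Parvin" Title=" Architecture for the people by the people">
  <Segment id="1">
    <Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>
    <Original_text lang="en"></Original_text>
    <Translation lang="ar"></Translation>
    <Translation lang="fr"></Translation>
  </Segment>
</Talk>
</MulTed>"""


def segments_to_docs(xml_text):
    """Flatten every Segment into one dict, ready to be serialized as JSON."""
    root = ET.fromstring(xml_text)
    docs = []
    for talk in root.iter("Talk"):
        for seg in talk.iter("Segment"):
            docs.append({
                "speaker": talk.get("Speaker"),
                "title": (talk.get("Title") or "").strip(),
                "segment_id": seg.get("id"),
                "time_slot": seg.findtext("Time-slot", default=""),
                "original_text": seg.findtext("Original_text", default=""),
                # One entry per Translation element, keyed by its lang attribute.
                "translations": {
                    t.get("lang"): (t.text or "")
                    for t in seg.findall("Translation")
                },
            })
    return docs


# Print one JSON document per line.
for doc in segments_to_docs(SAMPLE):
    print(json.dumps(doc, ensure_ascii=False))
```

Emitting one document per Segment is only one possible design; one document per Talk with nested segments would work too, depending on how you want to search.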


(Hana Ne) #3

OK, I'll try Logstash, thanks.


(Hana Ne) #4

I tried Logstash, but I get an error:

input {
    file {
        path => "C:/Users/Dev/Desktop/file1.xml"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        type => "xml"
        codec => multiline {
            pattern => "^<\?Multed .*\>"
            negate => "true"
            what => "previous"
        }
    }
}
filter {
    xml {
        source => "message"
        target => "Multed"
        xpath => [
            "/Multed/Talk/Segment/@id", "id",
            "/Multed/Talk/Segment/Original_text/text()", "original_text"
        ]
    }
    mutate {
        remove_field => [ "message" ]
        add_field => ["IDIndexed", "%{id}"]
        add_field => ["Original_text", "%{original_text}"]
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "indexXml"
    }
    stdout {
        codec => rubydebug
    }
}

The error is in %{id} and %{original_text}.


(David Pilato) #5

Please don't post images of text, as they are hard to read and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.

I moved your question to #logstash


(Hana Ne) #6

Ok
{
    "type" => "xml",
    "IDIndexed" => "%{id}",
    "@timestamp" => 2018-06-04T18:16:49.466Z,
    "host" => "Dev-PC",
    "path" => "C:/Users/Dev/Desktop/file1.xml",
    "@version" => "1",
    "Original_text" => "%{original_text}",
    "tags" => [
        [0] "multiline",
        [1] "multiline_codec_max_lines_reached",
        [2] "_xmlparsefailure"
    ]
}


(Magnus Bäck) #7

The multiline codec is incorrectly configured. Which line from the XML file is ^<\?Multed .*\> supposed to match?


(Hana Ne) #8

`^<\?Multed .*\>` is meant to match the root of the document:

        <Multed>
        <Talk Speaker = "Alastair Parvin" Title= " Architecture for the people by the people" >
        	<Segment id ="1" >
        		<Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>
        		<Original_text lang="en"></Original_text>
        		<Translation lang="ar"></Translation>
        		<Translation lang="fr"></Translation>
        	</Segment>
        </Talk>
        </MulTed>

(Magnus Bäck) #9

The regular expression <\?Multed .*\> does not match any of the lines in your example document.
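
To see why, the pattern can be tried against each line of the sample. The sketch below uses Python's re module as a stand-in for the regex engine Logstash uses, which is an assumption; the helper matching_lines and the alternative pattern are hypothetical and only there for comparison. The escaped `\?` demands a literal question mark and the pattern also demands a space after the tag name, so it would only match a line starting with `<?Multed `.

```python
import re

# Lines from the XML posted in the thread, with indentation removed.
lines = [
    '<Multed>',
    '<Talk Speaker = "Alastair Parvin" Title= " Architecture for the people by the people" >',
    '<Segment id ="1" >',
    '<Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>',
    '</Segment>',
    '</Talk>',
    '</MulTed>',
]


def matching_lines(pattern, candidates):
    """Return the lines in which the pattern is found anywhere (search semantics)."""
    regex = re.compile(pattern)
    return [line for line in candidates if regex.search(line)]


original = r"^<\?Multed .*\>"  # requires the literal text "<?Multed " at line start
alternative = r"<Multed>"      # hypothetical: the opening tag exactly as posted

print(matching_lines(original, lines))     # []
print(matching_lines(alternative, lines))  # ['<Multed>']
```

Note that the match is case sensitive, so the closing tag `</MulTed>` is not picked up by the alternative either.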


(Hana Ne) #10

Which regular expression is correct?


#11

You could try

        codec => multiline {
            pattern =>  "^<MulTed>"
            negate => "true"
            what => "previous"
            auto_flush_interval => 2
        }

(Hana Ne) #12

The same problem


#13

If the XML is indented then get rid of the start-of-line anchor and use

pattern =>  "<MulTed>"

(Hana Ne) #14

I get the same wrong values:

        input {
            file {
                path => "C:/Users/Dev/Desktop/file1.xml"
                start_position => "beginning"
                sincedb_path => "/dev/null"
                type => "xml"
                codec => multiline {
                    pattern => "<MulTed>"
                    negate => "true"
                    what => "previous"
                    auto_flush_interval => 2
                }
            }
        }
        filter {
            xml {
                source => "Talk"
                target => "MulTed"
                xpath => [
                    "MulTed/Talk/Segment/@id", "id",
                    "MulTed/Talk/Segment/Original_text/text()", "original_text"
                ]
            }
            mutate {
                remove_field => [ "message" ]
                add_field => ["IDIndexed", "%{id}"]
                add_field => ["Original_text", "%{original_text}"]
            }
        }
        output {
            elasticsearch {
                hosts => ["localhost:9200"]
                index => "senind"
            }
            stdout {
                codec => rubydebug
            }
        }

{
    "tags" => [
        [0] "multiline"
    ],
    "Original_text" => "%{original_text}",
    "@version" => "1",
    "path" => "C:/Users/Dev/Desktop/file1.xml",
    "type" => "xml",
    "host" => "Dev-PC",
    "@timestamp" => 2018-06-05T17:54:47.115Z,
    "IDIndexed" => "%{id}"
}


#15

There is no error there. The multiline codec worked.


(Hana Ne) #16

But the values of %{id} and %{original_text} are not inserted.


#17

That's because your xpath expressions are wrong. They refer to Multed, but the XML has MulTed. Or perhaps the other way around. Either way, it is case sensitive. Also, Original_text/text() is empty.

Note also that xpath always returns arrays, so you might want to

if [id] { mutate { replace => { "id" => "%{[id][0]}" } } }
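
The three points above (case sensitivity, the empty text() result, and list-valued results) can be reproduced outside Logstash. This is a rough Python sketch using ElementTree rather than the library the xml filter uses, so it only illustrates the behaviour; the wrapper element and the helper names segment_ids and original_texts are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XML = ('<MulTed><Talk><Segment id="1">'
       '<Original_text lang="en"></Original_text>'
       '</Segment></Talk></MulTed>')

# Synthetic wrapper standing in for the XPath document node (an assumption,
# needed only because ElementTree supports a limited XPath subset).
document = ET.Element("document")
document.append(ET.fromstring(XML))


def segment_ids(root_name):
    """Return the id attribute of every Segment under the given root name."""
    path = f"./{root_name}/Talk/Segment"
    return [seg.get("id") for seg in document.findall(path)]


def original_texts(root_name):
    """Return the non-empty Original_text strings, similar to a text() selection."""
    path = f"./{root_name}/Talk/Segment/Original_text"
    return [el.text for el in document.findall(path) if el.text]


print(segment_ids("Multed"))     # [] because the case does not match
print(segment_ids("MulTed"))     # ['1'], a list even for a single match
print(original_texts("MulTed"))  # [] because the element has no text

ids = segment_ids("MulTed")
first_id = ids[0] if ids else None  # same idea as "%{[id][0]}" guarded by if [id]
print(first_id)
```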

(Hana Ne) #18

OK, thanks, I'll try it.


(Hana Ne) #19

I get the same error. Thanks.


#20

OK, so comment out 'remove_field => [ "message" ]' and show us what an event looks like, either using stdout { codec => rubydebug }, or copy and paste from the JSON event in Kibana.