Hana_Ne
(Hana Ne)
June 4, 2018, 5:37pm
1
Hi, I need to index an XML file into Elasticsearch.
My XML file looks like this:
  <Talk Speaker = "Alastair Parvin" Title= " Architecture for the people by the people" >
    <Segment id ="1" >
      <Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>
      <Original_text lang="en"></Original_text>
      <Translation lang="ar"></Translation>
      <Translation lang="fr"></Translation>
    </Segment>
  </Talk>
</MulTed>
I need help, please.
dadoonet
(David Pilato)
June 4, 2018, 5:56pm
2
You need to transform it into a JSON document first.
You can use Logstash for that, or do it yourself, depending on the real source of this content.
Hana_Ne
(Hana Ne)
June 4, 2018, 6:04pm
3
OK, I will try Logstash, thanks.
Hana_Ne
(Hana Ne)
June 4, 2018, 6:16pm
4
I tried Logstash but I get an error:
input {
  file {
    path => "C:/Users/Dev/Desktop/file1.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    type => "xml"
    codec => multiline {
      pattern => "^<\?Multed .*\>"
      negate => "true"
      what => "previous"
    }
  }
}
filter {
  xml {
    source => "message"
    target => "Multed"
    xpath => ["/Multed/Talk/Segment/@id", "id",
              "/Multed/Talk/Segment/Original_text/text()", "original_text"
    ]
  }
  mutate {
    remove_field => [ "message" ]
    add_field => ["IDIndexed", "%{id}"]
    add_field => ["Original_text", "%{original_text}"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "indexXml"
  }
  stdout {
    codec => rubydebug
  }
}
The error is in %{id} and %{original_text}: they are not replaced with values.
dadoonet
(David Pilato)
June 4, 2018, 6:38pm
5
Please don't post images of text, as they are hard to read and not searchable.
Instead, paste the text and format it with the </> icon. Check the preview window.
I moved your question to #logstash
Hana_Ne
(Hana Ne)
June 4, 2018, 6:43pm
6
Ok
{
    "type" => "xml",
    "IDIndexed" => "%{id}",
    "@timestamp" => 2018-06-04T18:16:49.466Z,
    "host" => "Dev-PC",
    "path" => "C:/Users/Dev/Desktop/file1.xml",
    "@version" => "1",
    "Original_text" => "%{original_text}",
    "tags" => [
        [0] "multiline",
        [1] "multiline_codec_max_lines_reached",
        [2] "_xmlparsefailure"
    ]
}
The multiline codec is incorrectly configured. Which line from the XML file is `^<\?Multed .*\>` supposed to match?
Hana_Ne
(Hana Ne)
June 5, 2018, 12:38am
8
`^<\?Multed .*\>` is meant to match the root element of the document:
<Multed>
  <Talk Speaker = "Alastair Parvin" Title= " Architecture for the people by the people" >
    <Segment id ="1" >
      <Time-slot>00:00:12,884 --> 00:00:16,053</Time-slot>
      <Original_text lang="en"></Original_text>
      <Translation lang="ar"></Translation>
      <Translation lang="fr"></Translation>
    </Segment>
  </Talk>
</MulTed>
The regular expression `<\?Multed .*\>` does not match any of the lines in your example document.
Hana_Ne
(Hana Ne)
June 5, 2018, 4:28pm
10
Which regular expression is correct?
Badger
June 5, 2018, 5:18pm
11
You could try
codec => multiline {
  pattern => "^<MulTed>"
  negate => "true"
  what => "previous"
  auto_flush_interval => 2
}
Badger
June 5, 2018, 5:47pm
13
If the XML is indented then get rid of the start-of-line anchor and use
pattern => "<MulTed>"
Hana_Ne
(Hana Ne)
June 5, 2018, 5:56pm
14
I get the same unresolved values:
input {
  file {
    path => "C:/Users/Dev/Desktop/file1.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    type => "xml"
    codec => multiline {
      pattern => "<MulTed>"
      negate => "true"
      what => "previous"
      auto_flush_interval => 2
    }
  }
}
filter {
  xml {
    source => "Talk"
    target => "MulTed"
    xpath => ["MulTed/Talk/Segment/@id", "id",
              "MulTed/Talk/Segment/Original_text/text()", "original_text"]
  }
  mutate {
    remove_field => [ "message" ]
    add_field => ["IDIndexed", "%{id}"]
    add_field => ["Original_text", "%{original_text}"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "senind"
  }
  stdout {
    codec => rubydebug
  }
}
{
    "tags" => [
        [0] "multiline"
    ],
    "Original_text" => "%{original_text}",
    "@version" => "1",
    "path" => "C:/Users/Dev/Desktop/file1.xml",
    "type" => "xml",
    "host" => "Dev-PC",
    "@timestamp" => 2018-06-05T17:54:47.115Z,
    "IDIndexed" => "%{id}"
}
Badger
June 5, 2018, 6:27pm
15
There is no error there. The multiline codec worked.
Hana_Ne
(Hana Ne)
June 5, 2018, 6:29pm
16
But the values of %{id} and %{original_text} are not inserted.
Badger
June 5, 2018, 6:33pm
17
That's because your xpath expressions are wrong. They refer to Multed, but the XML has MulTed. Or perhaps the other way around. Either way, it is case sensitive. Also, Original_text/text() is empty.
Note also that xpath always returns arrays, so you might want to
if [id] { mutate { replace => { "id" => "%{[id][0]}" } } }
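Putting these points together, a corrected filter might look roughly like the sketch below. It rests on two assumptions: that the root element really is spelled MulTed in the file (use whatever spelling the file has, consistently), and that the xml filter should read the message field, since that is where the multiline codec puts the joined lines. The config posted earlier points source at Talk, which is not a field on the event. Setting store_xml to false is a choice made for this sketch, not something suggested earlier in the thread.

```
filter {
  xml {
    # The multiline codec stores the joined XML text in "message".
    source => "message"
    # Assumption for this sketch: only run the XPath expressions and
    # skip storing the full parsed tree, so no target is needed.
    store_xml => false
    # Element names are case sensitive. "MulTed" is an assumption here;
    # it must match the root tag exactly as written in the file.
    xpath => {
      "/MulTed/Talk/Segment/@id" => "id"
      "/MulTed/Talk/Segment/Original_text/text()" => "original_text"
    }
  }

  # xpath results are arrays, so copy the first element, and only
  # when the field was actually created.
  if [id] {
    mutate { add_field => { "IDIndexed" => "%{[id][0]}" } }
  }
  if [original_text] {
    mutate { add_field => { "Original_text" => "%{[original_text][0]}" } }
  }
}
```

If the file contains several Segment elements, each array holds one entry per segment, so `[id][0]` keeps only the first. In the sample document Original_text is empty, so no original_text field is created and the second conditional is skipped.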
Badger
June 5, 2018, 9:51pm
20
OK, so comment out 'remove_field => [ "message" ]' and show us what an event looks like, either using stdout { codec => rubydebug }, or copy and paste from the JSON event in Kibana.
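A rough sketch of that debugging setup is below. The layout is an assumption, so adapt it to the pipeline posted above.

```
filter {
  # ... xml filter as posted above ...

  # While debugging, keep "message" so the joined XML stays visible.
  # mutate { remove_field => [ "message" ] }
}

output {
  # Print every event in full so the fields can be inspected.
  stdout { codec => rubydebug }
}
```

With message left in place, the printed event shows exactly what text the xml filter received, which makes it easier to see whether the multiline codec produced one complete document.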