How to index a full XML file into Elasticsearch

sdaruna · January 20, 2016, 7:40pm

Hi,

I need to load a full XML file into ES for our requirement, and when searching I need to get the XML nodes back.

E.g.:

My XML file

Let's say it looks like this (a small example):

Doc 1) <Document><Record>A1</Record></Document>
Doc 2) <Document><Record>A2</Record></Document>

I need to index with the doc names,

and I need to get the XML nodes (not just the data):

<Record>A1</Record>
---------------------------------
<Record>A2</Record>

How could I do that with Elasticsearch?

magnusbaeck · January 20, 2016, 7:44pm

Elasticsearch stores JSON documents so if you want to search specific XML nodes you have to convert the XML document to a JSON document.

Hint: Format XML as code to avoid having its tags stripped. Use the preview pane.

sdaruna · January 20, 2016, 7:48pm

Hi Magnus,

I have updated the question. Could you please check and let me know if it is possible?

sdaruna · January 20, 2016, 7:48pm

I cannot alter the document or convert it to JSON, because the client wants to be able to see the document as is whenever they want.

magnusbaeck · January 20, 2016, 9:08pm

You could store the XML documents as-is into ES, but then they will essentially be treated as text. That might not satisfy your search requirements. Or, convert them to JSON and store the original XML somewhere (either inside ES or outside) so that they work well for searches but also can be used to extract the original XML.
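The second option can be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not a definitive implementation: the field names `doc_name`, `records`, and `raw_xml` are hypothetical, and the resulting JSON body would still have to be submitted to ES by whatever client you use.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_es_doc(doc_name, raw_xml):
    """Build a JSON-ready dict: searchable fields plus the untouched XML."""
    root = ET.fromstring(raw_xml)
    return {
        "doc_name": doc_name,  # hypothetical field holding the document name
        "records": [r.text for r in root.iter("Record")],  # searchable values
        "raw_xml": raw_xml,  # original document, kept verbatim for the client
    }


doc = xml_to_es_doc("Doc 1", "<Document><Record>A1</Record></Document>")
body = json.dumps(doc)  # JSON body that could be sent in an index request
```

Searches would run against the extracted fields, while `raw_xml` lets the client get the document back exactly as it was submitted.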

"I need to index with the doc names,

and I need to get the XML nodes (not just the data)"

I don't understand this part.

sdaruna · January 20, 2016, 9:14pm

Hi Magnus,

We are planning to use a distributed computing framework like Spark on top of ES.
If we store the whole XML data as text in ES and then parse it, that would be double work.

Splunk has a feature for working with XML. If you specify a breaking string or XPath, it repeatedly breaks the data there and gives you events. For the XML above, we would get a list of <Record> nodes.

Does ES have any such solution? Let's say I specify a string as the line breaker: can it return all the events in a file by breaking on that string at search time?

magnusbaeck · January 21, 2016, 3:08pm

"If we store the whole XML data as text in ES and then parse it, that would be double work."

How so?

Elasticsearch tokenizes data upon input, not when searching. It should be possible to have an analyzer that tokenizes XML as you describe, but that will take place when you submit the documents. If you want to extract the original document it needs to be stored alongside.
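One way to get Splunk-like events is to do the breaking yourself before submitting, so that each node becomes its own ES document with its markup stored alongside. The sketch below is an assumption-laden illustration rather than an ES feature: it runs outside ES, the `split_into_events` helper and its field names are hypothetical, and the two-record sample document is made up for the example.

```python
import xml.etree.ElementTree as ET


def split_into_events(doc_name, raw_xml, tag="Record"):
    """Yield one dict per matching node, keeping the node's XML markup."""
    root = ET.fromstring(raw_xml)
    for position, node in enumerate(root.iter(tag)):
        node.tail = None  # drop any text after the node so only the node is serialized
        yield {
            "doc_name": doc_name,  # hypothetical field: which file the node came from
            "position": position,  # order of the node within the file
            "value": node.text,  # the data, for searching
            "node_xml": ET.tostring(node, encoding="unicode"),  # the node itself
        }


sample = "<Document><Record>A1</Record><Record>A2</Record></Document>"
events = list(split_into_events("Doc 1", sample))
```

Each dict in `events` could then be indexed as a separate document, so a search returns the `<Record>` node rather than the whole file.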
