How to index a full XML file into Elasticsearch

Hi,

I need to load a full XML file into ES for our requirement, and when searching I need to get the XML nodes back.

Eg:

My XML file

Let's say it looks like this (a small example):

Doc 1) <Document><Record>A1</Record></Document>
Doc 2) <Document><Record>A2</Record></Document>

I need to index these with the doc names,

and I need to get the XML nodes back (not just the data):

<Record>A1</Record>
---------------------------------
<Record>A2</Record>

How could I do that with Elasticsearch?

Elasticsearch stores JSON documents so if you want to search specific XML nodes you have to convert the XML document to a JSON document.
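As a rough illustration of that conversion, here is a minimal Python sketch using only the standard library. The helper names (element_to_dict, xml_to_json) are made up for this example, and it ignores XML attributes and mixed content, so treat it as a starting point rather than a complete converter.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively turn an element into nested dicts and strings.

    Assumption: attributes and mixed text/child content are ignored.
    """
    children = list(elem)
    if not children:
        # Leaf node: keep only its text.
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag is collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def xml_to_json(xml_text):
    """Parse an XML string and return it as a JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)})


print(xml_to_json("<Document><Record>A1</Record></Document>"))
# prints {"Document": {"Record": "A1"}}
```

The resulting JSON string is what you would send to Elasticsearch as the document body.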

Hint: Format XML as code to avoid having it stripped from its tags. Use the preview pane.

Hi Magnus,

I have updated the question. Could you please check and let me know if it is possible?

I cannot alter the document or convert it to JSON, because the client wants to be able to see the document as is whenever they want.

You could store the XML documents as-is into ES, but then they will essentially be treated as text. That might not satisfy your search requirements. Or, convert them to JSON and store the original XML somewhere (either inside ES or outside) so that they work well for searches but also can be used to extract the original XML.
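A minimal sketch of the second option, in Python: one document body that carries searchable values, the serialized <Record> nodes, and the untouched original XML. The field names (doc_name, records, record_nodes, raw_xml) and the helper build_es_document are hypothetical, and the code assumes <Record> elements sit directly under the root as in the example above.

```python
import xml.etree.ElementTree as ET


def build_es_document(doc_name, xml_text):
    """Build a dict to index as one Elasticsearch document.

    Assumption: <Record> elements are direct children of the root.
    All field names here are illustrative, not an Elasticsearch convention.
    """
    root = ET.fromstring(xml_text)
    records = root.findall("Record")
    return {
        "doc_name": doc_name,
        # Plain values, convenient for searching.
        "records": [(r.text or "").strip() for r in records],
        # Each node serialized back to XML, so a hit can return the node itself.
        "record_nodes": [ET.tostring(r, encoding="unicode").strip() for r in records],
        # The original document, kept verbatim for the client.
        "raw_xml": xml_text,
    }


doc = build_es_document("Doc 1", "<Document><Record>A1</Record></Document>")
print(doc["record_nodes"])
# prints ['<Record>A1</Record>']
```

Since raw_xml is only there for retrieval, you could probably set "index": false for it in the mapping so that it is kept in the source but not made searchable.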

I need to index these with the doc names,

and I need to get the XML nodes back (not just the data)

I don't understand this part.

Hi Magnus,

We are planning to use a distributed computing framework like Spark on top of ES.
If we store the whole XML data as text in ES and then parse it, it would be double work.

Splunk has a feature for working with XML. If you specify a breaking string or XPath, it breaks the data at each occurrence and gives you events. For the XML above, we would get a list of <Record> nodes.

Does ES have any such solution? Let's say I specify a string as the line breaker: can it return all the events of the file by breaking on that string at search time?

If we store the whole XML data as text in ES and then parse it, it would be double work.

How so?

Elasticsearch tokenizes data upon input, not when searching. It should be possible to have an analyzer that tokenizes XML as you describe, but that will take place when you submit the documents. If you want to extract the original document it needs to be stored alongside.
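To give an idea of what such an analyzer might look like, here is a sketch of index settings written as a Python dict. It uses the built-in pattern tokenizer with its pattern and group parameters; the names record_tokenizer and record_analyzer are made up, and I have not tested this against a cluster. The small function underneath only imitates the tokenizer locally with Python's re module (Elasticsearch uses Java regular expressions, which behave the same for a pattern this simple).

```python
import re

# Hypothetical index settings: emit the content of every <Record> node as a token.
record_index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "record_tokenizer": {
                    "type": "pattern",
                    "pattern": "<Record>(.*?)</Record>",
                    "group": 1,  # use capture group 1 as the token instead of splitting
                }
            },
            "analyzer": {
                "record_analyzer": {
                    "type": "custom",
                    "tokenizer": "record_tokenizer",
                }
            },
        }
    }
}


def simulate_record_tokens(xml_text):
    """Local approximation of the tokens the tokenizer above would emit."""
    tokenizer = record_index_settings["settings"]["analysis"]["tokenizer"]
    pattern = tokenizer["record_tokenizer"]["pattern"]
    return re.findall(pattern, xml_text)


print(simulate_record_tokens(
    "<Document><Record>A1</Record><Record>A2</Record></Document>"
))
# prints ['A1', 'A2']
```

Keep in mind that these tokens only decide what a query can match. A search hit still returns the stored document, not the individual tokens.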