Hello, I have text content in XML format with formatting tags inside the text:
... water chemical formula is H2O and the energy is
How do I properly convert this example to JSON for Elasticsearch while
keeping the search features and highlighting consistent?
Your challenges are:
the "abstract" XML element is semantic markup, it means "here comes an
the "sub"/"sup" elements are (X)HTML display markup, and they mean "display me
in superscript/subscript style on your favorite output device"
The "abstract" element needs to be parsed, and you need to decide how to
index abstracts in ES. The "sub"/"sup" elements need to be dropped. Display
markup in your index mixed up with your textual content will render your
That's the reason why the Lucene community uses an HTML strip filter in the
analysis phase, and so does
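
As a rough sketch of what HTML stripping in the analysis phase can look like on the ES side, the index body below wires the built-in "html_strip" character filter into a custom analyzer. The analyzer name "abstract_text" and the field name "abstract" are made up for illustration, and the mapping layout assumes a recent Elasticsearch version (older releases used a different mapping structure).

```python
import json

# Hypothetical names: "abstract_text" (analyzer) and "abstract" (field).
# "html_strip" is the built-in character filter that removes markup
# before the tokenizer sees the text.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "abstract_text": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "abstract": {"type": "text", "analyzer": "abstract_text"}
        }
    },
}

# This JSON would be sent as the body of the index creation request.
print(json.dumps(index_body, indent=2))
```

With this approach the stored source still contains the tags, only the indexed tokens are cleaned.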
ES highlighting wraps matches in HTML-like tags by default, but this is just
for convenience. You can set other pre_tags/post_tags, including non-XML ones:
http://www.elasticsearch.org/guide/reference/api/search/highlighting.html
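
To illustrate the pre_tags/post_tags point, here is a minimal search body that asks for highlighting with plain-text markers instead of HTML-like tags. The field name "abstract", the query text and the "[[" / "]]" markers are assumptions chosen for the example.

```python
import json

# Hypothetical field name "abstract"; the markers can be any strings.
search_body = {
    "query": {"match": {"abstract": "water"}},
    "highlight": {
        "pre_tags": ["[["],
        "post_tags": ["]]"],
        "fields": {"abstract": {}},
    },
}

# This JSON would be sent as the body of a _search request.
print(json.dumps(search_body, indent=2))
```

Non-XML markers like these are easy to post-process in the client, for example to re-insert display markup after the snippet has been produced.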
An alternative to stripping HTML tags is to convert the text to Markdown
(or another tag-less markup language) before indexing.
An XSL stylesheet is here
Assuming the Markdown control characters do not interfere with your
Lucene analysis and word search, you could even add a Markdown formatter to
present your ES docs / snippets.
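
A small sketch of the Markdown idea, done in Python rather than XSL: it rewrites "sub"/"sup" elements into Pandoc-style Markdown, where ~text~ is subscript and ^text^ is superscript. Core Markdown has no subscript/superscript syntax, so the Pandoc flavor is an assumption here, as are the function name and the sample text.

```python
import re


def inline_tags_to_markdown(fragment: str) -> str:
    """Rewrite <sub>/<sup> elements as Pandoc-style ~sub~ and ^sup^ markers."""
    fragment = re.sub(r"<sub>(.*?)</sub>", r"~\1~", fragment, flags=re.S)
    fragment = re.sub(r"<sup>(.*?)</sup>", r"^\1^", fragment, flags=re.S)
    return fragment


print(inline_tags_to_markdown("H<sub>2</sub>O and E = mc<sup>2</sup>"))
# prints: H~2~O and E = mc^2^
```

Whether "H~2~O" still matches a search for "H2O" depends on how your tokenizer treats the "~" characters, so this is worth testing against your analyzer first.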
Another note: there are Unicode characters SUPERSCRIPT TWO U+00B2 and
SUBSCRIPT TWO U+2082, which may help as a replacement before indexing.
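
The Unicode replacement could look roughly like the sketch below. It parses the XML, replaces digits inside "sub"/"sup" elements with the corresponding Unicode subscript/superscript digits, drops all other tags, and emits a JSON document. The sample XML, the "abstract" field name and the flatten() helper are illustrative assumptions, and only digits are translated.

```python
import json
import xml.etree.ElementTree as ET

# Digit translation tables: U+2080..U+2089 are the subscript digits;
# the superscript digits are U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079.
SUB = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)
SUP = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)


def flatten(elem: ET.Element) -> str:
    """Return the text of elem with tags dropped and sub/sup digits replaced."""
    parts = [elem.text or ""]
    for child in elem:
        inner = flatten(child)
        if child.tag == "sub":
            inner = inner.translate(SUB)
        elif child.tag == "sup":
            inner = inner.translate(SUP)
        parts.append(inner)
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


# Hypothetical input, not the original poster's document.
xml_source = "<abstract>water is H<sub>2</sub>O, area in m<sup>2</sup></abstract>"
doc = {"abstract": flatten(ET.fromstring(xml_source))}
print(json.dumps(doc, ensure_ascii=False))
```

Keep in mind that a query for "H2O" will only match the indexed "H₂O" if your analysis chain normalizes these characters back to plain digits, so check this against your analyzer before adopting it.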