Formatted text mapping for elasticsearch

Emery_2 · December 29, 2012, 12:34am

Hello, I have text content in XML format with formatting tags inside the text:
... water chemical formula is H₂O and the energy is
E=MC² formula.
How do I properly convert this example to JSON for Elasticsearch while
keeping search and highlighting consistent?

--

jprante · December 29, 2012, 7:06pm

Your challenges are:

the "abstract" XML element is semantic markup; it means "here comes an
abstract"
the "sub"/"sup" elements are (X)HTML markup, and they mean "display me in
subscript/superscript style on your favorite output device"

The "abstract" element needs to be parsed, and you need to decide how to
index abstracts in ES. The "sub"/"sup" elements need to be dropped. Display
markup mixed in with your textual content will render your index unusable.
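
A minimal sketch of that preprocessing step in Python. The sample XML is an assumption reconstructed from the description above (the original markup did not survive in the question), and the single "abstract" field name is also only an illustration:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input, reconstructed from the description in this thread.
SAMPLE_XML = (
    "<abstract>water chemical formula is H<sub>2</sub>O "
    "and the energy is E=MC<sup>2</sup> formula.</abstract>"
)

def abstract_to_doc(xml_text):
    """Parse an <abstract> element and return a dict ready to send as JSON."""
    root = ET.fromstring(xml_text)
    # itertext() yields the text of the element and all of its descendants
    # in document order, so the <sub>/<sup> tags disappear but their
    # content stays in place.
    plain = "".join(root.itertext())
    return {"abstract": plain}

doc = abstract_to_doc(SAMPLE_XML)
print(json.dumps(doc))
```

This does the stripping before indexing, so both the indexed text and the stored document are free of display markup.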

That's the reason why the Lucene community uses the HTML strip char filter
in the analysis phase, and so does
ES: http://www.elasticsearch.org/guide/reference/index-modules/analysis/htmlstrip-charfilter.html
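
As a rough illustration, index settings that wire the html_strip char filter into a custom analyzer could be built like this. The analyzer name "stripped_text" and the choice of tokenizer and token filter are assumptions for the sketch; the field mapping is left out because its syntax depends on the ES version:

```python
import json

# Custom analyzer that removes HTML-like tags before tokenizing.
# "html_strip" is the built-in char filter; "stripped_text" is a
# made-up analyzer name used only in this sketch.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "stripped_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["html_strip"],
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

# JSON body to send when creating the index.
body = json.dumps(index_settings, indent=2)
print(body)
```

Note that a char filter only affects analysis: the stored source still contains the tags, so highlighted snippets would be built from the tagged text.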

The ES highlighting uses HTML-like tags by default, but this is just for
convenience; you can set other pre_tags/post_tags, including
non-XML ones: http://www.elasticsearch.org/guide/reference/api/search/highlighting.html
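
For example, a search request body with non-XML highlight markers might look like the sketch below. The field name "abstract", the query term and the "[[" / "]]" markers are assumptions chosen for illustration:

```python
import json

# Search body that highlights matches in the (assumed) "abstract" field
# using plain-text markers instead of the default HTML-like tags.
search_body = {
    "query": {"match": {"abstract": "water"}},
    "highlight": {
        "pre_tags": ["[["],
        "post_tags": ["]]"],
        "fields": {"abstract": {}},
    },
}

print(json.dumps(search_body, indent=2))
```

Markers like these are easy to convert into whatever display markup the front end needs.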

An alternative to stripping HTML tags is to convert the text to Markdown
(or another tag-less markup language) before indexing.

An XSL stylesheet for this is here:
http://getsymphony.com/download/xslt-utilities/view/20573/

Assuming the Markdown control characters do not interfere with your
Lucene analysis and word search, you could even add a Markdown formatter to
present your ES docs / snippets.

Jörg

--

jprante · December 29, 2012, 7:32pm

Another note: there are the Unicode characters SUPERSCRIPT TWO (U+00B2) and
SUBSCRIPT TWO (U+2082), which may help as replacements before indexing.
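
A small Python sketch of that replacement, assuming the digits are wrapped as <sub>2</sub> / <sup>2</sup> in the source text (the function name is made up for this example). Whether your analyzer keeps these characters inside a token is something to check with your own analysis chain:

```python
import re

# Translation tables from ASCII digits to Unicode superscript and
# subscript digits (superscript 1, 2, 3 live in the Latin-1 block).
SUPERSCRIPT = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
SUBSCRIPT = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)

def replace_digit_markup(text):
    """Replace <sup>digits</sup> and <sub>digits</sub> with Unicode characters."""
    text = re.sub(
        r"<sup>(\d+)</sup>",
        lambda m: m.group(1).translate(SUPERSCRIPT),
        text,
    )
    text = re.sub(
        r"<sub>(\d+)</sub>",
        lambda m: m.group(1).translate(SUBSCRIPT),
        text,
    )
    return text

print(replace_digit_markup("H<sub>2</sub>O and E=MC<sup>2</sup>"))
```

Only purely numeric sub/sup content is handled here; anything else is left untouched and would still need stripping.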

Jörg

--