AFAIK Tika uses TagSoup to parse HTML documents, so you can either use
TagSoup directly or try Tika's Swing GUI to check the output for your
documents first.
See "Getting Started with Apache Tika" on the Apache Tika site: under "Using
Tika as a command line utility" you will find the -g (or --gui) option.
But if you have a JSON document that can contain HTML (as the value of some
property), then I think you will have to do this manually. The attachment
plugin assumes that the whole input document has a single content type;
there is no support for parsing documents with nested content types.
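For what it's worth, the "do this manually" route could look roughly like the sketch below, run on the client before indexing. This is only an illustration in Python using the standard library's html.parser, not TagSoup or Tika; the function names and the field names "body" and "body_text" are made up for the example. It does not skip script or style content, which should be acceptable for simple fragments but not for whole pages.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so that "<p>a</p><p>b</p>" does not
        # become "ab", then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(fragment):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.text()


def prepare_document(raw_json, html_field, text_field):
    """Parse a JSON document and add a plain-text copy of one HTML property."""
    doc = json.loads(raw_json)
    doc[text_field] = strip_html(doc[html_field])
    return doc
```

You would call it as prepare_document(raw, "body", "body_text") and index the returned dict. Keeping both fields is exactly what increases storage, so you may prefer to overwrite the HTML field if the markup is not needed at search time.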
Wouldn't it be easier to do it on the client side? I think doing it on the
server side could get too complex: if you add HTML as a type to the mapping,
you have to specify and handle a lot of other things, because every HTML
document has some structure. It has a head and a body; the head usually
contains a lot of metadata, and the body contains paragraphs, divs,
headlines, links, etc.
Can you give an example of how you would like to use it?
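To give an idea of how much structure there is to handle, here is a rough sketch that pulls out just the title, the named meta tags, the headings, and the link targets. It is written in Python with the standard library's html.parser purely as an illustration; the class and function names are invented for this example, and a real HTML type would have to decide on many more elements than these.

```python
from html.parser import HTMLParser

_CAPTURED = ("title", "h1", "h2", "h3")


class PageOutline(HTMLParser):
    """Collects a few structural properties of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}
        self.headings = []
        self.links = []
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in _CAPTURED:
            self._capture = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.headings.append((tag, text))
            self._capture = None


def outline(html):
    """Return the collected properties as a dict (hypothetical helper)."""
    parser = PageOutline()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "meta": parser.meta,
        "headings": parser.headings,
        "links": parser.links,
    }
```

Even this small subset already yields four separate properties that would each need a mapping, which is the complexity I mean.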
Agreed. In any case, I am leaning towards more native, "attachment"-like
support for specific types, without using Tika. I am planning to tackle this
after 0.9.1.
Thanks for opening the feature request. We could implement this on our
client side, but it would cause a large increase in our storage needs to
store the HTML and also provide a field that can be tokenized containing the
raw text content from that same HTML field.
Is it feasible for someone with little knowledge of ES internals to add a
built-in tokenizer like this? (Not sure if tokenizer is the right
terminology. Is it a mapping or an analyzer?)
Do you just want to strip out the HTML markup, or do you also want the HTML
parsing to add properties automatically, like title, tags, and so on (on top
of the default body-level text)?
Just a note: if the latter is required, I think this can get more complex,
especially when you consider that HTML5 adds a lot of new (and useful)
elements: http://diveintohtml5.org/semantics.html#new-elements
I was hoping for just the stripping of tags. I'm indexing HTML fragments
that users create in a CMS, so they are editing with something like TinyMCE.