Wouldn't it be easier to do it on the client side? I think doing it on the
server side could get too complex: if you add HTML as a mapping type, you
have to specify and handle a lot of other things, because every HTML
document has some structure. It has a head and a body; the head usually
contains a lot of metadata, and the body contains paragraphs, divs,
headlines, links, etc.
Can you give an example of how you would like to use it?
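To make the client-side idea concrete, here is a minimal sketch, assuming a
Python client and using only the standard library's html.parser rather than
TagSoup or Tika. The names strip_html and prepare_for_indexing, and the
"body" field in the usage note, are hypothetical and only for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; drops tags and the contents of script/style."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent blocks do not run together,
        # then collapse the whitespace left behind by the removed tags.
        # Trade-off: a word split by inline markup gets a space inserted.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def prepare_for_indexing(doc, html_fields):
    """Return a copy of doc with the named fields reduced to plain text."""
    cleaned = dict(doc)
    for field in html_fields:
        if isinstance(cleaned.get(field), str):
            cleaned[field] = strip_html(cleaned[field])
    return cleaned
```

You would call something like prepare_for_indexing(post, ["body"]) just
before sending the JSON document to the index, so the original HTML can
still be stored elsewhere if you need it for display.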
On Fri, Aug 6, 2010 at 9:38 PM, James Cook email@example.com wrote:
It would be cool to add TagSoup as a mapping type. Would that be the most
appropriate place for this functionality to reside?
On Thu, Aug 5, 2010 at 2:27 PM, Lukáš Vlček firstname.lastname@example.org wrote:
AFAIK Tika uses TagSoup for parsing HTML documents. So you can either
use TagSoup directly or try Tika's Swing GUI to test the output of your
See http://tika.apache.org/0.7/gettingstarted.html where, under "Using Tika
as a command line utility", you can find the -g or --gui option.
But if you have a JSON document that can contain HTML inside (as the value
of some property), then I think you will have to do this manually. The
attachment plugin assumes that the whole input document is of a single
content type; there is no support for parsing documents with nested
On Thu, Aug 5, 2010 at 7:50 PM, James Cook email@example.com wrote:
I see that the attachments plugin uses Tika under the hood to
intelligently index text content embedded in other formats.
We have a situation where we are storing a field in our JSON object that
can contain HTML content. (Think the content of a discussion thread.)
Is there a strategy or technique to make sure the HTML tags are not
indexed? An existing mapping type?