Here is a copy of my analyzer, which includes the strip_html character filter. When retrieving documents from a field stored with this analyzer, it looks like the HTML markup is still in the document. Does strip_html just strip the text for term indexing, or does it strip it from the content before it is stored? I was expecting the document field to be retrieved without any HTML markup.
I had run into the same problem before. The strip_html does not save
the document with the tags stripped out so any highlights you do from
the index (for instance) will still contain the original html. I
solved it by stripping html tags myself before indexing the document.
-Greg
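
For anyone wanting to try the same workaround, here is a minimal sketch of stripping tags on the client before indexing. It is not Greg's actual code: it assumes Python and uses only the standard-library `html.parser` module, and the `strip_html` function name and the `doc` field names are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_html(raw):
    """Return the visible text of an HTML fragment, without markup."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flushes any buffered trailing text
    return parser.text()


# The stripped text is what gets sent to the index, so stored fields and
# highlights no longer contain tags. Field names here are assumptions.
doc = {
    "title": "Example",
    "body": strip_html("<p>Hello <b>world</b> &amp; friends</p>"),
}
```

This is deliberately crude: adjacent block elements with no whitespace between them will run together, so a real HTML library is probably the better choice for messy input.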
Btw, we can have an html "type", which will strip the content and store it as is (and index it as well).
On Thursday, June 9, 2011 at 2:01 AM, Greg B wrote:
I had run into the same problem before. The strip_html does not save
the document with the tags stripped out so any highlights you do from
the index (for instance) will still contain the original html. I
solved it by stripping html tags myself before indexing the document.
-Greg
Having a core type of "html" would be a big convenience factor. The _source field could contain the raw document (with markup), and leave the fields as scrubbed and stripped. I would get the best of both worlds by having my terms not contain markup for tag clouds, and the stored body not having markup for highlighting.
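
Until something like that exists, the closest built-in piece is the character filter itself, which only affects the terms that get indexed. As a rough sketch, the index settings below define a custom analyzer that uses it. The settings are written as a Python dict so they can be dumped to JSON; the analyzer name `html_text` is an assumption, and note that the built-in filter is registered under the name `html_strip`.

```python
import json

# Hypothetical index settings: a custom analyzer that runs the built-in
# html_strip character filter before tokenizing. This only changes the
# indexed terms; the stored field and _source keep the original markup.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "html_text": {  # analyzer name is an assumption
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

# JSON body that would be sent when creating the index.
print(json.dumps(index_settings, indent=2))
```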
It would be good to have some configuration options here however. Having an
option to tell the html cleaner which html tags to remove and which to keep
when storing the original html field content could be very useful (it can be
handy for document preview).
I think that jsoup could be used for this. It has a nice API for cleaning
HTML and lets you specify the set of tags to be removed (this can also be customized).
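
To make the "which tags to keep" idea concrete, here is a small sketch of an allowlist-based cleaner. It is not jsoup and does not use its API: it is a Python stand-in built on the standard-library `html.parser`, and the `AllowlistCleaner` class, the `clean_html` function and the default tag set are all assumptions for illustration.

```python
from html import escape
from html.parser import HTMLParser


class AllowlistCleaner(HTMLParser):
    """Keeps only the tags in the allowlist; everything else is dropped."""

    def __init__(self, allowed_tags):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed_tags)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded even for allowed tags.
        if tag in self.allowed:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Re-escape text so the output is still valid HTML.
        self.out.append(escape(data, quote=False))


def clean_html(raw, allowed_tags=("b", "i", "em", "strong", "p")):
    """Return raw with every tag outside allowed_tags removed."""
    cleaner = AllowlistCleaner(allowed_tags)
    cleaner.feed(raw)
    cleaner.close()
    return "".join(cleaner.out)


preview = clean_html('<div><p class="x">Hi <a href="#">there</a> <b>you</b></p></div>')
```

The text inside removed tags is kept (including script contents), so this suits previews rather than sanitizing untrusted input.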
Definitely; while .NET has some great support for HTML in the form of the HTML Agility Pack (great for stripping documents) it would be great to have ES have intimate knowledge of this document type.
I assume storage of the original document would be provided on top of the parsed version?
On Fri, Jun 10, 2011 at 12:06 PM, Karel Minarik karel.minarik@gmail.com wrote:
That's a great idea, I've talked to many people who would seriously
enjoy this.
Makes sense, let's open an issue so we can keep track of this. It should be pretty simple to add an html type (even as a plugin, similar to the attachments one). If someone is up for the challenge, I am here to help!
On Friday, June 10, 2011 at 2:29 PM, administrator wrote:
Definitely; while .NET has some great support for HTML in the form of the HTML Agility Pack (great for stripping documents) it would be great to have ES have intimate knowledge of this document type.
I assume storage of the original document would be provided on top of the parsed version?