Strip_HTML on indexing does not store results?


(phobos182) #1

Here is a copy of my analyzer, which includes the html_strip character filter. When I retrieve documents from a field stored with this analyzer, the HTML markup is still in the document. Does html_strip only strip the text for term indexing, or does it also strip the content before it is stored? I was expecting the field to be retrieved without any HTML markup.

Thanks,

-- Config --

curl -XPOST 'http://localhost:9200/test' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_analyzer": {
            "type": "custom",
            "char_filter": [
              "scrub_html"
            ],
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase"
            ]
          }
        },
        "char_filter": {
          "scrub_html": {
            "type": "html_strip",
            "read_ahead": 4096
          }
        }
      }
    }
  },
  "mappings": {
    "media": {
      "_source": {
        "compress": true
      },
      "_size": {
        "enabled": true,
        "store": "yes"
      },
      "properties": {
        "content": {
          "include_in_all": true,
          "omit_norms": true,
          "store": "yes",
          "null_value": "na",
          "analyzer": "test_analyzer",
          "term_vector": "with_positions_offsets",
          "type": "string"
        }
      }
    }
  }
}
'
curl -XPOST 'http://localhost:9200/test/media' -d '
{
  "content": "

Thank you!

"
}
'
curl -XGET 'http://localhost:9200/test/_search?q=*&fields=content'
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 8,
    "successful": 8,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "media",
        "_id": "INyEgpcISFOOb8QrBUuUKQ",
        "_score": 1,
        "fields": {
          "content": "

Thank you!

"
        }
      }
    ]
  }
}

(Greg Brown) #2

I ran into the same problem before. The html_strip filter does not save
the document with the tags stripped out, so any highlights you build from
the index (for instance) will still contain the original HTML. I
solved it by stripping the HTML tags myself before indexing the document.
-Greg
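
For anyone wanting to try the workaround above, here is a minimal sketch of client-side stripping before indexing. It assumes Python and its standard-library html.parser; the names strip_html and build_document are hypothetical, and only the "content" field name comes from the mapping in post #1. It is not how html_strip itself works, just one way to strip tags yourself.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse whitespace runs left behind where tags used to be.
        return " ".join("".join(self._chunks).split())


def strip_html(raw):
    """Return the text of `raw` with the markup removed.

    Limitation of this sketch: the text inside <script> and <style>
    elements is kept, because the parser reports it as ordinary data.
    """
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flushes any buffered trailing text
    return parser.text()


def build_document(raw_html):
    """Hypothetical helper: JSON body to POST instead of the raw HTML."""
    return json.dumps({"content": strip_html(raw_html)})
```

The string returned by build_document would then be sent as the body of the same POST to /test/media, so the stored field (and any highlight taken from it) no longer contains markup.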

--
View this message in context:http://elasticsearch-users.115913.n3.nabble.com/Strip-HTML-on-indexin...
Sent from the ElasticSearch Users mailing list archive at Nabble.com.


(fashionalwallet) #3
  • deleted -

(Shay Banon) #4

By the way, we could add an "html" type that strips the markup and stores the stripped content (and indexes it as well).


(phobos182) #5

Having a core type of "html" would be a big convenience factor. The _source field could contain the raw document (with markup), and leave the fields as scrubbed and stripped. I would get the best of both worlds by having my terms not contain markup for tag clouds, and the stored body not having markup for highlighting.
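
To make the three views concrete, here is a toy model of what such a field could produce. It is only a sketch under stated assumptions: index_html_field and the "stored"/"terms" keys are hypothetical names, the tokenizer is a simple word split with lowercasing, and none of this is an Elasticsearch API.

```python
import re
from html.parser import HTMLParser


class _Text(HTMLParser):
    """Gathers text nodes; tags are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def index_html_field(raw):
    """Toy model of the proposed "html" type (hypothetical, not an ES API)."""
    parser = _Text()
    parser.feed(raw)
    parser.close()
    # Stored value: markup removed, whitespace normalised (for highlighting).
    stored = " ".join("".join(parser.parts).split())
    # Indexed terms: lowercased words from the stripped text (for tag clouds).
    terms = re.findall(r"\w+", stored.lower())
    # _source keeps the raw document, markup included.
    return {"_source": raw, "stored": stored, "terms": terms}
```

With the raw document kept only in _source, both the terms and the stored body come from the stripped text, which is the "best of both worlds" described above.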


(Karel Minarik) #6

That's a great idea, I've talked to many people who would seriously
enjoy this.


(Lukáš Vlček) #7

+1!

It would be good to have some configuration options here, however. An
option telling the HTML cleaner which tags to remove and which to keep
when storing the original HTML field content could be very useful (for
example, for document previews).

I think jsoup could be used for this. It has a nice API for cleaning
HTML and lets you specify the set of tags to keep (the set can also be
customized); everything else is removed.

Check
http://jsoup.org/apidocs/org/jsoup/Jsoup.html#clean(java.lang.String,%20org.jsoup.safety.Whitelist)
http://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html

Just my two cents.
Lukas
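
The keep/remove idea can be sketched without jsoup as well. The following is a rough Python illustration using only the standard library, not jsoup's API; clean_html and its default tag set are assumptions made for the example. Allowed tags are kept (with their attributes dropped), all other tags are removed, and text is re-escaped.

```python
from html import escape
from html.parser import HTMLParser


class _WhitelistCleaner(HTMLParser):
    """Re-emits only allowed tags; every other tag is dropped."""

    def __init__(self, allowed_tags):
        super().__init__(convert_charrefs=True)
        self._allowed = frozenset(t.lower() for t in allowed_tags)
        self._out = []

    def handle_starttag(self, tag, attrs):
        if tag in self._allowed:
            # Attributes are intentionally discarded in this sketch.
            self._out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self._allowed:
            self._out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text was unescaped by the parser, so escape it again on output.
        self._out.append(escape(data, quote=False))

    def result(self):
        return "".join(self._out)


def clean_html(raw, allowed_tags=("b", "i", "em", "strong", "p")):
    """Hypothetical helper: keep `allowed_tags`, strip all other markup."""
    cleaner = _WhitelistCleaner(allowed_tags)
    cleaner.feed(raw)
    cleaner.close()
    return cleaner.result()
```

Passing an empty tag set strips everything, which would suit the indexed text, while a small set of formatting tags could be kept for the stored preview. A real implementation would also need to handle script and style content and void elements such as br.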


(Administrator-2) #8

Definitely. While .NET has great support for HTML in the form of the HTML Agility Pack (great for stripping documents), it would be great for ES to have intimate knowledge of this document type.

I assume storage of the original document would be provided on top of the parsed version?

- Nick


(Shay Banon) #9

Makes sense, let's open an issue so we can keep track of this. It should be pretty simple to add an html type (even as a plugin, similar to the attachments one). If someone is up for the challenge, I am here to help!


(Lukáš Vlček) #10

Just created a ticket for it:

