How to use html_strip Char filter?


(Mauricio Alarcon) #1

Guys, I need to strip all HTML from a specific field in my
document corpus. I based my configuration on
http://www.elasticsearch.org/guide/reference/index-modules/analysis/custom-analyzer.html

and ended up with this in my elasticsearch.yml:

index:
  analysis:
    analyzer:
      descriptionAnalyzer:
        type: custom
        tokenizer: standard
        filter: standard
        char_filter: html_strip

In the mappings I pointed the field I wanted at this analyzer:

"description" : { "type" : "string", "index" : "analyzed",
"analyzer" : "descriptionAnalyzer" }

After indexing a few docs, I confirmed that the mapping uses it:

 "description" : {
      "analyzer" : "descriptionAnalyzer",
      "type" : "string"
    },

But I'm still getting HTML in my search results.

What am I doing wrong?

Cheers

~M


(Shay Banon) #2

What do you mean when you say you get HTML in your search results? Do you
get it as part of the _source? If so, that makes sense, since the _source is
just the document you indexed.
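
To make the distinction concrete, here is a toy model in Python. It is not Elasticsearch code: ToyIndex, the regex-based tag stripping, and the word splitting are all hypothetical stand-ins for the real html_strip char filter and standard tokenizer. It only illustrates that analysis works on a copy of the text used to build index terms, while the stored document is returned untouched.

```python
import re

# Crude stand-ins (assumptions, not the real Elasticsearch components):
TAG_RE = re.compile(r"<[^>]+>")   # plays the role of the html_strip char filter
WORD_RE = re.compile(r"\w+")      # plays the role of the standard tokenizer


class ToyIndex:
    """Hypothetical miniature index: keeps the source verbatim, indexes analyzed terms."""

    def __init__(self):
        self.sources = {}   # doc id -> original document, never modified
        self.postings = {}  # term -> set of doc ids containing that term

    def index(self, doc_id, doc):
        # The document is stored exactly as received; this is what a hit returns.
        self.sources[doc_id] = doc
        # The char filter only touches the copy of the text that gets tokenized.
        text = TAG_RE.sub(" ", doc["description"])
        for term in WORD_RE.findall(text.lower()):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        ids = sorted(self.postings.get(term.lower(), ()))
        return [self.sources[i] for i in ids]


idx = ToyIndex()
idx.index(1, {"description": "<p>Hello <b>world</b></p>"})

print(idx.search("world"))  # the hit still carries the original markup
print(idx.search("b"))      # empty: tag names were never indexed as terms
```

In this sketch a search for "world" matches, a search for the tag name "b" does not, and the returned document still contains the markup, which mirrors the behavior described above.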

(In reply to maverick's message of Wed, Oct 26, 2011 at 8:57 PM.)


(phobos182) #3

I submitted something like this a few months back. The html_strip character filter only removes the markup from what goes into the index, not from the stored _source / value.

We use JSoup on the client side to strip the HTML before indexing.
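
A minimal sketch of the same client-side idea, written in Python with the standard library's html.parser instead of JSoup (a swapped-in tool, not what is used above). The names strip_html and prepare_for_indexing and the "description" field default are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags; skips <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities like &amp; become "&"
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join with spaces and collapse whitespace. Note this also inserts a
        # space where an inline tag splits a word (e.g. "wor<b>ld</b>").
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text()


def prepare_for_indexing(doc, field="description"):
    """Return a copy of doc with HTML removed from one field (hypothetical helper)."""
    cleaned = dict(doc)
    cleaned[field] = strip_html(doc[field])
    return cleaned


doc = {"id": 1, "description": "<p>Hello <b>world</b> &amp; friends</p>"}
print(prepare_for_indexing(doc))
```

Because the cleaned copy is what gets sent to the index, the _source returned in search hits is already free of markup, and no char filter is needed for that field.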


(Mauricio Alarcon) #4

Thanks Shay,

I spoke too soon; it seems I had a brain fart. After posting the
message I started playing with the analyzer, and it does indeed do
its job just right. I gisted it here for the record.

For a moment I thought that combining the analyzer with store=yes
would also get rid of the HTML in my source, which would have been a
quick and dirty way to remove unwanted formatting for my view. But
that doesn't make any sense.

Sorry for the false alarm

Cheers

~M

(In reply to Shay Banon's message of Oct 26, 4:48 pm.)


(Mauricio Alarcon) #5

Thanks for the tip, phobos182. This is exactly what I was looking
for.

~M

(In reply to phobos182's message of Oct 27, 9:33 am.)

