How to use html_strip Char filter?

Mauricio_Alarcon · October 26, 2011, 6:57pm

Guys, I need to remove all HTML from a specific field in my
document corpus. I based my configuration on
http://www.elasticsearch.org/guide/reference/index-modules/analysis/custom-analyzer.html

and ended with this inside my elasticsearch.yml

index :
    analysis :
        analyzer:
            descriptionAnalyzer:
                type: custom
                tokenizer: standard
                filter: standard
                char_filter: html_strip

In the mapping, I pointed the field I wanted at this analyzer:
"description" : { "type" : "string", "index" : "analyzed",
"analyzer" : "descriptionAnalyzer" }

After indexing a few docs, I confirmed that the analyzer was set on the field:

 "description" : {
      "analyzer" : "descriptionAnalyzer",
      "type" : "string"
    },

But I'm still getting HTML in my search results.

What am I doing wrong?

Cheers

~M

kimchy · October 26, 2011, 8:48pm

What do you mean when you say you get HTML in your search results? Do you get
it as part of the _source? If so, that makes sense, since the _source is just
the document you indexed.
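
To illustrate the point, here is a toy Python model. It is not Elasticsearch code: `ToyIndex`, the regex-based tag stripper, and the simple word tokenizer are all made-up stand-ins. It only sketches the idea that analysis changes the searchable terms while the stored source comes back exactly as it was sent.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude stand-in for the html_strip char filter


def analyze(text):
    """Strip tags, then lowercase and split into word tokens."""
    stripped = TAG_RE.sub(" ", text)
    return re.findall(r"\w+", stripped.lower())


class ToyIndex:
    """Hypothetical in-memory index: terms are analyzed, the source is not."""

    def __init__(self):
        self.sources = {}   # doc id -> original document, stored untouched
        self.postings = {}  # term -> set of doc ids containing that term

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc
        for term in analyze(doc["description"]):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        ids = sorted(self.postings.get(term.lower(), set()))
        return [{"_id": i, "_source": self.sources[i]} for i in ids]


idx = ToyIndex()
idx.index("1", {"description": "<p>Senior <b>Java</b> developer</p>"})

hits = idx.search("java")
print(hits[0]["_source"]["description"])  # the markup is still in the source
print(idx.search("b"))                    # tag names never became terms
```

In this model the document matches on "java" but not on the tag name "b", and the hit still carries the original HTML. That matches the behaviour described in this thread.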

phobos182 · October 27, 2011, 1:33pm

I submitted something like this a few months back. The html_strip character filter just removes the markup from what goes into the index, but not from the stored _source / value.

We use JSoup to remove HTML entries before indexing on the client side.
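
A rough sketch of the same client-side approach, written in Python with the standard library's `html.parser` rather than JSoup. The `TextExtractor` class, the `strip_html` helper, and the sample document are made up for illustration, and the sketch does not treat script or style elements specially.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


# Clean the field before the document is sent for indexing, so that the
# stored _source holds plain text as well.
doc = {"description": "<p>Senior <b>Java</b> developer &amp; team lead</p>"}
clean_doc = dict(doc, description=strip_html(doc["description"]))
print(clean_doc)
```

Because the cleaned document is what gets indexed, both the searchable terms and the returned _source are free of markup, and no html_strip char filter is needed for that field.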

Mauricio_Alarcon · October 27, 2011, 1:40pm

Thanks Shay,

I spoke too soon; it seems I had a brain fart. After posting the
message I started playing with the analyzer, and it indeed does its
job just right. I gisted it here for the record:

https://gist.github.com/mauricioalarcon/1319548

html_strip test

# Delete previous just in case
curl -XDELETE localhost:9200/facettests/facettest/

# Set the mapping for our facet
curl -XPUT localhost:9200/facettests/facettest/_mapping -d '{
"facettest" : {
    "properties" : {
        "employmentHistory" : {
            "properties" : {
                "name" : {

(The gist preview is truncated here; the full file is at the link above.)

For a moment I thought that using the analyzer together with
store=yes would also get rid of the HTML in my _source, which would
have been a quick and dirty way to remove unwanted formatting for my
view. But indeed, that doesn't make any sense.

Sorry for the false alarm

Cheers

~M

Mauricio_Alarcon · October 27, 2011, 1:41pm

Thanks for the tip, phobos182. This is exactly what I was looking
for.

~M
