Strip_html

Andrew_Degtiariov · January 14, 2011, 9:58am

Hi!

I'm using html_strip to strip all HTML tags from some fields. But when I
search for messages with "img" in the body, ES also finds messages that
only contain an <img> tag.
What am I doing wrong?

Here is the query:

    {'query': {
        'filtered': {
            'filter': {'term': {'owner_id': '4d07646affc84a6717000011'}},
            'query': {'query_string': {'query': u'img*'}}
        }
    }}

Here is part of my ES config:

    index:
        analysis:
            analyzer:
                message_content:
                    tokenizer: standard
                    char_filter: [html_strip]
                    read_ahead: 1024

Part of the mapping of index "messages":

    "message": {
        "_source": {"enabled": false},
        "properties": {
            "body": {
                "type": "string",
                "index_analyzer": "message_content"
            }
        }
    }

--
Andrew Degtiariov
DA-RIPE

searchersteve · January 14, 2011, 10:36pm

I'm afraid I'm not writing with a solution but a related question...

My reading of the ES manual suggests that the analyzer will run html_strip by default -- in other words, there's no need to declare it in the configuration. But am I correct? See below:

http://www.elasticsearch.com/docs/elasticsearch/index_modules/analysis/charfilter/

Clinton_Gormley · January 15, 2011, 1:11pm

Hi Andrew

I'm using html_strip to strip all HTML tags from some fields. But
when I search for messages with "img" in the body, ES also finds
messages that only contain an <img> tag.
What am I doing wrong?

I created a small HTML strip test here

https://gist.github.com/clintongormley/780895

html_strip.sh

# Analyze text: "the <b>quick</b> bröwn <img src="fox"/> &quot;jumped&quot;"

curl -XPUT 'http://127.0.0.1:9200/foo/'  -d '
{
   "index" : {
      "analysis" : {
         "analyzer" : {
            "test_1" : {
               "char_filter" : [
                  "html_strip"

(excerpt truncated; the full script is in the gist linked above)

This creates an index with two custom analyzers:

    test_1: {
        "tokenizer" : "standard",
        "char_filter" : ["html_strip"]
    }

    test_2: {
        "tokenizer" : "standard",
        "char_filter" : ["html_strip"],
        "filter" : ["standard", "lowercase", "stop", "asciifolding"]
    }
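If you would rather script this than type the JSON by hand, the two analyzer definitions above could be built as a plain Python dict and serialized as the index-creation body. This is only a sketch: `build_settings` is a hypothetical helper name, and the nesting simply follows the visible part of the gist.

```python
import json


def build_settings():
    """Return index settings that define the test_1 and test_2 analyzers."""
    return {
        "index": {
            "analysis": {
                "analyzer": {
                    # html_strip runs before the tokenizer sees the text
                    "test_1": {
                        "tokenizer": "standard",
                        "char_filter": ["html_strip"],
                    },
                    # same as test_1, plus token filters applied afterwards
                    "test_2": {
                        "tokenizer": "standard",
                        "char_filter": ["html_strip"],
                        "filter": ["standard", "lowercase", "stop", "asciifolding"],
                    },
                }
            }
        }
    }


# Serialize to the JSON body that would be sent when creating the index.
body = json.dumps(build_settings(), indent=3)
print(body)
```

Building the body with `json.dumps` avoids hand-written JSON mistakes such as a trailing comma.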

Then I use the analyze call to compare the results of the standard,
test_1 and test_2 analyzers when indexing the text:

"the <b>quick</b> bröwn <img src="fox"/> &quot;jumped&quot;"

The results show that this is working correctly.
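To make the expected behaviour concrete without a running node, here is a rough Python emulation of what the two analyzers should do to that text. It is not how ES implements html_strip or the standard tokenizer: the regex tokenizer, the abbreviated stop list, and all function names are assumptions made for illustration.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Sample text from the gist; \u00f6 is the "ö" in "bröwn".
SAMPLE = 'the <b>quick</b> br\u00f6wn <img src="fox"/> &quot;jumped&quot;'
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # abbreviated, illustrative


class _TextOnly(HTMLParser):
    """Collect text nodes only; tags are dropped and entities are decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Stand-in for the html_strip char filter."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def ascii_fold(token):
    """Stand-in for asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze_test_1(text):
    """html_strip, then a crude word tokenizer."""
    return re.findall(r"\w+", strip_html(text))


def analyze_test_2(text):
    """As test_1, then lowercase, stop-word removal and ASCII folding."""
    tokens = [t.lower() for t in analyze_test_1(text)]
    return [ascii_fold(t) for t in tokens if t not in STOP_WORDS]


print(analyze_test_1(SAMPLE))
print(analyze_test_2(SAMPLE))
```

The point to check is that neither "img" nor "fox" shows up as a token once the tags have been stripped.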

Are you sure that you added your mapping with the message_content
analyzer before you indexed all of your docs?

Can you create a curl recreation of the issue that you are seeing?
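As a starting point, a recreation could follow the order sketched below: create the index with the analyzer, add the mapping, and only then index a document and search. The script just prints curl commands rather than sending anything. The sample document is hypothetical, the host and port are taken from the gist, `read_ahead` is left out for brevity, and the paths assume the index "messages" and type "message" from your post.

```python
import json
import shlex

BASE = "http://127.0.0.1:9200"  # host and port as used in the gist


def recreation_steps():
    """Return (method, path, body) tuples in the order they should be run."""
    settings = {"index": {"analysis": {"analyzer": {"message_content": {
        "tokenizer": "standard",
        "char_filter": ["html_strip"],
    }}}}}
    mapping = {"message": {
        "_source": {"enabled": False},
        "properties": {
            "body": {"type": "string", "index_analyzer": "message_content"},
        },
    }}
    doc = {  # hypothetical sample document
        "owner_id": "4d07646affc84a6717000011",
        "body": 'hello <img src="x.png"/> world',
    }
    search = {"query": {"filtered": {
        "filter": {"term": {"owner_id": doc["owner_id"]}},
        "query": {"query_string": {"query": "img*"}},
    }}}
    return [
        ("PUT", "/messages/", settings),                 # 1. index + analyzer
        ("PUT", "/messages/message/_mapping", mapping),  # 2. mapping
        ("PUT", "/messages/message/1", doc),             # 3. index a document
        ("GET", "/messages/message/_search", search),    # 4. run the query
    ]


def as_curl(method, path, body):
    """Render one step as a shell-safe curl command line."""
    return "curl -X%s %s -d %s" % (
        method, shlex.quote(BASE + path), shlex.quote(json.dumps(body)))


for step in recreation_steps():
    print(as_curl(*step))
```

If the mapping step runs after documents have already been indexed, those documents will have been analyzed without html_strip.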

thanks

Clint

Clinton_Gormley · January 15, 2011, 1:11pm

Hiya

On Fri, 2011-01-14 at 14:36 -0800, searchersteve wrote:

I'm afraid I'm not writing with a solution but a related question...

My reading of the ES manual suggests that the analyzer will run html_strip
by default -- in other words, there's no need to declare it in the
configuration. But am I correct? See below:

No, it is available by default, but not enabled by default.

See https://gist.github.com/clintongormley/780895 for an example of how to use it.

clint
