Strip_html

Hi!

I'm using html_strip to strip all HTML tags from some fields. But when I
search for messages with "img" in the body, ES also finds messages that
contain an <img> tag.
What am I doing wrong?

There is the query:

    {'query': {
        'filtered': {
            'filter': {'term': {'owner_id': '4d07646affc84a6717000011'}},
            'query': {'query_string': {'query': u'img*'}}
        }
    }}

Here is part of my ES config:

    index:
        analysis:
            analyzer:
                message_content:
                    tokenizer: standard
                    char_filter: [html_strip]
                    read_ahead: 1024

Part of the mapping of index "messages":

    "message": {
        "_source": {"enabled": false},
        "properties": {
            "body": {
                "type": "string",
                "index_analyzer": "message_content"
            }
        }
    }

--
Andrew Degtiariov
DA-RIPE

I'm afraid I'm not writing with a solution but a related question...

My reading of the ES manual suggests that the analyzer will run html_strip by default -- in other words, there's no need to declare it in the configuration. But am I correct? See below:

http://www.elasticsearch.com/docs/elasticsearch/index_modules/analysis/charfilter/

Hi Andrew

I'm using html_strip to strip all HTML tags from some fields. But
when I search for messages with "img" in the body, ES also finds
messages that contain an <img> tag.
What am I doing wrong?

I created a small HTML strip test here

This creates an index with two custom analyzers:

    test_1: {
        "tokenizer": "standard",
        "char_filter": ["html_strip"]
    }

    test_2: {
        "tokenizer": "standard",
        "char_filter": ["html_strip"],
        "filter": ["standard", "lowercase", "stop", "asciifolding"]
    }

Then I use the analyze call to compare the results of the standard,
test_1 and test_2 analyzers when indexing the text:

"the quick bröwn "jumped""

The results show that this is working correctly.
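
To make the expected behaviour concrete, here is a rough pure-Python analogy of what stripping markup before tokenizing should do. It is only a sketch: it does not call ES and does not reproduce the html_strip char filter or the standard tokenizer. The names strip_html and tokenize and the sample text are made up for illustration.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Hypothetical stand-in for a char filter that removes markup."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Very loose approximation of a word tokenizer plus lowercasing."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Hypothetical sample document, not the text used in the real test.
sample = '<p>the <b>quick</b> fox <img src="fox.png"/> jumped</p>'

without_strip = tokenize(sample)            # tag names become tokens
with_strip = tokenize(strip_html(sample))   # only the visible text is tokenized

print(without_strip)
print(with_strip)
```

Without the stripping step, tag and attribute names such as "img" and "src" end up as searchable tokens. With it, only the visible words remain, which is the outcome the analyze call should confirm for a field that really uses the custom analyzer.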

Are you sure that you added your mapping with the message_content
analyzer before you indexed all of your docs?

Can you create a curl recreation of the issue that you are seeing?
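
A recreation could look roughly like the sketch below, which only prints curl commands and sends nothing. The analyzer, mapping and query are taken from the original post. The host and port, the document id, the sample body and the exact REST paths are assumptions and may need adjusting for your ES version.

```python
import json
import shlex

BASE = "http://localhost:9200"  # assumption: a local node on the default port
INDEX = "messages"


def curl(method, path, body=None):
    """Format one curl command as a string; nothing is sent over the network."""
    cmd = ["curl", "-X" + method, BASE + path]
    if body is not None:
        cmd += ["-d", json.dumps(body)]
    return " ".join(shlex.quote(part) for part in cmd)


# Analyzer and mapping from the original post, supplied when the index is created
# so that they are in place before any document is indexed.
create_index = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "message_content": {
                        "tokenizer": "standard",
                        "char_filter": ["html_strip"],
                    }
                }
            }
        }
    },
    "mappings": {
        "message": {
            "_source": {"enabled": False},
            "properties": {
                "body": {"type": "string", "index_analyzer": "message_content"}
            },
        }
    },
}

# Hypothetical document: "img" appears only as a tag, never as visible text.
doc = {
    "owner_id": "4d07646affc84a6717000011",
    "body": '<p>hello <img src="x.png"/> world</p>',
}

# The query from the original post.
search = {
    "query": {
        "filtered": {
            "filter": {"term": {"owner_id": "4d07646affc84a6717000011"}},
            "query": {"query_string": {"query": "img*"}},
        }
    }
}

steps = [
    curl("PUT", "/" + INDEX, create_index),           # 1. index + analyzer + mapping
    curl("PUT", "/" + INDEX + "/message/1", doc),     # 2. index one document
    curl("POST", "/" + INDEX + "/_refresh"),          # 3. make it searchable
    curl("GET", "/" + INDEX + "/message/_search", search),  # 4. run the query
]

print("\n".join(steps))
```

The order matters here: the mapping goes in with the index creation, before the document is indexed, so the question of whether the analyzer was applied at index time is answered by the recreation itself.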

thanks

Clint

Hiya

On Fri, 2011-01-14 at 14:36 -0800, searchersteve wrote:

I'm afraid I'm not writing with a solution but a related question...

My reading of the ES manual suggests that the analyzer will run html_strip
by default -- in other words, there's no need to declare it in the
configuration. But am I correct? See below:

No, it is available by default, but not enabled by default.

See the "HTML Strip charfilter test for ElasticSearch" gist on GitHub for an example of how to use it.

clint