Hi!
I'm using html_strip to strip all HTML tags from some fields. But when I
search for messages with "img" in the body, ES also finds messages that
contain an <img> tag.
What am I doing wrong?
Here is the query:
{'query': {
    'filtered': {
        'filter': {'term': {'owner_id': '4d07646affc84a6717000011'}},
        'query': {'query_string': {'query': u'img*'}}
    }
}}
Here is part of my ES config:
index:
  analysis:
    analyzer:
      message_content:
        tokenizer: standard
        char_filter: [html_strip]
        read_ahead: 1024
Part of the mapping of index "messages":
"message": {
    "_source" : {"enabled": false},
    "properties" : {
        "body": {
            "type": "string",
            "index_analyzer": "message_content"
        }
    }
}
--
Andrew Degtiariov
DA-RIPE
I'm afraid I'm not writing with a solution but a related question...
My reading of the ES manual suggests that the analyzer will run html_strip by default -- in other words, there's no need to declare it in the configuration. But am I correct? See below:
http://www.elasticsearch.com/docs/elasticsearch/index_modules/analysis/charfilter/
Hi Andrew
> I'm using html_strip to strip all HTML tags from some fields. But
> when I search for messages with "img" in the body, ES also finds
> messages that contain an <img> tag.
> What am I doing wrong?
I created a small HTML strip test as a GitHub gist ("HTML Strip charfilter test for ElasticSearch").
This creates an index with two custom analyzers:
test_1: {
    "tokenizer" : "standard",
    "char_filter" : ["html_strip"]
}
test_2: {
    "tokenizer" : "standard",
    "char_filter" : ["html_strip"],
    "filter" : ["standard","lowercase","stop","asciifolding"]
}
Then I use the analyze call to compare the results of the standard,
test_1 and test_2 analyzers when indexing the text:
"the quick bröwn "jumped""
The results show that this is working correctly.
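To illustrate what the comparison is checking, here is a rough sketch in plain Python. It does not call ElasticSearch and is not how html_strip is implemented; it only imitates the idea with the standard library's HTMLParser and a crude regex tokenizer. The sample text and the helper names (TagStripper, strip_html, tokenize) are made up for this example.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only text content, dropping tags and their attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    # Imitates a char filter: runs on the raw text before tokenizing.
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()
    return " ".join(stripper.parts)


def tokenize(text):
    # Crude stand-in for a tokenizer plus lowercase filter (assumption,
    # not the real standard tokenizer): lowercased runs of word characters.
    return [token.lower() for token in re.findall(r"\w+", text)]


# Hypothetical message body, not taken from the thread.
sample = '<p>Hello <img src="cat.png" alt="x"> world</p>'

without_filter = tokenize(sample)
with_filter = tokenize(strip_html(sample))

print("without char filter:", without_filter)
print("with char filter:   ", with_filter)
```

Without the stripping step, the tag name "img" ends up as a token, so a query such as img* would match it. With the stripping step, only the visible text is tokenized.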
Are you sure that you added your mapping with the message_content
analyzer before you indexed all of your docs?
Can you create a curl recreation of the issue that you are seeing?
thanks
Clint
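For anyone putting together such a recreation, the sketch below prints a sequence of curl commands in the order that matters here: analyzer settings first, then the mapping, and only then the documents. It is a guess at a minimal recreation, not the poster's actual setup. The host, the test document and the to_curl helper are assumptions, and the endpoint paths follow the REST API of ElasticSearch releases from that period as far as I recall, so check them against your version.

```python
import json

BASE_URL = "http://localhost:9200"  # assumed local node
INDEX, DOC_TYPE = "messages", "message"  # names from the original post

# Each step is (HTTP method, path, JSON body or None), in the order it must run.
STEPS = [
    # 1. Create the index with the custom analyzer defined in its settings.
    ("PUT", f"/{INDEX}", {"settings": {"analysis": {"analyzer": {
        "message_content": {"tokenizer": "standard", "char_filter": ["html_strip"]}
    }}}}),
    # 2. Register the mapping BEFORE any document is indexed.
    ("PUT", f"/{INDEX}/{DOC_TYPE}/_mapping", {DOC_TYPE: {"properties": {
        "body": {"type": "string", "index_analyzer": "message_content"}
    }}}),
    # 3. Index a hypothetical test document containing an <img> tag.
    ("PUT", f"/{INDEX}/{DOC_TYPE}/1", {"body": '<p>hello <img src="cat.png"> world</p>'}),
    # 4. Refresh so the document is visible to search.
    ("POST", f"/{INDEX}/_refresh", None),
    # 5. Run the query from the original post (without the owner_id filter).
    ("POST", f"/{INDEX}/_search", {"query": {"query_string": {"query": "img*"}}}),
]


def to_curl(method, path, body):
    """Render one step as a curl command line."""
    cmd = f"curl -X{method} '{BASE_URL}{path}'"
    if body is not None:
        cmd += f" -d '{json.dumps(body)}'"
    return cmd


for step in STEPS:
    print(to_curl(*step))
```

If documents were indexed before step 2, they would have been analyzed without html_strip and would need to be reindexed.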
Hiya
On Fri, 2011-01-14 at 14:36 -0800, searchersteve wrote:
> I'm afraid I'm not writing with a solution but a related question...
> My reading of the ES manual suggests that the analyzer will run html_strip
> by default -- in other words, there's no need to declare it in the
> configuration. But am I correct? See below:
No, it is available by default, but not enabled by default.
See the "HTML Strip charfilter test for ElasticSearch" gist on GitHub for an example of how to use it.
clint