Problem with standard_html_strip

M11 · June 24, 2012, 3:44am

Hi,

I'd like to index words from HTML source. I'm having the same problem as
the one described here ( https://gist.github.com/1478233 ):
Elasticsearch appears to index HTML tags along with the words.

(copied from the URL)
$ curl -XPUT http://localhost:9200/foo

{"ok":true,"acknowledged":true}

$ curl -XPUT http://localhost:9200/foo/bar/_mapping -d '{
  "properties" : {
    "body": {"type":"string", "analyzer":"standard_html_strip"}
  }
}'

{"ok":true,"acknowledged":true}

$ curl -XPUT http://localhost:9200/foo/bar/1 -d '{
"body": "

This is a test.

"
}'

{"ok":true,"_index":"foo","_type":"bar","_id":"1","_version":1}

$ curl -XGET http://localhost:9200/foo/bar/_search?q=body:strong

{"took":35,"timed_out":false,"_shards":{"total":5,"successful":
5,"failed":0},"hits":{"total":1,"max_score":0.18985549,"hits":
[{"_index":"foo","_type":"bar","_id":"1","_score":0.18985549,
"_source" : {
"body": "

This is a test.

"
}}]}}

I assume that the last query should not yield anything because
"strong" is a tag name. Am I doing something wrong or is it just a
Lucene bug?

spinscale · June 24, 2012, 8:32am

Hi,

There are two problems with your configuration. You are treating the
html_strip char filter as an analyzer, which does not work, and you are
putting the mapping incorrectly.

Put this in your elasticsearch.yml:

index:
  analysis:
    analyzer:
      default:
        type: standard
      strip_html_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard]
        char_filter: html_strip

Then set the mapping correctly (you need to add the type name as well as
set the right analyzer):

curl -XPUT http://localhost:9200/foo/bar/_mapping -d '{
  "bar": {
    "properties" : {
      "body": {"type":"string", "analyzer":"strip_html_analyzer" }
    }
  }
}'

If you check the mapping with a GET on the above URL, you will also see that
it is empty with your current configuration, so the mapping was not applied
(and no error message came back, which is not too nice).

Now index your document again and search for "strong" and "test";
it should work.

--Alexander

M11 · June 24, 2012, 6:20pm

Hi Alex,

html_strip is a char filter; I was referring to standard_html_strip, which
I believe is an analyzer that uses html_strip.

Anyway, I tried your suggestion.

$ curl -XPUT http://localhost:9200/foo/bar/_mapping -d '{
  "bar": {
    "properties" : {
      "body": {"type":"string", "analyzer":"strip_html_analyzer" }
    }
  }
}'
{"ok":true,"acknowledged":true}

$ curl -XPUT localhost:9200/foo/bar/1 -d '{

"body" : "
hello world
there it is

"
}
'
{"ok":true,"_index":"foo","_type":"bar","_id":"1","_version":1}

$ curl -XGET localhost:9200/foo/bar/_search?q=color
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.06780553,"hits":[{"_index":"foo","_type":"bar","_id":"1","_score":0.06780553,
"_source" : {
"body" : "

hello world

there it is

"
}
}]}}

Calling the _analyze handler directly seems to work fine.

$ curl -XPOST 'localhost:9200/foo/_analyze?analyzer=strip_html_analyzer' -d '

hello world

'
{"tokens":[{"token":"hello","start_offset":17,"end_offset":22,"type":"","position":1},{"token":"world","start_offset":23,"end_offset":28,"type":"","position":2}]}

The list of tokens does not include HTML tags.

Am I still doing something wrong?

Thanks,


Igor_Motov · June 25, 2012, 7:55pm

Your search query is searching the _all field
( http://www.elasticsearch.org/guide/reference/mapping/all-field.html ),
which includes the content of all fields but is indexed using the default
analyzer. To search only the body field and use the body-specific analyzer,
you need to specify this field in your query:

curl -XGET "localhost:9200/foo/bar/_search?q=body:color"
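
To make the difference concrete, here is a toy model in Python of why q=color matches but q=body:color should not. It is only a sketch and not how Elasticsearch or Lucene is implemented: the regex tag stripper, the simple word tokenizer, the dictionary "index", and the sample markup are all assumptions made up for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # crude stand-in for the html_strip char filter
WORD_RE = re.compile(r"[A-Za-z0-9]+")    # crude stand-in for the standard tokenizer


def tokenize(text):
    """Split text into alphanumeric words (a simplification)."""
    return WORD_RE.findall(text)


def default_analyzer(text):
    # No char filter: tag names and attribute names survive as tokens.
    return tokenize(text)


def strip_html_analyzer(text):
    # Tags are removed before the tokenizer ever sees the text.
    return tokenize(TAG_RE.sub(" ", text))


# Hypothetical document; the markup is an assumption, not taken from the thread.
doc = {"body": '<font color="red">hello world</font>'}

# Toy index: each field holds the terms produced by the analyzer used for it.
index = {
    "body": set(strip_html_analyzer(doc["body"])),
    "_all": set(default_analyzer(" ".join(doc.values()))),
}


def matches(term, field="_all"):
    """q=term looks in _all; q=field:term looks in the named field."""
    return term in index[field]


print(matches("color"))           # q=color       -> True
print(matches("color", "body"))   # q=body:color  -> False
print(matches("hello", "body"))   # q=body:hello  -> True
```

In this model the body field only ever contains "hello" and "world", while _all still carries "font", "color" and "red" because it was analyzed without the char filter. That mirrors the behaviour seen above: _analyze with strip_html_analyzer looks right, yet the unqualified query still finds the tag text.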

