Problem with standard_html_strip


(M) #1

Hi,

I'd like to index words from HTML source. I'm having the same problem as
the one described here ( https://gist.github.com/1478233 ).
It appears that ElasticSearch indexes HTML tags along with words.

(copied from the URL)
$ curl -XPUT http://localhost:9200/foo

{"ok":true,"acknowledged":true}

$ curl -XPUT http://localhost:9200/foo/bar/_mapping -d '{
"properties" : {
"body": {"type":"string", "analyzer":"standard_html_strip"}
}
}'

{"ok":true,"acknowledged":true}

$ curl -XPUT http://localhost:9200/foo/bar/1 -d '{
"body": "

This is a test.

"
}'

{"ok":true,"_index":"foo","_type":"bar","_id":"1","_version":1}

$ curl -XGET http://localhost:9200/foo/bar/_search?q=body:strong

{"took":35,"timed_out":false,"_shards":{"total":5,"successful":
5,"failed":0},"hits":{"total":1,"max_score":0.18985549,"hits":
[{"_index":"foo","_type":"bar","_id":"1","_score":0.18985549,
"_source" : {
"body": "

This is a test.

"
}}]}}

I assume that the last query should not yield anything because
"strong" is a tag name. Am I doing something wrong or is it just a
Lucene bug?


(Alexander Reelsen) #2

Hi

There are two problems in your configuration. You are treating the html_strip
filter as an analyzer, which does not work, and the mapping request you are
sending is malformed.

Put this in your elasticsearch.yml:

index:
  analysis:
    analyzer:
      default:
        type: standard
      strip_html_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard]
        char_filter: html_strip
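
To make the order of the stages in a custom analyzer concrete, here is a minimal Python sketch of the chain (char filter, then tokenizer, then token filters). It is a toy model, not Lucene: the regexes, the function names, the lowercase filter and the sample HTML string are all assumptions made for illustration.

```python
import re

# Toy stand-ins for the stages of a custom analyzer.
# These regexes are assumptions for illustration, not Lucene's real rules.
TAG_RE = re.compile(r"<[^>]*>")
WORD_RE = re.compile(r"\w+")


def html_strip(text):
    """Char filter: rewrites the raw string before it is tokenized."""
    return TAG_RE.sub(" ", text)


def standard_tokenizer(text):
    """Tokenizer: splits the (already filtered) string into word tokens."""
    return WORD_RE.findall(text)


def lowercase(tokens):
    """Token filter: runs on the token list after tokenizing."""
    return [t.lower() for t in tokens]


def analyze(text, char_filters=(), token_filters=()):
    """Apply char filters, then the tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = standard_tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Hypothetical input; the markup of the original document is not known.
html = "<strong>This is a test.</strong>"
with_strip = analyze(html, char_filters=[html_strip], token_filters=[lowercase])
without_strip = analyze(html, token_filters=[lowercase])
print(with_strip)     # ['this', 'is', 'a', 'test']
print(without_strip)  # ['strong', 'this', 'is', 'a', 'test', 'strong']
```

Without the char filter the tag name ends up as an ordinary token, which is why a search for "strong" can match.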

Then correct the mapping (you need to add the type name as well as set the
right analyzer):

curl -XPUT http://localhost:9200/foo/bar/_mapping -d '{
"bar":
{ "properties" : {
"body": {"type":"string", "analyzer":"strip_html_analyzer" }
}
}
}'

If you check the mapping with a GET on the above URL, you will also see that
it is empty with your current configuration, so the mapping was not applied
(and no error message came back, which is not too nice).
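
As a small illustration of the nesting used in the corrected request, the following Python sketch builds the same body with the properties placed under the type name. The helper name build_mapping is hypothetical; only the resulting JSON shape comes from the curl command above.

```python
import json


def build_mapping(type_name, field, analyzer):
    """Return a put-mapping body with properties nested under the type name.

    Hypothetical helper for illustration; it only builds the JSON body
    and does not talk to a server.
    """
    return {
        type_name: {
            "properties": {
                field: {"type": "string", "analyzer": analyzer}
            }
        }
    }


body = build_mapping("bar", "body", "strip_html_analyzer")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the -d payload of the PUT request to foo/bar/_mapping.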

Now index your document again and start searching for "strong" and "test"
and it should work.

--Alexander


(M) #3

Hi Alex,

html_strip is a char filter; I was referring to standard_html_strip, which
I believe is an analyzer that uses html_strip.

Anyway, I tried your suggestion.

$ curl -XPUT http://localhost:9200/foo/bar/_mapping -d '{
"bar":
{ "properties" : {
"body": {"type":"string", "analyzer":"strip_html_analyzer" }
}
}
}'
{"ok":true,"acknowledged":true}

$ curl -XPUT localhost:9200/foo/bar/1 -d '{

"body" : "

hello world

there it is

"
}
'
{"ok":true,"_index":"foo","_type":"bar","_id":"1","_version":1}

$ curl -XGET localhost:9200/foo/bar/_search?q=color
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.06780553,"hits":[{"_index":"foo","_type":"bar","_id":"1","_score":0.06780553,
"_source" : {
"body" : "

hello world

there it is

"
}
}]}}

Calling the _analyze handler directly seems to work fine.

$ curl -XPOST localhost:9200/foo/_analyze?analyzer=strip_html_analyzer -d '

hello world
'
{"tokens":[{"token":"hello","start_offset":17,"end_offset":22,"type":"","position":1},{"token":"world","start_offset":23,"end_offset":28,"type":"","position":2}]}

The list of tokens does not include HTML tags.

Am I still doing something wrong?

Thanks,



(Igor Motov) #4

Your search query is searching the _all field
(http://www.elasticsearch.org/guide/reference/mapping/all-field.html),
which includes the content of all fields but is indexed using the default
analyzer. To search only the body field and use the body-specific analyzer,
you need to specify this field in your query:

curl -XGET "localhost:9200/foo/bar/_search?q=body:color"
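
The difference between the two queries can be sketched with a toy inverted index in Python. This is only a model under stated assumptions: ToyIndex, the regex-based analyzers and the sample HTML are made up for illustration, and the real _all handling in ElasticSearch is more involved.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")
WORD_RE = re.compile(r"\w+")


def default_analyzer(text):
    # No char filter: tag and attribute names become terms too.
    return {t.lower() for t in WORD_RE.findall(text)}


def strip_html_analyzer(text):
    # Strip tags first, then analyze what is left.
    return default_analyzer(TAG_RE.sub(" ", text))


class ToyIndex:
    """Hypothetical in-memory index keyed by (field, term)."""

    def __init__(self, field_analyzers):
        self.field_analyzers = field_analyzers
        self.postings = {}  # (field, term) -> set of doc ids

    def index(self, doc_id, doc):
        for field, value in doc.items():
            analyzer = self.field_analyzers.get(field, default_analyzer)
            for term in analyzer(value):
                self.postings.setdefault((field, term), set()).add(doc_id)
            # The catch-all field always uses the default analyzer.
            for term in default_analyzer(value):
                self.postings.setdefault(("_all", term), set()).add(doc_id)

    def search(self, term, field="_all"):
        return sorted(self.postings.get((field, term.lower()), set()))


idx = ToyIndex({"body": strip_html_analyzer})
# Hypothetical document; the markup of the original one is not known.
idx.index("1", {"body": '<p style="color: red">hello world</p>'})
print(idx.search("color"))          # ['1']  (matches via _all)
print(idx.search("color", "body"))  # []     (tags stripped in body)
print(idx.search("hello", "body"))  # ['1']
```

In this model q=color corresponds to the first call and q=body:color to the second.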


