Simple question about html stripping

Evgeniy_Galkin · April 6, 2012, 3:24pm

Hello.
I'm just getting started with elasticsearch.
I have a custom analyzer with the following config (part of
elasticsearch.yml):

analyzer :
    russian_html :
        type : custom
        tokenizer : standard
        filter : [standard, lowercase, rus_stem]
        char_filter : [html_strip]
filter :
    rus_stem :
        type : stemmer
        name : russian

Search works fine, except for one thing.
The query {"query":{"text":{"body":"style"}}} returns results in
which the word "style" appears only inside an HTML tag, for example as an
attribute of a tag wrapping the text "some things". In that case the
keyword is an HTML attribute, not content. Is this normal behaviour, a
bug, or a mistake in my configuration?

kimchy · April 8, 2012, 6:18pm

It should be stripped. Can you test it with a sample doc using the
analyze API endpoint and see if it's stripped?
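
A minimal sketch of such a check in Python, assuming the index and field names that appear later in this thread (`comments`, `comment.body`) and a node on `localhost:9200`. The helper name `build_analyze_request` and the sample HTML are hypothetical. It only builds the request so the URL and body can be inspected before sending; it assumes the older form of the endpoint that takes the raw text as the request body, while newer versions expect a JSON body.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_analyze_request(base_url: str, index: str, field: str, text: str) -> Request:
    """Build (but do not send) an _analyze request for one field of an index."""
    query = urlencode({"field": field})
    url = f"{base_url.rstrip('/')}/{quote(index)}/_analyze?{query}"
    # Assumption: the text to analyze is sent as the raw request body.
    return Request(url, data=text.encode("utf-8"), method="GET")


# Hypothetical sample document: "style" occurs only as an attribute name.
req = build_analyze_request(
    "http://localhost:9200",
    "comments",
    "comment.body",
    '<span style="color: red;">some things</span>',
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. If html_strip works as intended, the returned tokens should come from "some things" only.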

Evgeniy_Galkin · April 9, 2012, 6:37am

Thank you for the hint about the analyze API.
I have probably found a bug.
The problem appears when an HTML tag contains an attribute whose name
starts with an underscore ('_').
Example:
$ curl -XGET 'http://localhost:9200/comments/_analyze?field=comment.body' \
    -d '<span style="color: rgb(0, 0, 0); background-color: transparent;" sometag="someval">'
{"tokens":[]}

$ curl -XGET 'http://localhost:9200/comments/_analyze?field=comment.body' \
    -d '<span style="color: rgb(0, 0, 0); background-color: transparent;" _sometag="someval">'
{"tokens":[
{"token":"span","start_offset":1,"end_offset":5,"type":"","position":1},
{"token":"style","start_offset":6,"end_offset":11,"type":"","position":2},
{"token":"color","start_offset":13,"end_offset":18,"type":"","position":3},
{"token":"rgb","start_offset":20,"end_offset":23,"type":"","position":4},
{"token":"0","start_offset":24,"end_offset":25,"type":"","position":5},
{"token":"0","start_offset":27,"end_offset":28,"type":"","position":6},
{"token":"0","start_offset":30,"end_offset":31,"type":"","position":7},
{"token":"background","start_offset":34,"end_offset":44,"type":"","position":8},
{"token":"color","start_offset":45,"end_offset":50,"type":"","position":9},
{"token":"transparent","start_offset":52,"end_offset":63,"type":"","position":10},
{"token":"_sometag","start_offset":66,"end_offset":74,"type":"","position":11},
{"token":"someval","start_offset":76,"end_offset":83,"type":"","position":12}]}

The underscore in '_sometag' is the only difference.

So, should I create a bug report on GitHub?
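
For repeating this comparison on other inputs, a small Python sketch can flag markup names that survive analysis. The helper name `leaked_markup_tokens` is hypothetical, and the two sample responses are abbreviated from the ones above (only three tokens kept, and the empty token list written as valid JSON).

```python
import json


def leaked_markup_tokens(analyze_response: str, markup_names: set[str]) -> list[str]:
    """Return tokens from an _analyze response that match known tag or attribute names."""
    tokens = [t["token"] for t in json.loads(analyze_response).get("tokens", [])]
    return [tok for tok in tokens if tok in markup_names]


# Abbreviated responses: the first for sometag, the second for _sometag.
clean = '{"tokens":[]}'
leaky = (
    '{"tokens":[{"token":"span","position":1},'
    '{"token":"style","position":2},'
    '{"token":"_sometag","position":11}]}'
)

names = {"span", "style", "_sometag"}
print(leaked_markup_tokens(clean, names))  # nothing leaked
print(leaked_markup_tokens(leaky, names))  # tag and attribute names leaked
```

An empty result means the tag was stripped as expected; any returned token is a tag or attribute name that reached the index.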

kimchy · April 11, 2012, 9:28am

The html stripping is part of Lucene, so the bug is probably there. I can
try and chase it and open a recreation on the Lucene level.
