Simple question about html stripping


(Evgeniy Galkin) #1

Hello.
I'm just starting to use Elasticsearch.
I have a custom analyzer with the following config (part of
elasticsearch.yml):

analyzer :
    russian_html :
        type : custom
        tokenizer : standard
        filter : [standard, lowercase, rus_stem]
        char_filter : [html_strip]
filter :
    rus_stem :
        type : stemmer
        name : russian

Search works fine, except for one thing.
The query {"query":{"text":{"body":"style"}}} returns results in
which the word "style" appears only inside the markup of, for example,

some
things

even though in this case the keyword is an HTML attribute. Is
this normal behaviour, or is it a bug or a mistake in my configuration?

(Shay Banon) #2

It should be stripped. Can you test it with a sample doc using the
analyze API endpoint? See if it's stripped.
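
The check suggested here could be scripted roughly as follows. This is a minimal sketch, not an official client: it assumes a node on localhost:9200 and the `/{index}/_analyze?field=...` request form that appears later in this thread (other Elasticsearch versions may expect a different request shape). The index name `comments`, the field `comment.body`, and the helper names are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_analyze_url(base_url, index, field):
    """Build an analyze URL of the form /{index}/_analyze?field=..."""
    query = urllib.parse.urlencode({"field": field})
    return f"{base_url.rstrip('/')}/{index}/_analyze?{query}"


def token_texts(response_body):
    """Return the token strings from an analyze response body (JSON text)."""
    return [t["token"] for t in json.loads(response_body).get("tokens", [])]


def analyze(text, base_url="http://localhost:9200",
            index="comments", field="comment.body"):
    """Send `text` to the analyze endpoint and return the tokens it produces.

    Needs a running node. The defaults are assumptions, and POST is used
    here on the assumption that the endpoint accepts it like the GET
    requests shown in this thread.
    """
    request = urllib.request.Request(
        build_analyze_url(base_url, index, field),
        data=text.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return token_texts(response.read().decode("utf-8"))
```

If the char filter works, calling `analyze()` on a string that contains only a tag should return an empty list.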

(Evgeniy Galkin) #3

Thank you for the hint about the analyze API.
I have probably found a bug.
The problem appears when an HTML tag contains an attribute whose name starts
with an underscore ('_').
Example:
$ curl -XGET 'http://localhost:9200/comments/_analyze?field=comment.body' -d '<span style="color: rgb(0, 0, 0); background-color: transparent;" sometag="someval">'
{"tokens":[]}

$ curl -XGET 'http://localhost:9200/comments/_analyze?field=comment.body' -d '<span style="color: rgb(0, 0, 0); background-color: transparent;" _sometag="someval">'
{"tokens":[
{"token":"span","start_offset":1,"end_offset":5,"type":"","position":1},
{"token":"style","start_offset":6,"end_offset":11,"type":"","position":2},
{"token":"color","start_offset":13,"end_offset":18,"type":"","position":3},
{"token":"rgb","start_offset":20,"end_offset":23,"type":"","position":4},
{"token":"0","start_offset":24,"end_offset":25,"type":"","position":5},
{"token":"0","start_offset":27,"end_offset":28,"type":"","position":6},
{"token":"0","start_offset":30,"end_offset":31,"type":"","position":7},
{"token":"background","start_offset":34,"end_offset":44,"type":"","position":8},
{"token":"color","start_offset":45,"end_offset":50,"type":"","position":9},
{"token":"transparent","start_offset":52,"end_offset":63,"type":"","position":10},
{"token":"_sometag","start_offset":66,"end_offset":74,"type":"","position":11},
{"token":"someval","start_offset":76,"end_offset":83,"type":"","position":12}]}

The underscore in '_sometag' is the only difference.

So, should I create a bug report on GitHub?
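
For a bug report, the comparison above could be turned into a small repeatable check. The sketch below is only an illustration: the function names are hypothetical, and the sample data is abbreviated from the two responses shown above (only four of the twelve leaked tokens are included).

```python
import json

# Abbreviated from the two analyze responses above. The first input
# (attribute "sometag") produced no tokens; the second ("_sometag")
# leaked tag and attribute names. Only a few of its tokens are kept here.
CLEAN_RESPONSE = '{"tokens":[]}'
LEAKY_RESPONSE = json.dumps({"tokens": [
    {"token": "span", "start_offset": 1, "end_offset": 5, "type": "", "position": 1},
    {"token": "style", "start_offset": 6, "end_offset": 11, "type": "", "position": 2},
    {"token": "_sometag", "start_offset": 66, "end_offset": 74, "type": "", "position": 11},
    {"token": "someval", "start_offset": 76, "end_offset": 83, "type": "", "position": 12},
]})


def leaked_tokens(response_body):
    """For an input that is markup only, every returned token is a leak."""
    return [t["token"] for t in json.loads(response_body)["tokens"]]


def markup_was_stripped(response_body):
    """True when a markup-only input produced no tokens at all."""
    return not leaked_tokens(response_body)


if __name__ == "__main__":
    for label, body in (("sometag", CLEAN_RESPONSE), ("_sometag", LEAKY_RESPONSE)):
        print(label, "stripped:", markup_was_stripped(body), leaked_tokens(body))
```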

(Shay Banon) #4

The HTML stripping is part of Lucene, so the bug is probably there. I can
try to chase it down and open a reproduction at the Lucene level.
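
Until the cause is tracked down, one possible stopgap is to strip markup on the client before indexing. Nobody in this thread proposed this. The sketch below is an assumption-laden illustration that uses Python's standard `html.parser` rather than anything from Elasticsearch or Lucene, and the names `_TextCollector` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text between tags; tags and attributes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

With this approach the indexed field no longer contains the original markup, so it only fits cases where the raw HTML is stored elsewhere or is not needed.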
