Help stripping HTML tags


(Greg Brown) #1

I've configured a default custom analyzer as follows:

index :
  analysis :
    filter :
      snowball :
        type : snowball
        language : English
      wd_filter :
        type : word_delimiter
        generate_word_parts : true
        generate_number_parts : true
        catenate_words : true
        split_on_case_change : true
        preserve_original : true
        split_on_numerics : true
    analyzer :
      default :
        type : custom
        tokenizer : uax_url_email
        filter : [lowercase, snowball, wd_filter]
        char_filter : [html_strip]

But when I index a doc into 'content' and then examine the indexed
document with:
curl -XGET 'http://localhost:9200/mgs/p2/385?fields=content&pretty'

I am still seeing all of the HTML tags in the text. Reading back the
_mapping for the "content" field, it is:
  "content" : {
    "store" : "yes",
    "type" : "string"
  },
'index' defaults to 'analyzed', so this appears correct.

Further complicating things: if I run:
curl -XGET 'http://localhost:9200/mgs/_analyze' -d '

this is
a tests

'

Then I get:
{"tokens":[
  {"token":"this","start_offset":3,"end_offset":7,"type":"<ALPHANUM>","position":1},
  {"token":"is","start_offset":11,"end_offset":13,"type":"<ALPHANUM>","position":2},
  {"token":"a","start_offset":18,"end_offset":19,"type":"<ALPHANUM>","position":3},
  {"token":"test","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}]}

Which appears correct as the tags have been stripped out by using the
default analyzer, and the word 'tests' has been stemmed to 'test'.

So the analyzer seems to be working correctly, except when I
actually add a document. I am using Elastica to add documents,
but I tried

curl -XPUT 'http://localhost:9200/mgs/p2/0' -d '{"content" : "

This is a tests

" }'

Then:

curl -XGET 'http://localhost:9200/mgs/p2/0?fields=content'
Gives:
{"_index":"mgs","_type":"p2","_id":"0","_version":2,"fields":
{"content":"

This is a tests

"}}

So adding the document does not seem to be causing the analyzer to be
run on the added document. Any ideas on what I am missing/doing
wrong?

Thanks
-Greg


(Shay Banon) #2

The stored version of a field stores the content as is, and not the analyzed form. The analysis process only controls how the text is broken down into terms and indexed.
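To make the distinction concrete, here is a small self-contained sketch in plain Python (not Elasticsearch code; the tokenizer is crudely approximated with a regex and stemming is omitted): analysis derives terms from the source text, while the stored field keeps the source verbatim, tags and all.

```python
import re
from html.parser import HTMLParser

class _Stripper(HTMLParser):
    """Collects only text nodes; tags are discarded (rough analogue of html_strip)."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def index_document(source):
    """Return (stored, terms): what a stored field keeps vs. what analysis produces."""
    stripper = _Stripper()
    stripper.feed(source)
    text = "".join(stripper.parts)
    # Crude stand-in for the tokenizer + lowercase filter chain.
    terms = re.findall(r"\w+", text.lower())
    return source, terms  # the stored field is the original source, verbatim

stored, terms = index_document("<p>This is a tests</p>")
print(stored)  # tags survive in the stored field
print(terms)   # tags are gone from the indexed terms
```

The analyzer chain (including html_strip) only shapes the second value; retrieving a stored field always gives back the first.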

On Thursday, June 2, 2011 at 2:32 AM, Greg B wrote:



(Greg Brown) #3

Hi Shay, thanks for the fast response.

Is there a way to store the version with the HTML stripped, or do I
need to implement my own stripping to remove the HTML tags? The tags
are a particular problem when doing highlighting in my search
results, where the extraneous tags rather thoroughly screw up my
results page.

Thanks
-Greg

On Jun 2, 1:42 am, Shay Banon shay.ba...@elasticsearch.com wrote:

The stored version of a field stores the content as is, and not the analyzed form. The analysis process only controls how the text is broken down into terms and indexed.



(Shay Banon) #4

In this case, you will need to do your own stripping.
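A minimal sketch of doing that stripping yourself with the Python standard library, before sending the document for indexing (the field name 'content' follows the thread; the rest is illustrative, not Elastica-specific):

```python
from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    """Keeps only text nodes; tags are dropped and character entities are decoded."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

def strip_tags(html):
    extractor = HTMLTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()

# Strip before indexing, so the stored field (and any highlight
# fragments built from it) come back tag-free:
doc = {"content": strip_tags("<p>This is a <b>tests</b></p>")}
print(doc["content"])  # -> This is a tests
```

Since html_strip only affects the analyzed terms, stripping client-side like this is what keeps the stored field itself clean; if the raw HTML is still needed, it can be kept in a second field.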

On Thursday, June 2, 2011 at 6:56 PM, Greg B wrote:



(Administrator-2) #5

Greg,

You have to do the stripping yourself.

  • Nick

-----Original Message-----
From: Greg B [mailto:gbrown5878@gmail.com]
Sent: Thursday, June 02, 2011 11:56 AM
To: users
Subject: SPAM-LOW: Re: Help stripping HTML tags



(Greg Brown) #6

OK, thanks.
-Greg

On Jun 2, 10:06 am, "Administrator" ad...@sf4answers.com wrote:


