Help stripping HTML tags


(Greg Brown) #1

I've configured a default custom analyzer as follows:

index :
  analysis :
    filter :
      snowball :
        type : snowball
        language : English
      wd_filter :
        type : word_delimiter
        generate_word_parts : true
        generate_number_parts : true
        catenate_words : true
        split_on_case_change : true
        preserve_original : true
        split_on_numerics : true
    analyzer :
      default :
        type : custom
        tokenizer : uax_url_email
        filter : [lowercase, snowball, wd_filter]
        char_filter : [html_strip]

But when I index a doc into 'content' and then examine the indexed
document with:
curl -XGET 'http://localhost:9200/mgs/p2/385?fields=content&pretty'

I am still seeing all of the HTML tags in the text. Reading back the
_mapping for the "content" field, it is:
  "content" : {
    "store" : "yes",
    "type" : "string"
  },
'index' defaults to 'analyzed', so this appears correct.

Further complicating things: if I run:
curl -XGET 'http://localhost:9200/mgs/_analyze' -d '

this is
a tests

'

Then I get:
{"tokens":[
  {"token":"this","start_offset":3,"end_offset":7,"type":"<ALPHANUM>","position":1},
  {"token":"is","start_offset":11,"end_offset":13,"type":"<ALPHANUM>","position":2},
  {"token":"a","start_offset":18,"end_offset":19,"type":"<ALPHANUM>","position":3},
  {"token":"test","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}]}

Which appears correct as the tags have been stripped out by using the
default analyzer, and the word 'tests' has been stemmed to 'test'.

So the analyzer seems to be working correctly, except when I
actually add a document. I am using Elastica to add documents,
but I tried

curl -XPUT 'http://localhost:9200/mgs/p2/0' -d '{"content" : "

This is a tests

" }'

Then:

curl -XGET 'http://localhost:9200/mgs/p2/0?fields=content'
Gives:
{"_index":"mgs","_type":"p2","_id":"0","_version":2,"fields":
{"content":"

This is a tests

"}}

So adding the document does not seem to be causing the analyzer to be
run on the added document. Any ideas on what I am missing/doing
wrong?

Thanks
-Greg


(Shay Banon) #2

The stored version of a field stores the content as is, and not the analyzed form. The analysis process only controls how the text is broken down into terms and indexed.
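To make the distinction concrete, here is a small self-contained sketch in plain Python (not Elasticsearch code; the tokenizer is crudely approximated with a regex and stemming is omitted): analysis derives terms from the source text, while the stored field keeps the source verbatim, tags and all.

```python
import re
from html.parser import HTMLParser

class _Stripper(HTMLParser):
    """Collects only text nodes; tags are discarded (rough analogue of html_strip)."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def index_document(source):
    """Return (stored, terms): what a stored field keeps vs. what analysis produces."""
    stripper = _Stripper()
    stripper.feed(source)
    text = "".join(stripper.parts)
    # Crude stand-in for the tokenizer + lowercase filter chain.
    terms = re.findall(r"\w+", text.lower())
    return source, terms  # the stored field is the original source, verbatim

stored, terms = index_document("<p>This is a tests</p>")
print(stored)  # tags survive in the stored field
print(terms)   # tags are gone from the indexed terms
```

The analyzer chain (including html_strip) only shapes the second value; retrieving a stored field always gives back the first.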

On Thursday, June 2, 2011 at 2:32 AM, Greg B wrote:



(Greg Brown) #3

Hi Shay, thanks for the fast response.

Is there a way to store the version with the HTML stripped, or do I
need to implement my own stripping to remove the HTML tags? The tags
are a particular problem when doing highlighting in my search
results, where the extraneous tags rather thoroughly screw up my
results page.

Thanks
-Greg

On Jun 2, 1:42 am, Shay Banon shay.ba...@elasticsearch.com wrote:

The stored version of a field stores the content as is, and not the analyzed form. The analysis process only controls how the text is broken down into terms and indexed.



(Shay Banon) #4

In this case, you will need to do your own stripping.
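A minimal sketch of doing that stripping yourself with the Python standard library, before sending the document for indexing (the field name 'content' follows the thread; the rest is illustrative, not Elastica-specific):

```python
from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    """Keeps only text nodes; tags are dropped and character entities are decoded."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

def strip_tags(html):
    extractor = HTMLTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()

# Strip before indexing, so the stored field (and any highlight
# fragments built from it) come back tag-free:
doc = {"content": strip_tags("<p>This is a <b>tests</b></p>")}
print(doc["content"])  # -> This is a tests
```

Since html_strip only affects the analyzed terms, stripping client-side like this is what keeps the stored field itself clean; if the raw HTML is still needed, it can be kept in a second field.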

On Thursday, June 2, 2011 at 6:56 PM, Greg B wrote:



(Administrator-2) #5

Greg,

You have to do the stripping yourself.

  • Nick

-----Original Message-----
From: Greg B [mailto:gbrown5878@gmail.com]
Sent: Thursday, June 02, 2011 11:56 AM
To: users
Subject: SPAM-LOW: Re: Help stripping HTML tags



(Greg Brown) #6

OK, thanks.
-Greg

On Jun 2, 10:06 am, "Administrator" ad...@sf4answers.com wrote:


