How to use standard_html_strip

Sam_3 · December 14, 2011, 8:30pm

Hi! I'm new to ElasticSearch and having trouble trying to get the
standard_html_strip analyzer to work.
As a simplified example of what I'm trying to do, take a look at

https://gist.github.com/sbrauer/1478233

$ curl -XPUT http://localhost:9200/foo

{"ok":true,"acknowledged":true}

$ curl -XPUT http://localhost:9200/foo/bar/_mapping -d '{
  "properties" : {
    "body": {"type":"string", "analyzer":"standard_html_strip"}
  }
}'

(The gist is truncated here; the full file is at the link above.)

In that example, I create a mapping with a single string field "body"
using the standard_html_strip analyzer. Then I index one document
with a "strong" tag in the body.
When I do a search for "strong", I expect that I'll get zero hits, but
instead I get a hit.

I'm using the latest stable release (0.18.5).
Thanks for any help or advice!

Sam

Lukas_Vlcek1 · December 14, 2011, 10:07pm

Hi,

Can you show how you defined the "standard_html_strip" analyzer? That would typically be found in the elasticsearch.yml file.

Something like:

index:
  analysis:
    analyzer:
      standard_html_strip:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop]
        char_filter: [html_strip]

The point is that AFAIK there is no "standard_html_strip" analyzer defined out of the box.

--
Regards,
Lukas
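
The analyzer definition suggested above can presumably also be supplied as per-index settings when the index is created, instead of in elasticsearch.yml. The following is a minimal sketch, assuming the body would be sent with a PUT to http://localhost:9200/foo (the index name from the gist). The helper name build_index_settings is hypothetical, and the block only builds and prints the JSON body; it does not contact a server.

```python
import json


def build_index_settings():
    """Return an index-creation body that defines the custom analyzer.

    Hypothetical helper: the analyzer name and its components are copied
    from the YAML suggestion in the post above; the surrounding
    "settings" wrapper is an assumption about the request format.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "standard_html_strip": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["standard", "lowercase", "stop"],
                        "char_filter": ["html_strip"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into a curl -d '...' argument.
    print(json.dumps(build_index_settings(), indent=2))
```

If this approach is used, the mapping from the gist would be applied after the index has been created with these settings, so that the "body" field can refer to the analyzer by name.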

Sam_3 · December 15, 2011, 6:30pm

Thanks, Lukas!
I didn't define the analyzer and haven't even touched
elasticsearch.yml.
I'll give that a try soon...

I was under the impression that standard_html_strip is one of the
analyzers included with elasticsearch. Here's the commit that added
it: add `standard_html_strip` analyzer that combines the standard analyze… · elastic/elasticsearch@1a18387 · GitHub

But I couldn't find any instructions on how to use it.

Sam

Sam_3 · December 19, 2011, 2:50pm

I tried adding the suggested configuration to elasticsearch.yml, but
it still seems that HTML isn't being stripped.
I'm stumped enough to give up on this approach for now and extract the
plain text from my HTML fields before indexing.
I still wish I knew how to make this work. I like the idea of
offloading the HTML stripping from my application.

kimchy · December 19, 2011, 6:37pm

Just to make sure: the HTML stripping provided as part of the analysis
process will not strip the tags from the _source; only the indexed
terms will have the HTML stripped.
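
The distinction between the stored _source and the indexed terms can be illustrated with a toy model. This is only a sketch in Python using the standard library's html.parser; it is not how Elasticsearch implements html_strip, and the function names strip_html and analyze are made up for the illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Toy stand-in for a character filter: drop tags, keep text."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def analyze(text):
    """Toy stand-in for an analyzer: strip HTML, lowercase, split on words."""
    return re.findall(r"\w+", strip_html(text).lower())


# The stored source is kept exactly as submitted, markup included.
source = "<p>This is <strong>important</strong> text</p>"
# Only the terms derived from it are affected by the stripping.
terms = analyze(source)

print(source)
print(terms)
```

In this model the string "strong" still appears in the stored source but not among the terms, which is the behavior the original question expected from a search for "strong".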

Sam_3 · December 20, 2011, 3:46am

Right. I'm not surprised to see the html markup in the stored values.
But I am surprised to get a search hit for "strong" when that string
only occurs as an element name.

ppearcy · December 20, 2011, 8:28am

I need to strip HTML before inserting into source and just hooked into
the filter directly in my app. I did this because I needed to
guarantee the highlight blurbs I returned contained valid HTML.
However, it is probably a relatively common case for users to want to
strip HTML before adding to source. Might be a good feature?

Thanks!
Paul
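
Stripping HTML in the application before indexing, as described above, could look roughly like the following. This is a sketch using Python's standard html.parser rather than the filter the poster actually hooked into; the names PlainTextExtractor and to_plain_text are hypothetical, and skipping script and style contents is an added assumption about what "plain text" should mean.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def to_plain_text(html):
    """Return the visible text of an HTML fragment."""
    extractor = PlainTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


# The document that would be sent for indexing carries only plain text.
doc = {"body": to_plain_text(
    "<p>Hello <strong>world</strong></p><script>var x = 1;</script>"
)}
print(doc)
```

With this approach both the stored source and the indexed terms are free of markup, at the cost of doing the work in the application.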

kimchy · December 20, 2011, 10:00am

Stripping before "adding" the source on the elasticsearch side might get
into the area where elasticsearch starts doing things that are not in its
realm, which complicates it.
