How to use standard_html_strip

Hi! I'm new to Elasticsearch and having trouble trying to get the
standard_html_strip analyzer to work.
As a simplified example of what I'm trying to do, take a look at
trying to use standard_html_strip in ElasticSearch · GitHub

In that example, I create a mapping with a single string field "body"
using the standard_html_strip analyzer. Then I index one document
with a "strong" tag in the body.
When I do a search for "strong", I expect that I'll get zero hits, but
instead I get a hit.

I'm using the latest stable release (0.18.5).
Thanks for any help or advice!

  • Sam

Hi,

Can you show how you defined the "standard_html_strip" analyzer? That would typically be found in the elasticsearch.yml file.

Something like:
index:
  analysis:
    analyzer:
      standard_html_strip:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop]
        char_filter: [html_strip]

The point is that AFAIK there is no "standard_html_strip" analyzer defined out of the box.
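
For anyone who would rather not edit elasticsearch.yml, the same analyzer definition can presumably be passed as index settings when the index is created. The sketch below only builds the JSON body and does not talk to a server. The index name "test" and type name "doc" are made up for illustration, and the overall body shape (a "settings" block plus a "mappings" block with a string field) is an assumption based on how index creation worked in releases of that era.

```python
import json

# Hypothetical names for illustration; the thread does not give them.
INDEX_NAME = "test"
TYPE_NAME = "doc"


def build_index_body():
    """Return a create-index body that defines the analyzer and maps "body" to it."""
    analysis = {
        "analyzer": {
            # Same definition as the YAML snippet above.
            "standard_html_strip": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "stop"],
                "char_filter": ["html_strip"],
            }
        }
    }
    mappings = {
        TYPE_NAME: {
            "properties": {
                # The field must name the analyzer explicitly, otherwise
                # the default analyzer is used and tags are not stripped.
                "body": {"type": "string", "analyzer": "standard_html_strip"}
            }
        }
    }
    return {"settings": {"analysis": analysis}, "mappings": mappings}


if __name__ == "__main__":
    # Print the body that would be sent with a PUT to /<INDEX_NAME>.
    print(json.dumps(build_index_body(), indent=2))
```

Note that analysis settings generally apply only to indices created after the settings are in place, so an existing index would likely need to be recreated.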

--
Regards,
Lukas

Thanks, Lukas!
I didn't define the analyzer and haven't even touched
elasticsearch.yml.
I'll give that a try soon...

I was under the impression that standard_html_strip is one of the
analyzers included with elasticsearch. Here's the commit that added
it: add `standard_html_strip` analyzer that combines the standard analyze… · elastic/elasticsearch@1a18387 · GitHub

But I couldn't find any instructions on how to use it.

  • Sam

I tried adding the suggested configuration to elasticsearch.yml, but
it still seems that HTML isn't being stripped.
I'm stumped enough to give up on this approach for now and extract the
plain text from my HTML fields before indexing.
I still wish I knew how to make this work. I like the idea of
offloading the HTML stripping from my application.
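
For reference, extracting the plain text in the application before indexing could look roughly like the sketch below. It uses only Python's standard html.parser module. The class and function names are made up for illustration, and it is deliberately simple: text inside script or style elements would be kept, and text on either side of a tag is joined without an inserted space.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of markup with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_html("This is <strong>important</strong> text"))
```

The stripped string is what would then be sent as the field value, so neither the stored document nor the indexed terms contain markup.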

Just to make sure: the HTML stripping provided as part of the analysis
process will not strip the markup from the _source; only the
indexed terms will have the HTML stripped.
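
To make the distinction concrete, here is a toy model of it in Python. This is not how Elasticsearch is implemented: the regex-based tag removal and tokenizer are naive stand-ins, and all names are invented for the illustration. It only shows that the stored source and the indexed terms are two separate things.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive stand-in for an HTML char filter
TOKEN_RE = re.compile(r"\w+")     # naive stand-in for a tokenizer


def analyze(text, strip_html):
    """Turn text into lowercase terms, optionally removing tags first."""
    if strip_html:
        text = TAG_RE.sub(" ", text)
    return [token.lower() for token in TOKEN_RE.findall(text)]


class ToyIndex:
    """Keeps the original document and a separate term -> ids map."""

    def __init__(self, strip_html):
        self.strip_html = strip_html
        self.sources = {}   # plays the role of _source
        self.postings = {}  # plays the role of the indexed terms

    def index(self, doc_id, body):
        self.sources[doc_id] = body  # stored exactly as received
        for term in analyze(body, self.strip_html):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


if __name__ == "__main__":
    doc = "some <strong>bold</strong> text"
    stripped = ToyIndex(strip_html=True)
    stripped.index(1, doc)
    print(stripped.search("strong"), stripped.search("bold"), stripped.sources[1])
```

In this model a hit for "strong" only occurs when the tags are not removed during analysis, which would be consistent with the field not actually using the stripping analyzer.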

Right. I'm not surprised to see the HTML markup in the stored values.
But I am surprised to get a search hit for "strong" when that string
only occurs as an element name.
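
One way to check which terms an analyzer really produces might be the analyze API. The sketch below only builds the HTTP request and does not send it. The host, the index name "test", and the request form (analyzer given as a query parameter, raw text as the body) are assumptions based on how the API was documented for early releases; newer versions expect a JSON body instead.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) an _analyze request for the given analyzer."""
    query = urlencode({"analyzer": analyzer})
    url = "%s/%s/_analyze?%s" % (base_url.rstrip("/"), quote(index), query)
    # The text to analyze travels as the raw request body.
    return Request(url, data=text.encode("utf-8"), method="POST")


if __name__ == "__main__":
    # Hypothetical host and index name, for illustration only.
    request = build_analyze_request(
        "http://localhost:9200", "test", "standard_html_strip",
        "some <strong>bold</strong> text",
    )
    print(request.get_method(), request.full_url)
    # To send it against a running node:
    #   from urllib.request import urlopen
    #   print(urlopen(request).read().decode("utf-8"))
```

If the response lists "strong" as a token, the analyzer is not stripping tags; if it does not, the field mapping is probably not using that analyzer.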

I need to strip HTML before inserting into source and just hooked into
the filter directly in my app. I did this because I needed to
guarantee the highlight blurbs I returned contained valid HTML.
However, it is probably a relatively common case for users to want to
strip HTML before adding to source. Might be a good feature?

Thanks!
Paul

Stripping before "adding" the source on the Elasticsearch side might get into
the area where Elasticsearch starts doing things that are not in its realm,
which complicates it.
