How to use standard_html_strip


(Sam-3) #1

Hi! I'm new to ElasticSearch and having trouble trying to get the
standard_html_strip analyzer to work.
As a simplified example of what I'm trying to do, take a look at
https://gist.github.com/1478233

In that example, I create a mapping with a single string field "body"
using the standard_html_strip analyzer. Then I index one document
with a "strong" tag in the body.
When I do a search for "strong", I expect that I'll get zero hits, but
instead I get a hit.

I'm using the latest stable release (0.18.5).
Thanks for any help or advice!

  • Sam

(Lukáš Vlček) #2

Hi,

Can you show how you defined the "standard_html_strip" analyzer? That would typically be found in the elasticsearch.yml file.

Something like:
index:
  analysis:
    analyzer:
      standard_html_strip:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop]
        char_filter: [html_strip]

The point is that AFAIK there is no "standard_html_strip" analyzer defined out of the box.
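
For readers who would rather define the analyzer per index than in elasticsearch.yml, the same definition can be sent as JSON when the index is created. The following is only a sketch under stated assumptions: the helper name build_index_body, the type name "doc", and the field name "body" are made up for illustration, and the "settings"/"mappings" layout follows the create-index request format of releases from this era. It only builds and prints the request body; nothing is sent to a server.

```python
import json


def build_index_body(analyzer_name="standard_html_strip", field="body", doc_type="doc"):
    """Build a create-index body that mirrors the YAML analyzer definition.

    The names passed in are illustrative assumptions, not values taken
    from the original gist.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["standard", "lowercase", "stop"],
                        # The char filter runs before the tokenizer and
                        # removes markup from the text to be indexed.
                        "char_filter": ["html_strip"],
                    }
                }
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    # "string" was the text field type in releases of this era.
                    field: {"type": "string", "analyzer": analyzer_name}
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```

The printed JSON could then be used as the body of an index-creation request. Note that an analyzer change only affects documents indexed after the index has been created with these settings.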

--
Regards,
Lukas

On Wednesday, December 14, 2011 at 9:30 PM, Sam wrote:

Hi! I'm new to ElasticSearch and having trouble trying to get the
standard_html_strip analyzer to work.
As a simplified example of what I'm trying to do, take a look at
https://gist.github.com/1478233

In that example, I create a mapping with a single string field "body"
using the standard_html_strip analyzer. Then I index one document
with a "strong" tag in the body.
When I do a search for "strong", I expect that I'll get zero hits, but
instead I get a hit.

I'm using the latest stable release (0.18.5).
Thanks for any help or advice!

  • Sam

(Sam-3) #3

Thanks, Lukas!
I didn't define the analyzer and haven't even touched
elasticsearch.yml.
I'll give that a try soon...

I was under the impression that standard_html_strip is one of the
analyzers included with elasticsearch. Here's the commit that added
it: https://github.com/elasticsearch/elasticsearch/commit/1a18387fabc129a86b2d17fa834e5642ee4f5a70

But I couldn't find any instructions on how to use it.

  • Sam

On Dec 14, 5:07 pm, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi,

Can you show how you defined the "standard_html_strip" analyzer? That would typically be found in the elasticsearch.yml file.

Something like:
index:
  analysis:
    analyzer:
      standard_html_strip:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop]
        char_filter: [html_strip]

The point is that AFAIK there is no "standard_html_strip" analyzer defined out of the box.

--
Regards,
Lukas

(Sam-3) #4

I tried adding the suggested configuration to elasticsearch.yml, but
it still seems that HTML isn't being stripped.
I'm stumped enough to give up on this approach for now and extract the
plain text from my HTML fields before indexing.
I still wish I knew how to make this work. I like the idea of
offloading the HTML stripping from my application.
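
The fallback described here, extracting plain text in the application before indexing, could look roughly like the sketch below. It assumes Python 3 and uses only the standard library's html.parser module; the strip_html helper and the decision to drop script and style content are illustrative choices, not something taken from this thread.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, scripts, and styles."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so "a<br>b" does not become "ab",
        # then collapse the whitespace left behind by removed tags.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(strip_html("<p>Some <strong>bold</strong> text &amp; more</p>"))
```

One trade-off of joining chunks with a space: inline markup inside a word, such as "<b>un</b>believable", is split into two words. Whether that matters depends on the content being indexed.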

On Dec 15, 1:30 pm, Sam sam.bra...@gmail.com wrote:

Thanks, Lukas!
I didn't define the analyzer and haven't even touched
elasticsearch.yml.
I'll give that a try soon...

I was under the impression that standard_html_strip is one of the
analyzers included with elasticsearch. Here's the commit that added
it: https://github.com/elasticsearch/elasticsearch/commit/1a18387fabc129a...

But I couldn't find any instructions on how to use it.

  • Sam

(Shay Banon) #5

Just to make sure: the HTML stripping provided as part of the analysis
process will not cause the tags to be stripped from the _source; only the
indexed terms will have the HTML stripped.
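
To illustrate the distinction, here is a toy model in Python. It is not Elasticsearch code: the regular expressions are crude stand-ins for the html_strip char filter and the standard tokenizer, and the ToyIndex class is invented for this example. It keeps the original text untouched (like _source) while building its term index from the analyzed text.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude stand-in for the html_strip char filter
TOKEN_RE = re.compile(r"\w+")    # crude stand-in for the standard tokenizer


def analyze(text, strip_html=True):
    """Return lowercase terms, optionally removing markup first."""
    if strip_html:
        text = TAG_RE.sub(" ", text)
    return [token.lower() for token in TOKEN_RE.findall(text)]


class ToyIndex:
    """Stores the source verbatim and indexes only the analyzed terms."""

    def __init__(self, strip_html=True):
        self.strip_html = strip_html
        self.sources = {}   # doc id -> original text, never modified
        self.postings = {}  # term -> set of doc ids

    def index(self, doc_id, body):
        self.sources[doc_id] = body
        for term in analyze(body, self.strip_html):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


if __name__ == "__main__":
    idx = ToyIndex(strip_html=True)
    idx.index(1, "<strong>hello</strong> world")
    print(idx.search("strong"), idx.search("hello"), idx.sources[1])
```

With strip_html=True, a search for "strong" finds nothing while the stored text still contains the tag. With strip_html=False the same search matches, which is the symptom to expect when the stripping analyzer is not actually applied to the field.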

On Mon, Dec 19, 2011 at 4:50 PM, Sam sam.brauer@gmail.com wrote:

I tried adding the suggested configuration to elasticsearch.yml, but
it still seems that HTML isn't being stripped.
I'm stumped enough to give up on this approach for now and extract the
plain text from my HTML fields before indexing.
I still wish I knew how to make this work. I like the idea of
offloading the HTML stripping from my application.

(Sam-3) #6

Right. I'm not surprised to see the html markup in the stored values.
But I am surprised to get a search hit for "strong" when that string
only occurs as an element name.
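
One way to narrow this down is to ask the index which terms the analyzer actually produces for a piece of markup, using the _analyze API. The sketch below only builds the request and sends nothing. It assumes the query-parameter form of _analyze used by releases of this era (newer releases expect a JSON body instead); the host, the index name "test", and the helper name analyze_request are assumptions for illustration.

```python
from urllib.parse import quote, urlencode


def analyze_request(index, analyzer, text, host="http://localhost:9200"):
    """Return (url, body) for an _analyze call; nothing is sent.

    The analyzer name goes in the query string and the text to analyze
    is the raw request body.
    """
    query = urlencode({"analyzer": analyzer})
    url = "%s/%s/_analyze?%s" % (host, quote(index, safe=""), query)
    return url, text.encode("utf-8")


if __name__ == "__main__":
    url, body = analyze_request("test", "standard_html_strip", "<strong>hi</strong>")
    print(url)
    print(body)
```

If the response lists "strong" among the tokens, the analyzer is either not defined on that index or not stripping markup. If it does not, the mapping for the field is probably not using the analyzer.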

On Dec 19, 1:37 pm, Shay Banon kim...@gmail.com wrote:

Just to make sure: the HTML stripping provided as part of the analysis
process will not cause the tags to be stripped from the _source; only the
indexed terms will have the HTML stripped.

(ppearcy) #7

I need to strip HTML before inserting into source and just hooked into
the filter directly in my app. I did this because I needed to
guarantee the highlight blurbs I returned contained valid HTML.
However, it is probably a relatively common case for users to want to
strip HTML before adding to source. Might be a good feature?

Thanks!
Paul

On Dec 19, 8:46 pm, Sam sam.bra...@gmail.com wrote:

Right. I'm not surprised to see the html markup in the stored values.
But I am surprised to get a search hit for "strong" when that string
only occurs as an element name.

(Shay Banon) #8

Stripping before "adding" the source on the elasticsearch side might get into
the area where elasticsearch starts doing things that are not in its realm,
which complicates it.

On Tue, Dec 20, 2011 at 10:28 AM, ppearcy ppearcy@gmail.com wrote:

I need to strip HTML before inserting into source and just hooked into
the filter directly in my app. I did this because I needed to
guarantee the highlight blurbs I returned contained valid HTML.
However, it is probably a relatively common case for users to want to
strip HTML before adding to source. Might be a good feature?

Thanks!
Paul
