Indexing HTML

Hi all,

Been trying to figure this out and seem to be having no luck, so I thought
I'd throw my question up here.

Goal:

  • I want to be able to store HTML in a document, but not have it indexed.
    For example, I don't want searches for "<span style" to return results,
    but I want the raw HTML saved.

The use case is basically: We have a bunch of HTML documents, a user
performs a search, and these documents are rendered in the search results in
their native HTML. I'd like to highlight the results, but at this point I'd
be content just not screwing up the search results.

Things I've tried:

Field mapping, changing the default analyzer, using html_strip, setting
include_in_all to false, setting index to no and store to yes, etc.

What ends up happening in most cases is ES just ignores what I'm doing
(despite the mapping showing up correctly configured in the index), letting
searches like _all:<span work when that field shouldn't be indexed or
included in the _all field at all. In some cases I'd lose the
default analyzer and gain case sensitivity, so I stopped trying to do
this globally.

This seems like something that should be straightforward, but I've wasted a
day on it. Any ideas? I've used the normal documentation, googled various
groups, etc., and just can't seem to get it to work.

Thanks for your time all.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I would probably store it as Base64-encoded binary content.
That way, Elasticsearch won't touch it at all.

My 2 cents.
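
For anyone wanting to try the Base64 route, here is a minimal client-side sketch in Python. It only shows the encoding round trip. The field names (title, raw_html) are made up for illustration, and it assumes the HTML field is mapped so that ES treats it as opaque binary data rather than as an analyzed string.

```python
import base64
import json


def encode_html(raw_html: str) -> str:
    """Base64-encode raw HTML so it can be stored as an opaque string."""
    return base64.b64encode(raw_html.encode("utf-8")).decode("ascii")


def decode_html(stored: str) -> str:
    """Reverse encode_html() when rendering a search hit."""
    return base64.b64decode(stored).decode("utf-8")


raw_html = '<span style="color: red">Hello</span>'

# Hypothetical document layout: one searchable field, one opaque field.
doc = {
    "title": "Hello",
    "raw_html": encode_html(raw_html),
}

# JSON body that would be sent to the index request.
body = json.dumps(doc)
print(body)
```

The Base64 alphabet contains no angle brackets, so a search for "<span" has nothing to match in the encoded value. The trade-off is that the stored value is roughly a third larger than the original and has to be decoded before rendering.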

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

On 22 Apr 2013, at 23:24, Nathan Akeris nathan.akeris@gmail.com wrote:

I want to be able to store HTML in a document, but not have it indexed.

Thanks, that worked well. I guess I'm still confused as to what went wrong
before, though. I'd like to use ES the way it's intended. Do these
features not work as described, or is there some sort of trick to getting
them to work? Has anyone else had these issues, or is this type of request
normally straightforward?

Thanks again.

On Monday, April 22, 2013 5:28:37 PM UTC-4, David Pilato wrote:

I would probably store it as Base64-encoded binary content.

If you just want to store the HTML content and not search on it (i.e., not
indexed), then just set index=no and include_in_all=false in your mapping
properties.
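
As a rough sketch of what that mapping could look like, here it is built as a Python dict and dumped to JSON. The type name (page) and field names (raw_html, text) are hypothetical. The string-valued index setting is the style used in this thread; later Elasticsearch releases changed the mapping syntax, so check the docs for your version before copying this.

```python
import json

# Hypothetical type and field names, for illustration only.
mapping = {
    "page": {
        "properties": {
            # Raw HTML: kept in the document, but not searchable and not
            # copied into the _all field.
            "raw_html": {
                "type": "string",
                "index": "no",
                "include_in_all": False,
            },
            # Plain-text version of the content that searches should hit.
            "text": {"type": "string"},
        }
    }
}

# JSON body that would be sent when putting the mapping.
mapping_json = json.dumps(mapping, indent=2)
print(mapping_json)
```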

-Eric

On Monday, April 22, 2013 5:56:00 PM UTC-4, Nathan Akeris wrote:

Thanks, that worked well.

ES isn't very aware of HTML tags, other than the char filter being able to
strip them.

You're going to run into problems with highlighting that I don't think can
be worked around. The highlighter doesn't pay attention to existing HTML
structure, so when highlighting tags are inserted they can end up producing
invalid HTML nesting. Stuff like Match1Match2NoMatch
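
The nesting problem can be reproduced outside ES with a few lines of Python. This is only an illustration: it assumes highlight tags are placed purely by character offset, with no knowledge of the markup around them. The wrap_span helper and the sample string are made up for the example.

```python
def wrap_span(text: str, start: int, end: int,
              pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap text[start:end] in highlight tags without looking at any markup."""
    return text[:start] + pre + text[start:end] + post + text[end:]


html_source = "<b>quick brown</b> fox"

# Pretend the query matched "brown fox", which straddles the closing </b>.
start = html_source.index("brown")
end = html_source.index("fox") + len("fox")

highlighted = wrap_span(html_source, start, end)
print(highlighted)  # <b>quick <em>brown</b> fox</em>
```

The em element opens inside the b element but closes outside it, which is exactly the kind of invalid nesting described above.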

The way I deal with this is that all HTML I index gets run through:
$clean_content = html_entity_decode( strip_tags( $content ) );

Then I use the IDs attached to the ES documents in the search results to
retrieve the original HTML. When I'm doing highlighting I ignore the
original HTML structure. Not perfect, but it works pretty well and ensures
that HTML tags never cause a problem.
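
For non-PHP users, a rough Python equivalent of that approach might look like the sketch below. It uses the standard library's html.parser, which is not the same implementation as PHP's strip_tags, so edge cases may differ. The originals dict is a hypothetical stand-in for wherever the original HTML is kept, keyed by the same ID as the ES document.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops every tag."""

    def __init__(self):
        # convert_charrefs=True means entities such as &amp; arrive decoded.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_content(raw_html: str) -> str:
    """Return the text of raw_html with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical store of original HTML, keyed by document ID.
originals = {"doc-1": "<p>Fish &amp; <b>chips</b></p>"}

# Only the cleaned text is sent to ES; the ID links back to the original.
to_index = {doc_id: {"content": clean_content(raw)}
            for doc_id, raw in originals.items()}

print(to_index["doc-1"]["content"])  # Fish & chips
```

At render time, the ID on each search hit is used to look the original HTML up again, so the markup never goes through the analyzer or the highlighter.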

-Greg

On Monday, April 22, 2013 3:24:33 PM UTC-6, Nathan Akeris wrote:

I want to be able to store HTML in a document, but not have it indexed.