Stripping html for indexing only?


(IronMike) #1

I searched this topic but some of the answers were still vague to me.

My goal is to index HTML docs with the HTML stripped for indexing. At the
same time, I would like _source to keep the original HTML document for
display purposes.

//My doc format:
{
  "content": "Hello this is an html content ....",
  "rank": 1,
  "date": "2014-8-8",
  "title": "Some title",
  ....
}

The questions that I am still not very clear on:

1 - If I understand correctly, I can push the HTML doc as-is to the index,
and it will strip the HTML provided I set up the char filter referenced here?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

2 - Does the stripping leave _source untouched? In other words, will
_source still have the original HTML?

3 - Does highlighting come from _source? If so, highlights will contain
HTML, meaning I will have to strip any HTML tags after the search comes
back?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6be77d25-f7fe-4a35-a247-932f93f07150%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2
  1. Correct.
  2. Also correct. The analysis chain only affects how the terms are indexed
    and placed in the inverted index. The original document remains as is.
  3. Not sure since I have never done highlighting. Highlighting might not
    depend on the source since the term positions/offsets are used, but
    hopefully someone will correct me.
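
Points 1 and 2 can be illustrated with a toy model. This is only a sketch
in Python, not how Elasticsearch is implemented: strip_html, analyze and
ToyIndex are hypothetical names, the regex is a crude stand-in for the
html_strip char filter, and the sample HTML is made up. It shows the terms
being built from stripped text while the stored source keeps its markup.

```python
import re
from html import unescape


def strip_html(text):
    """Toy stand-in for an html_strip char filter: drop tags, decode entities."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def analyze(text):
    """Toy analyzer: strip markup first, then lowercase and split into terms."""
    return re.findall(r"\w+", strip_html(text).lower())


class ToyIndex:
    """Keeps the original document and a separate term -> doc ids lookup."""

    def __init__(self):
        self.sources = {}   # doc id -> original document, never modified
        self.postings = {}  # term -> set of doc ids (the "inverted index")

    def index(self, doc_id, field, source):
        self.sources[doc_id] = source
        for term in analyze(source[field]):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, word):
        ids = sorted(self.postings.get(word.lower(), ()))
        return [self.sources[i] for i in ids]


# Hypothetical sample document; the markup is invented for illustration.
doc = {"content": "<p>Hello this is <b>html</b> content</p>", "rank": 1}
idx = ToyIndex()
idx.index(1, "content", doc)

print(analyze(doc["content"]))            # terms built from the stripped text
print(idx.search("html")[0]["content"])   # stored source still has its tags
```

Searching for a tag name such as "b" finds nothing in this model because
the tags never reach the term list, yet every hit hands back the document
exactly as it was submitted.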

--
Ivan

On Wed, Aug 6, 2014 at 11:45 AM, IronMike sabdalla80@gmail.com wrote:



(IronMike) #3

Thanks. I tried the simple example below and it doesn't seem to strip the
HTML. What's missing?

//yml
index :
    analysis :
        analyzer :
            messageAnalyzer :
                type : custom
                tokenizer : standard
                filter : standard
                char_filter : [my_html]
        char_filter :
            my_html :
                type : html_strip
                read_ahead : 1024

//Create Index
PUT /twitter
{
  "mappings": {
    "message" : {
      "properties" : {
        "message" : {
          "type" : "string",
          "analyzer": "messageAnalyzer"
        },
        "date" : {
          "type" : "date"
        },
        "name" : {
          "type" : "string"
        }
      }
    }
  }
}

//Index a document
PUT /twitter/tweet/1
{
  "name" : "mike",
  "date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch, This is an html test"
}

//Search for "Elasticsearch" still yields html
"fields": {
  "message": [
    "trying out Elasticsearch, This is an html test"
  ]
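
This post got no reply in the thread. One guess, not confirmed by anyone
here: the mapping is declared for a type named "message", while the
document is indexed into type "tweet" (PUT /twitter/tweet/1), so
messageAnalyzer may never be applied to that document. Separately, the
value returned under "fields" is the stored original, so it would still
show html even when stripping works (see answer 2 in #2). The Python
sketch below only builds and prints request bodies that use one type name
and carry the analysis section in the index settings instead of the yml.
DOC_TYPE, create_index_body and index_doc_path are hypothetical names, and
the request layout assumes the 2014-era API used in this thread.

```python
import json

# Assumption: one type name shared by the mapping and the indexing request.
DOC_TYPE = "tweet"

create_index_body = {
    "settings": {
        "analysis": {
            # Values copied from the yml in the post.
            "char_filter": {
                "my_html": {"type": "html_strip", "read_ahead": 1024}
            },
            "analyzer": {
                "messageAnalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["standard"],
                    "char_filter": ["my_html"],
                }
            },
        }
    },
    "mappings": {
        DOC_TYPE: {
            "properties": {
                "message": {"type": "string", "analyzer": "messageAnalyzer"},
                "date": {"type": "date"},
                "name": {"type": "string"},
            }
        }
    },
}

# The document path uses the same type name as the mapping above.
index_doc_path = "/twitter/{}/1".format(DOC_TYPE)

print("PUT /twitter")
print(json.dumps(create_index_body, indent=2))
print("PUT " + index_doc_path)
```

Whether the stripping takes effect is better judged from which queries
match than from the returned field value, since the returned value is the
original text either way.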

On Wednesday, August 6, 2014 2:59:53 PM UTC-4, Ivan Brusic wrote:


