I searched this topic but some of the answers were still vague to me.
My goal is to index html docs with the html stripped for indexing. At the
same time, I would like _source to keep the original html document for
display purposes.
//My doc format:
{
  "content": "Hello this is an html content ....",
  "rank": 1,
  "date": "2014-8-8",
  "title": "Some title",
  ....
}
The questions that I am still not very clear on:
1 - If I understand correctly, I can push the html doc as it is to the
index, and it will strip the html provided I set up the char filter
referenced here?
2 - Will the stripping leave _source unaffected? In other words, will
_source still have the original html?
3 - Does highlighting come from _source? If so, highlights will contain
html, meaning I will have to strip any html tags after the search comes
back?
1 - Correct.
2 - Also correct. The analysis chain only affects how the terms are indexed
and placed in the inverted index. The original document remains as is.
3 - Not sure, since I have never done highlighting. Highlighting might not
depend on the source since the term positions/offsets are used, but
hopefully someone will correct me.
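For question 1, here is a minimal sketch of index settings that use the
built-in html_strip char filter, written against the Elasticsearch 1.x API
in use at the time of this thread. The index name twitter, type tweet, and
field message follow the example further down the thread; the analyzer name
html_text is made up for illustration. The body would be sent with
PUT /twitter when creating the index:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "tweet": {
      "properties": {
        "message": { "type": "string", "analyzer": "html_text" }
      }
    }
  }
}
```

Only the indexed terms pass through html_text; _source is stored as
submitted. Newer Elasticsearch versions use the text field type and no
longer have mapping types, so the mappings section would need adjusting
there.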
//Index a document
PUT /twitter/tweet/1
{
  "name" : "mike",
  "date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch, This is an html test"
}
//Searching for "Elasticsearch" still returns the html in the field
"fields": {
  "message": [
    "trying out Elasticsearch, This is an html test"
  ]
}
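To check question 3 directly, one could ask for highlighting on the same
field. A sketch of the search body, assuming the message field from the
example above, sent to GET /twitter/tweet/_search:

```json
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "highlight": {
    "fields": {
      "message": {}
    }
  }
}
```

As far as I know, the highlighted fragments are cut from the original field
text rather than from the analyzed terms, so any markup in the document
should be expected to show up in them next to the highlighter's own em
tags. Comparing the highlight section of the response with _source on a
test document is the simplest way to confirm this for a given version.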
On Wednesday, August 6, 2014 2:59:53 PM UTC-4, Ivan Brusic wrote:
Correct.
Also correct. The analysis chain only affects how the terms are indexed
and placed in the inverted index. The original document remains as is.
Not sure since I have never done highlighting. Highlighting might not
depend on the source since the term positions/offsets are used, but
hopefully someone will correct me.
--
Ivan
On Wed, Aug 6, 2014 at 11:45 AM, IronMike wrote:
[original question, quoted in full at the top of the thread]