HTML-stripped highlighted text from an HTML content field


(vineeth mohan) #1

Hi ,

I have a content field that contains HTML.
Is it possible to obtain the highlighted content with the HTML stripped?

Thanks
Vineeth


(vineeth mohan) #2

Helooooo ,

Some help here would be greatly appreciated ....

Thanks
Vineeth



(Clinton Gormley) #3

Have you tried typing 'html' into the search box on the website :wink:

Look at the first result that it suggests

clint


(vineeth mohan) #4

Yes, I did try this out:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/htmlstrip-charfilter.html

But the HTML stripper only makes sure that the HTML-stripped text is passed on
for indexing.
It doesn't make sure that the stripper is applied when the highlighted text is
extracted from _source.

Thanks
Vineeth


(vineeth mohan) #5

Currently I am getting the highlighted HTML back and extracting the useful
text from it myself.
It would help a lot if someone could point to a way to do the same more
efficiently in ES.

Thanks
Vineeth


(Shay Banon) #6

The html stripping part only applies during tokenization, so the
highlighting will get you back the actual HTML content. You will need to
strip HTML yourself if you want that behavior.
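
A minimal sketch of what stripping the HTML yourself might look like on the client side, assuming the highlighter is using its default `<em>` markers and that a regex-based cleanup is acceptable. The class and method names are hypothetical and not part of Elasticsearch. It removes every tag except the highlight markers. It does not decode entities, cannot tell a document's own `<em>` tags from highlight markers, and will not catch a tag that the fragment boundary cut in half.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Elasticsearch API.
final class HighlightFragmentCleaner {

    // Matches any tag except the bare <em> and </em> highlight markers.
    private static final Pattern NON_HIGHLIGHT_TAG = Pattern.compile("<(?!/?em>)[^>]*>");

    // Removes all HTML tags from a highlighted fragment but keeps <em>...</em>.
    static String stripTagsKeepHighlight(String fragment) {
        return NON_HIGHLIGHT_TAG.matcher(fragment).replaceAll("");
    }

    public static void main(String[] args) {
        String fragment = "<p>A small <b><em>house</em></b> by the sea</p>";
        // prints: A small <em>house</em> by the sea
        System.out.println(stripTagsKeepHighlight(fragment));
    }
}
```

For anything beyond simple markup, a real HTML parser is likely a safer choice than a regular expression.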


(vineeth mohan) #7

Thanks Shay,

But then I have to check whether we would be able to achieve this behavior
by using the attachment type.

Thanks
Vineeth


(pellyadolfo) #8

Hi, I searched the documentation and the internet but could not find any
accurate information on this.

I have a highlight query which is working properly:

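    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.index.query.QueryBuilders;

    // getClient() is the poster's own helper that returns an Elasticsearch Client.
    SearchResponse response = getClient().prepareSearch()
        .setIndices("myindex")
        .setTypes("mytype")
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .setQuery(QueryBuilders
            .boolQuery()
            .should(QueryBuilders.matchQuery("myfield", "house"))
        )
        // Highlight "myfield": fragment size 250, one fragment per hit.
        .addHighlightedField("myfield", 250, 1)
        .setFrom(0)
        .setSize(25)
        .execute()
        .actionGet();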

The query fetches results from myfield, which contains indexed HTML
content. The highlighted result contains HTML tags, and I would like to strip
the HTML out of the response. I found the HTML Strip Char Filter
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html)
but do not know the syntax for adding it as a request analyzer in Java.

I have found examples in Java that create indices including the analyzer
(http://jaibeermalik.wordpress.com/2013/03/26/elasticsearch-text-analysis-for-content-enrichment/),
but none that include the analyzer in a Java request, which the documentation
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html)
says is possible:

The index analysis module acts as a configurable registry of Analyzers
that can be used in order to both break indexed (analyzed) fields when a
document is indexed and process query strings

Any pointer to an example would be much appreciated.

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d83716b0-1461-4796-9d03-b7d7cb268ef7%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #9

Hey,

You could use the analyze API with the char_filter to get the extracted text
back in parts, see https://gist.github.com/clintongormley/780895
However, Elasticsearch does not store the text without the HTML anywhere as
a complete block that you could read back. If you want that, you
would need to strip the HTML before indexing.
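
One way to strip the HTML before indexing, sketched under assumptions: the document is built with both the raw HTML and a plain-text copy, so that searching and highlighting can target the plain-text field. The field names `content_html` and `content_text` and the helper class are hypothetical, and the tag removal is a crude regex that does not decode entities or drop the contents of script and style elements.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Elasticsearch API.
final class PlainTextFieldBuilder {

    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces each tag with a space, then collapses runs of whitespace.
    static String toPlainText(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    // Builds a document source holding the raw HTML and its plain-text copy.
    static Map<String, Object> buildSource(String html) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("content_html", html);              // assumed field name
        source.put("content_text", toPlainText(html)); // assumed field name
        return source;
    }

    public static void main(String[] args) {
        Map<String, Object> source = buildSource("<h1>Title</h1><p>A small <b>house</b></p>");
        // prints: Title A small house
        System.out.println(source.get("content_text"));
    }
}
```

The resulting map could then be passed as the document source when indexing, with the highlight request pointed at the plain-text field instead of the HTML one.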

The char_filter is basically there to make sure that a search for 'title' will
not match every web page just because it contains a '<title>' tag.

I'm not a hundred percent sure this was your question, so feel free to ask
further and tell me where I might have misunderstood you.

--Alex


