How to get char_filter to work?

Ivan,

A follow-up question. As I mentioned earlier, storing HTML and applying a
char_filter doesn't really work for me, especially since highlighted fields
come back with broken HTML when displayed.
So I am thinking of stripping the HTML before indexing, so that neither the
index nor the source contains HTML, and adding an extra field such as
"html_content" that stores the HTML version and is not indexed.
Do you see any problems with this approach? One I can see is a bigger index.
What do you recommend as an ideal solution? I am still confused, as I
thought this would be a common problem.
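
To make the idea concrete, here is a minimal client-side sketch of what I mean, in Python using only the standard library. The field names "content" and "html_content" and the helper names are just my placeholders, and the mapping fragment assumes the 1.x "string" type with "index": "no"; treat it as an illustration, not a tested setup.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text of `html` with tags dropped and whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def build_document(html):
    """Build the body to index: plain text for search, raw HTML for display."""
    return {"content": strip_html(html), "html_content": html}


# Hypothetical mapping fragment (assumed Elasticsearch 1.x syntax):
# "html_content" stays in _source but is not indexed, so it is not searchable.
MAPPING = {
    "properties": {
        "content": {"type": "string"},
        "html_content": {"type": "string", "index": "no"},
    }
}

doc = build_document("<p>The quick <b>brown</b> fox</p>")
print(doc["content"])
```

With this, excerpts and highlights would come from "content", and the page would be rendered from "html_content". Note that this simple parser also keeps the text inside script and style elements, so real documents may need extra handling.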

On Friday, August 8, 2014 8:16:09 PM UTC-4, IronMan wrote:

Thanks again. I wasn't expecting it to remove what's between the tags. I
believe I understand the behavior now; maybe I was being greedy and
expecting Elasticsearch to do it all.
Here is the scenario I had in mind: assume I want an excerpt of text
extracted from a document. An Elasticsearch query will give me the excerpt
with HTML tags, but the tags are out of context, so I would have liked to
display the excerpt without them. I know I can probably strip the tags
after the fact, but that's what I was trying to avoid. In other words, in a
perfect world I would have two versions of the document: the original HTML
one and a stripped one. When I need things like excerpts, I would query the
stripped one, and when I need the HTML, I would query the source. Hopefully
I didn't make this more confusing.

On Friday, August 8, 2014 4:58:03 PM UTC-4, Ivan Brusic wrote:

The tokens that appear in the analyze API are the ones that are put into
the inverted index. When you search for one of the terms that is not an
HTML tag, there will be a match. What I don't understand, after reading
your original post in detail, is exactly what behavior you are expecting.

You indexed the phrase

trying out Elasticsearch, This is an html test

but you expected a query for the term "html" not to match. However, the
word "html" is clearly in the content. The HTML stripper will not remove
the contents in between the tags, just the tags themselves. The analyze API
should show you the correct terms.

Lucene has more control over what information you can retrieve, but the
only way to get the analyzed token stream back from Elasticsearch is to use
the analyze API on the field. Most people do not want an analyzed token
stream, just the original field.
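
As a small illustration, the request could be assembled like this. It is only a sketch that builds the URL without sending it; the function name `analyze_url` is made up here, and it assumes the 1.x form of the analyze API, which accepts `field` and `text` as query-string parameters.

```python
from urllib.parse import quote, urlencode


def analyze_url(base, index, field, text):
    """Build an _analyze URL that uses the analyzer mapped to `field`."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    params = urlencode({"field": field, "text": text}, quote_via=quote)
    return f"{base}/{index}/_analyze?{params}"


url = analyze_url("http://localhost:9200", "twitter", "message", "an html test")
print(url)
```

Fetching that URL should list the tokens produced by whatever analyzer is mapped to the field, which are the terms that end up in the inverted index.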

--
Ivan

On Fri, Aug 8, 2014 at 12:01 PM, IronMike sabda...@gmail.com wrote:

Also, here is a link to a thread from someone who had the same problem. I am
not sure whether it ever got a final answer:
http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip
I have to admit that I am a bit confused about this topic now. I understand
that analyzers tokenize the sentence and, in the case of html_strip, strip
the HTML, and _analyze works fine with the analyzer. What I am failing to
understand is how I can get at the resulting tokens. Isn't the whole idea
to be able to search for those tokens eventually?

If not, what is the solution for what I would think is a common scenario:
indexing HTML documents where the tags don't need to be indexed, while
keeping the original HTML for presentation? Any ideas (besides stripping
the HTML tags manually before indexing)?

On Friday, August 8, 2014 1:02:07 PM UTC-4, IronMike wrote:

Thanks for explaining. So, is there a way to get the non-HTML text back
from the index? I thought I read that it was possible to index without the
HTML tags while keeping the source intact. So how would I get at the
indexed text without the tags, if you will?

On Friday, August 8, 2014 12:52:37 PM UTC-4, Ivan Brusic wrote:

The field is derived from the source and not generated from the tokens.

If we indexed the sentence "The quick brown foxes jumped over the lazy
dogs" with the english analyzer, the tokens would be

http://localhost:9200/_analyze?text=The%20quick%20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english

quick brown fox jump over lazi dog

After applying stopwords and stemming, the tokens do not form a
sentence that looks like the original.

--
Ivan

On Fri, Aug 8, 2014 at 9:42 AM, IronMike sabda...@gmail.com wrote:

Ivan,

The search results I am showing are for the field "title", not for the
source. I thought I could query the field rather than the source and see it
without HTML while the source stayed intact. Did I misunderstand?

On Friday, August 8, 2014 12:36:16 PM UTC-4, Ivan Brusic wrote:

The analyzers control how text is parsed/tokenized and how terms are
indexed in the inverted index. The source document remains untouched.
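
A toy model may make the split clearer. The sketch below is not how Elasticsearch or Lucene is implemented; `ToyIndex` and `analyze` are invented names, and the regex-based tag stripping only stands in for the html_strip char filter. It shows that analysis decides which terms match, while the hit that comes back is the original value.

```python
import re
from collections import defaultdict

TAG = re.compile(r"<[^>]+>")


def analyze(text):
    """Toy analysis chain: strip tags, lowercase, split on non-word characters."""
    return re.findall(r"\w+", TAG.sub(" ", text).lower())


class ToyIndex:
    def __init__(self):
        self.sources = {}                 # doc id -> original value, untouched
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def index(self, doc_id, field_value):
        self.sources[doc_id] = field_value
        for term in analyze(field_value):
            self.postings[term].add(doc_id)

    def search(self, query):
        ids = set()
        for term in analyze(query):
            ids |= self.postings.get(term, set())
        # Hits are read from the stored originals, never rebuilt from terms.
        return [self.sources[i] for i in sorted(ids)]


idx = ToyIndex()
idx.index(1, "The quick <b>brown</b> fox")
print(idx.search("brown"))  # the hit is the original string, tags included
print(idx.search("b"))      # the tag name was never indexed, so no hits
```

Searching for "brown" matches because the term is in the index, yet the returned value still contains the tags, which is the behavior you are seeing with "title".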

--
Ivan

On Fri, Aug 8, 2014 at 9:24 AM, IronMike sabda...@gmail.com wrote:

I also used Clint's example and tried to map it to a document and
search the field, but still getting html in query results... Here is my
code. I appreciate the help.

//Tokenizer

PUT /foo/
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_1": {
            "char_filter": [
              "html_strip"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}

//Mapping
PUT /foo/foo_type/_mapping
{
  "foo_type": {
    "properties": {
      "title": {
        "type": "string",
        "index": "analyzed",
        "analyzer": "test_1"
      }
    }
  }
}

GET /foo/foo_type/_mapping
{
  "foo": {
    "mappings": {
      "foo_type": {
        "properties": {
          "date": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "title": {
            "type": "string",
            "analyzer": "test_1"
          }
        }
      }
    }
  }
}

//Index
PUT /foo/foo_type/1
{
  "date": "2009-11-15T14:12:12",
  "title": "The quick & brown fox"
}

//Search
GET /foo/_search?pretty=true
{
  "fields": ["title"],
  "query": {
    "query_string": {
      "query": "brown",
      "analyzer": "test_1"
    }
  }
}

//Results still showing html tags
"hits": [
  {
    "_index": "foo",
    "_type": "foo_type",
    "_id": "1",
    "_score": 0.076713204,
    "fields": {
      "title": [
        "The quick & brown fox"
      ]
    }

On Thursday, August 7, 2014 6:06:56 PM UTC-4, Jörg Prante wrote:

Have you checked Clint's example?

HTML Strip charfilter test for ElasticSearch · GitHub

Jörg

On Thu, Aug 7, 2014 at 8:23 PM, IronMike sabda...@gmail.com
wrote:

I would like to strip HTML tags for indexing. Here is a simple example I
have tried so far, but it doesn't seem to strip the HTML tags. Any ideas
what's missing?

//Settings & mappings
POST twitter
{
  "mappings": {
    "tweet": {
      "properties": {
        "message": {
          "type": "string",
          "analyzer": "strip_html_analyzer"
        },
        "date": {
          "type": "date"
        },
        "name": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "strip_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": "standard",
          "char_filter": "my_html"
        }
      },
      "char_filter": {
        "my_html": {
          "type": "html_strip"
        }
      }
    }
  }
}

//Index a document
PUT /twitter/tweet/1
{
  "name": "mike",
  "date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch, This is an html test"
}

//Query result for "html". I expect the query to return nothing, since it is supposed to strip the tag?
"hits": {
  "total": 1,
  "max_score": 0.11626227,
  "hits": [
    {
      "_index": "twitter",
      "_type": "tweet",
      "_id": "1",
      "_score": 0.11626227,
      "fields": {
        "message": [
          "trying out Elasticsearch, This is an html test"
        ]
      },
      "highlight": {
        "message": [
          "trying out Elasticsearch, This is an html test"
        ]
      }
    }
  ]
}

--
You received this message because you are subscribed to the
Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearc...@googlegroups.com.

To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b38-4646-bc8f-a27896454515%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
