How to get char_filter to work?

IronMike · August 7, 2014, 6:23pm

I would like to strip HTML tags at index time. Here is a simple example I
have tried so far, but it doesn't seem to strip the HTML tags. Any ideas what's missing?

//settings & Mappings
POST twitter
{
  "mappings": {
    "tweet": {
      "properties": {
        "message": {
          "type": "string",
          "analyzer": "strip_html_analyzer"
        },
        "date": {
          "type": "date"
        },
        "name": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "strip_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": "standard",
          "char_filter": "my_html"
        }
      },
      "char_filter": {
        "my_html": {
          "type": "html_strip"
        }
      }
    }
  }
}

//Index a document
PUT /twitter/tweet/1
{
  "name": "mike",
  "date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch, This is an html test"
}

//Query result for "html". I expected the query to return nothing, since the tag is supposed to be stripped.
"hits": {
  "total": 1,
  "max_score": 0.11626227,
  "hits": [
    {
      "_index": "twitter",
      "_type": "tweet",
      "_id": "1",
      "_score": 0.11626227,
      "fields": {
        "message": [
          "trying out Elasticsearch, This is an html test"
        ]
      },
      "highlight": {
        "message": [
          "trying out Elasticsearch, This is an html test"
        ]
      }
    }
  ]
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b38-4646-bc8f-a27896454515%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · August 7, 2014, 10:06pm

Have you checked Clint's example?

https://gist.github.com/clintongormley/780895

html_strip.sh

# Analyze text: "the <b>quick</b> bröwn <img src="fox"/> &quot;jumped&quot;"

curl -XPUT 'http://127.0.0.1:9200/foo/'  -d '
{
   "index" : {
      "analysis" : {
         "analyzer" : {
            "test_1" : {
               "char_filter" : [
                  "html_strip"

(The gist preview is truncated here; the full file is at the link above.)

Jörg

IronMike · August 8, 2014, 2:59pm

Thanks for the reply. The _analyze command in the link you sent works, and
it also works with the analyzer from my example above. So why does the field
in the search results still come back with the HTML tags? Am I missing something?

IronMike · August 8, 2014, 3:08pm

Thanks for the reply. The example in the link works for me, and the
_analyze command also works fine with the analyzer from my example, so the
analyzer itself seems to be working. So what am I missing in the example
above? I still get the HTML tags back when I index a document containing
HTML.

IronMike · August 8, 2014, 4:24pm

I also used Clint's example and tried to map it to a document and search
the field, but I am still getting HTML in the query results. Here is my code.
I appreciate the help.

//Analyzer settings
PUT /foo/
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_1": {
            "char_filter": [
              "html_strip"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}

//Mapping
PUT /foo/foo_type/_mapping
{
  "foo_type": {
    "properties": {
      "title": {
        "type": "string",
        "index": "analyzed",
        "analyzer": "test_1"
      }
    }
  }
}

GET /foo/foo_type/_mapping
{
  "foo": {
    "mappings": {
      "foo_type": {
        "properties": {
          "date": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "title": {
            "type": "string",
            "analyzer": "test_1"
          }
        }
      }
    }
  }
}

//Index
PUT /foo/foo_type/1
{
  "date": "2009-11-15T14:12:12",
  "title": "The quick & brown fox"
}

//Search
GET /foo/_search?pretty=true
{
  "fields": ["title"],
  "query": {
    "query_string": {
      "query": "brown",
      "analyzer": "test_1"
    }
  }
}

//Results still showing HTML tags
"hits": [
  {
    "_index": "foo",
    "_type": "foo_type",
    "_id": "1",
    "_score": 0.076713204,
    "fields": {
      "title": [
        "The quick & brown fox"
      ]
    }
  }
]

Ivan · August 8, 2014, 4:36pm

The analyzers control how text is parsed/tokenized and how terms are
indexed in the inverted index. The source document remains untouched.
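
To make that split concrete, here is a minimal Python sketch of the idea. It is a toy model, not how Elasticsearch is implemented: the `ToyIndex` class, the crude regex tag matcher, and the sample document (modeled on the example line in Clint's gist) are all assumptions made for illustration.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, for illustration only


def analyze(text):
    """Toy analyzer: strip tags, decode entities, lowercase, split into word tokens."""
    stripped = unescape(TAG_RE.sub(" ", text))
    return re.findall(r"\w+", stripped.lower())


class ToyIndex:
    """Keeps the original document next to an inverted index built from its tokens."""

    def __init__(self):
        self.sources = {}   # doc id -> original document, stored unchanged
        self.postings = {}  # term -> set of doc ids containing that term

    def index(self, doc_id, doc, field):
        self.sources[doc_id] = doc
        for term in analyze(doc[field]):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        # Matching uses the analyzed terms; the hits return the stored source.
        return [self.sources[i] for i in sorted(self.postings.get(term.lower(), ()))]


idx = ToyIndex()
idx.index(1, {"title": "The <b>quick</b> &amp; brown fox"}, "title")

print(idx.search("b"))      # no hits: the tag name was never indexed as a term
print(idx.search("quick"))  # one hit, returned with its markup intact
```

In this sketch the char filter only affects which terms can be matched. What comes back in a hit is the document exactly as it was submitted.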

--
Ivan

IronMike · August 8, 2014, 4:42pm

Ivan,

The search results I am showing are for the field "title", not for the
source. I thought I could query the field rather than the source and see it
without HTML, while the source stayed intact. Did I misunderstand?

Ivan · August 8, 2014, 4:52pm

The field is derived from the source and not generated from the tokens.

If we indexed the sentence "The quick brown foxes jumped over the lazy
dogs" with the english analyzer, the tokens would be

http://localhost:9200/_analyze?text=The%20quick%20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english

quick brown fox jump over lazi dog

After applying stopwords and stemming, the tokens do not form a sentence
that looks like the original.
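
The same point can be reproduced offline with a small Python sketch. The stopword list and the suffix-stripping `toy_stem` function below are hypothetical stand-ins, not the real english analyzer, so the output differs slightly from the tokens above (for example, it keeps "lazy" instead of producing "lazi").

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list, not Lucene's


def toy_stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, split into words, drop stopwords, then stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(w) for w in words if w not in STOPWORDS]


sentence = "The quick brown foxes jumped over the lazy dogs"
tokens = analyze(sentence)

print(tokens)
# The tokens cannot be joined back into the original sentence:
print(" ".join(tokens) == sentence)
```

Because analysis discards information (case, stopwords, word endings), the text shown in a hit has to come from the stored source rather than from the tokens.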

--
Ivan

IronMike · August 8, 2014, 5:02pm

Thanks for explaining. So, is there a way to get the text without HTML back
from the index? I thought I had read that it was possible to index without
the HTML tags while keeping the source intact. If so, how would I retrieve
the indexed text without the HTML tags?

IronMike · August 8, 2014, 7:01pm

Also, here is a link to someone who had the same problem; I am not sure
whether there was a final answer to that one:
http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip

I have to admit that I am a bit confused about this topic now. I understand
that the analyzer tokenizes the sentence and, with html_strip, strips the
HTML, and _analyze works fine with the analyzer. What I am failing to
understand is how I can get at the resulting tokens. Isn't the whole idea
to be able to search for those tokens eventually?

If not, what is the solution for what I would think is a common scenario:
indexing HTML documents where the tags don't need to be indexed, while
keeping the original HTML for presentation? Any ideas (besides stripping
the HTML tags manually before indexing)?

On Friday, August 8, 2014 1:02:07 PM UTC-4, IronMike wrote:

Thanks for explaining. So, is there a way to get the non-HTML text back from
the index? I thought I read that it was possible to index without the HTML
tags while keeping the source intact. How would I get at the indexed,
tag-free version, if you will?

On Friday, August 8, 2014 12:52:37 PM UTC-4, Ivan Brusic wrote:

The field is derived from the source and not generated from the tokens.

If we indexed the sentence "The quick brown foxes jumped over the lazy
dogs" with the english analyzer, the tokens would be

http://localhost:9200/_analyze?text=The%20quick%20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english

quick brown fox jump over lazi dog

After applying stopwords and stemming, the tokens do not form a sentence
that looks like the original.

--
Ivan

On Fri, Aug 8, 2014 at 9:42 AM, IronMike sabda...@gmail.com wrote:

Ivan,

The search results I am showing are for the field "title", not for the
source. I thought I could query the field rather than the source and see it
with no HTML while the source stayed intact. Did I misunderstand?

On Friday, August 8, 2014 12:36:16 PM UTC-4, Ivan Brusic wrote:

The analyzers control how text is parsed/tokenized and how terms are
indexed in the inverted index. The source document remains untouched.

--
Ivan

On Fri, Aug 8, 2014 at 9:24 AM, IronMike sabda...@gmail.com wrote:

I also used Clint's example and tried to map it to a document and
search the field, but still getting html in query results... Here is my
code. I appreciate the help.

//Tokenizer

PUT /foo/
{
"settings": {
"index" : {
"analysis" : {
"analyzer" : {
"test_1" : {
"char_filter" : [
"html_strip"
],
"tokenizer" : "standard"
}
}
}
}
}
}

//Mapping
PUT /foo/foo_type/_mapping
{
"foo_type":{
"properties" : {
"title": {
"type":"string",
"index": "analyzed",
"analyzer":"test_1"
}
}
}
}

GET /foo/foo_type/_mapping
{
"foo": {
"mappings": {
"foo_type": {
"properties": {
"date": {
"type": "date",
"format": "dateOptionalTime"
},
"title": {
"type": "string",
"analyzer": "test_1"
}
}
}
}
}
}

////Index/////////////
PUT /foo/foo_type/1
{
"date" : "2009-11-15T14:12:12",
"title" : "The quick & brown fox"
}

//Search //////////
GET /foo/_search?pretty=true
{
"fields": ["title"],
"query": {
"query_string": {
"query": "brown",
"analyzer": "test_1"
}
}
}

//Results showing html tags still//////
"hits": [
{
"_index": "foo",
"_type": "foo_type",
"_id": "1",
"_score": 0.076713204,
"fields": {
"title": [
"The quick & brown fox"
]
}


Ivan · August 8, 2014, 8:57pm

The tokens that appear in the analyze API are the ones that are put into
the inverted index. When you search for one of the terms that is not an
HTML tag, there will be a match. What I don't understand, after reading
your original post in detail, is exactly what behavior you are expecting.

You indexed the phrase

trying out Elasticsearch, This is an html test

but you expected a query for the term "html" not to match. However, the
word "html" is clearly in the content. The html stripper will not remove
the content between the tags, just the tags themselves. The analyze API
should show you the correct terms.

Lucene offers more control over what information you can retrieve, but the
only way to get the analyzed token stream back from Elasticsearch is to use
the analyze API on the field. Most people do not want an analyzed token
stream, just the original field.

--
Ivan
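
For readers following along, here is a toy Python model of the behavior described above. It is not Elasticsearch or Lucene code. It assumes a regex as a rough stand-in for the html_strip char filter and a simple word splitter in place of the standard tokenizer, with lowercasing added for convenience. The sample markup and the names `analyze` and `ToyIndex` are made up for illustration. It is meant to show two things: tags never reach the inverted index while the words between them do, and a hit returns the stored source unchanged.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")   # crude stand-in for the html_strip char filter
WORD_RE = re.compile(r"\w+")      # crude stand-in for the standard tokenizer


def analyze(text):
    """Strip tags, then split the remaining text into lowercase word tokens."""
    stripped = TAG_RE.sub(" ", text)
    return [token.lower() for token in WORD_RE.findall(stripped)]


class ToyIndex:
    """Keeps the untouched source next to an inverted index of analyzed tokens."""

    def __init__(self):
        self.sources = {}                  # doc id -> original text
        self.postings = defaultdict(set)   # token -> ids of docs containing it

    def index(self, doc_id, text):
        self.sources[doc_id] = text        # the source is stored as-is
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def search(self, term):
        # Hits return the stored source, not the analyzed tokens.
        doc_ids = sorted(self.postings.get(term.lower(), ()))
        return [self.sources[doc_id] for doc_id in doc_ids]


# Hypothetical sample text; the <b> markup is an assumption for illustration.
toy = ToyIndex()
toy.index(1, "This is an <b>html</b> test")

print(analyze("This is an <b>html</b> test"))  # ['this', 'is', 'an', 'html', 'test']
print(toy.search("html"))                      # ['This is an <b>html</b> test']
print(toy.search("b"))                         # []
```

Searching for "html" matches because the word is part of the content, searching for the tag name "b" finds nothing, and the hit still carries the markup because it comes from the source rather than from the tokens.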


IronMike · August 9, 2014, 12:16am

Thanks again. I wasn't expecting it to remove what's between the tags. I
believe I understand the behavior now; maybe it's a case of me being greedy
and expecting Elasticsearch to do it all.
Here is the scenario I was looking for: assume I want an excerpt of text
(extracted text from a document). An Elasticsearch query will give me the
excerpt with HTML tags, but the tags are out of context, so I would have
liked to be able to display the excerpt with no HTML tags. I know I can
probably strip the tags after the fact, but that's what I was trying to
avoid. In other words, in a perfect world I would have liked two versions
of the document: the original HTML one and a stripped one. When I need
things like excerpts, I would query the stripped one, and when I need the
HTML, I would query the source. Hopefully I didn't make this more
confusing.
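
As a rough sketch of the "strip the tags after the fact" option mentioned above, the following Python uses only the standard library's `html.parser`. It assumes the excerpt comes back as a string that may contain unbalanced tags, and that the highlighter wraps matches in `<em>` (the Elasticsearch default, as far as I know; adjust `keep` if you configured other tags). The class and function names and the sample fragment are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class ExcerptCleaner(HTMLParser):
    """Drops every tag except the ones listed in `keep`."""

    def __init__(self, keep=("em",)):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.out.append(f"<{tag}>")      # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so the result is still safe to place in a page.
        self.out.append(escape(data, quote=False))


def clean_excerpt(fragment):
    cleaner = ExcerptCleaner()
    cleaner.feed(fragment)
    cleaner.close()                          # flush any buffered trailing text
    return "".join(cleaner.out)


# Hypothetical highlighted fragment with out-of-context tags.
fragment = 'out of context</b> this is an <em>html</em> test <a href="#">with'
print(clean_excerpt(fragment))
# out of context this is an <em>html</em> test with
```

This still runs on the client after the search, so it does not avoid the extra step; it only keeps it small.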


IronMike · August 13, 2014, 3:16pm

Ivan,

A follow-up question. As I mentioned earlier, storing HTML and applying a
char filter doesn't really work for me, especially with highlighted fields
coming back with weird HTML display.
So I am thinking of stripping the HTML before indexing, so there is no HTML
in the index or the source, and adding an extra field such as
"html_content" that stores the HTML version and is not indexed.
Do you see any problems with this approach? One I can see is a bigger
index. What would you recommend as an ideal solution? I am still puzzled,
as I thought this would be a common problem.
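
A minimal sketch of that plan in Python, using only the standard library. The field names `content` and `html_content` and the helper names are assumptions for illustration, and the mapping dictionary follows the 1.x-style syntax used elsewhere in this thread, where `"index": "no"` should keep a string field out of the inverted index; check it against your version before relying on it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)   # also decodes &amp; and friends
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed tags.
        return " ".join("".join(self.parts).split())


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_document(html):
    # Plain text for searching and highlighting, raw HTML for display only.
    return {"content": strip_html(html), "html_content": html}


# Assumed 1.x-style mapping: html_content is kept in the source but not indexed.
mapping = {
    "properties": {
        "content": {"type": "string"},
        "html_content": {"type": "string", "index": "no"},
    }
}

doc = build_document("<p>The <b>quick</b> &amp; brown fox</p>")
print(doc["content"])  # The quick & brown fox
```

Two caveats with this simple extractor: adjacent block elements with no whitespace between them end up joined into one word, and the contents of script or style elements are kept as text, so real pages would probably need a little more handling.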

On Friday, August 8, 2014 8:16:09 PM UTC-4, IronMan wrote:

Thanks again. I wasn't expecting it to remove what's between the tags. I
believe I understand the behavior and maybe its the case where I was greedy
and expecting Elasticsearch to do it all.
Here is a scenario that I was looking for: Assume I am looking to get an
excerpt of text (Extracted text from a document), Elastic Search query will
give me excerpt with html tags, but the tags are out of context, so I would
have liked to be to display this excerpt with no html tags, I know I can
probably strip the tags after the fact, but that's what I was trying to
avoid. In other words, in a perfect world, I would have liked 2 versions
of the document, the original html one and another stripped one. When I
need to query things like excerpts, I would query the stripped one, and
when I needed the html, I would query the source. Hopefully I didn't make
this more confusing.

On Friday, August 8, 2014 4:58:03 PM UTC-4, Ivan Brusic wrote:

The tokens that appear in the analyze API are the ones that are put into
the inverted index. When you search for one of the terms that is not an
HTML tag, there will be a match. What I don't understand after reading in
detail your original, is exactly what behavior you are expecting.

You indexed the phrase
trying out Elasticsearch, This is an html test
but you expected a query for the term "html" to not match. However, the
work "html" is clearly in the content. The html stripper will not remove
the contents in between the tags, just the tags themselve. The analyze API
should show you the correct term.

Lucene has more control over what information you can retrieve, but the
only way to get the analyzed token stream back from Elasticsearch is to use
the analyze API on the field. Most people do not want an analyzed token
stream, just the original field.

--
Ivan

On Fri, Aug 8, 2014 at 12:01 PM, IronMike sabda...@gmail.com wrote:

Also, Here is a link for someone who had the same problem, I am not sure
if there was a final answer to that one.
http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip
,
I have to admit that I am a bit confused now about this topic. I
understand analyzers will tokenize the sentence and strip html in the case
of the html_strip, and _analyze works fine using the analyzer, what I am
failing to understand, is how can I get the results of these tokens. Isn't
the whole idea to be able to search for them tokens eventually?

If not, whats the solution of what I would think is a common scenario,
having to index html documents, where html tags don't need to be indexed,
while keeping the original html for presentational purpose? Any ideas
(Besides having to strip html tags manually before indexing?

On Friday, August 8, 2014 1:02:07 PM UTC-4, IronMike wrote:

Thanks for explaining. So, is there a way to be able to get non html
from the index? I thought I read that it was possible to index without the
html tags while keeping source intact. So, how would I get at the index
with non html tags if you will?

On Friday, August 8, 2014 12:52:37 PM UTC-4, Ivan Brusic wrote:

The field is derived from the source and not generated from the tokens.

If we indexed the sentence "The quick brown foxes jumped over the lazy
dogs" with the english analyzer, the tokens would be

http://localhost:9200/_analyze?text=The%20quick%
20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english

quick brown fox jump over lazi dog

After applying stopwords and stemming, the tokens do not form a
sentence that looks like the original.

--
Ivan

On Fri, Aug 8, 2014 at 9:42 AM, IronMike sabda...@gmail.com wrote:

Ivan,

The search results I am showing is for the field "title" not for the
source. I thought I could query the field not the source and look at it
with no html while the source was intact. Did I misunderstand?

On Friday, August 8, 2014 12:36:16 PM UTC-4, Ivan Brusic wrote:

The analyzers control how text is parsed/tokenized and how terms are
indexed in the inverted index. The source document remains untouched.

--
Ivan

On Fri, Aug 8, 2014 at 9:24 AM, IronMike sabda...@gmail.com wrote:

I also used Clint's example and tried to map it to a document and
search the field, but still getting html in query results... Here is my
code. I appreciate the help.

//Tokenizer

PUT /foo/
{
"settings": {
"index" : {
"analysis" : {
"analyzer" : {
"test_1" : {
"char_filter" : [
"html_strip"
],
"tokenizer" : "standard"
}
}
}
}
}
}

//Mapping
PUT /foo/foo_type/_mapping
{
"foo_type":{
"properties" : {
"title": {
"type":"string",
"index": "analyzed",
"analyzer":"test_1"
}
}
}
}

Get /foo/foo_type/_mapping
{
"foo": {
"mappings": {
"foo_type": {
"properties": {
"date": {
"type": "date",
"format": "dateOptionalTime"
},
"title": {
"type": "string",
"analyzer": "test_1"
}
}
}
}
}
}

////Index/////////////
PUT /foo/foo_type/1
{
"date" : "2009-11-15T14:12:12",
"title" : "The quick & brown fox"
}

//Search //////////
GET /foo/_search?pretty=true
{
"fields": ["title"],
"query": {
"query_string": {
"query": "brown",
"analyzer": "test_1"
}
}
}

//Results showing html tags still//////
"hits": [
{
"_index": "foo",
"_type": "foo_type",
"_id": "1",
"_score": 0.076713204,
"fields": {
"title": [
"The quick & brown fox"
]
}

On Thursday, August 7, 2014 6:06:56 PM UTC-4, Jörg Prante wrote:

Have you checked Clint's example?

HTML Strip charfilter test for ElasticSearch · GitHub

Jörg

On Thu, Aug 7, 2014 at 8:23 PM, IronMike sabda...@gmail.com
wrote:

I would like to strip html tags for indexing. Here is a simple
example I tried so far, but doesn't seem to strip html tags. Any ideas
what's missing?

//settings & Mappings
POST twitter
{
"mappings": {
"tweet" : {
"properties" : {
"message" : {
"type" : "string",
"analyzer": "strip_html_analyzer"
},
"date" : {
"type" : "date"
},
"name" : {
"type" : "string"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"strip_html_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":"standard",
"char_filter":"my_html"
}
},
"char_filter": {
"my_html":{
"type":"html_strip"
}
}
}
}
}

//Index a document
PUT /twitter/tweet/1
{
"name" : "mike",
"date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch, This is an html test"
}

//query result for "html", I expect the query to return nothing
since it is supposed to strip the tag?
"hits": {
"total": 1,
"max_score": 0.11626227,
"hits": [
{
"_index": "twitter",
"_type": "tweet",
"_id": "1",
"_score": 0.11626227,
"fields": {
"message": [
"trying out Elasticsearch, This is an html test"
]
},
"highlight": {
"message": [
"trying out Elasticsearch, This is an html test"
]
}
}
]
}


Ivan · August 19, 2014, 12:24am

Sorry for not replying sooner; I was on vacation.

I would use the two-field solution, especially since you simply cannot
store a stripped version. The source field is compressed, so the additional
index size depends on the content. I have never used highlighting, so I
cannot recommend alternative approaches.

I use jsoup to strip HTML before the data reaches Elasticsearch. I am not
sure if it is the best option, but I have been using it for years.
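
A minimal sketch of the two-field approach, under stated assumptions. It uses Python's standard `html.parser` in place of jsoup, which is a Java library, so this is not the setup described above. The helper names `strip_html` and `build_document`, the field name `content`, and the sample markup are hypothetical. The mapping uses the 1.x-era `"type": "string"` syntax seen elsewhere in this thread, with `"index": "no"` on `html_content` as an assumption about how to keep that field unsearchable.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text between tags; entities are decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the text content of raw_html with all tags removed."""
    collector = _TextCollector()
    collector.feed(raw_html)
    collector.close()
    return "".join(collector.parts)


def build_document(raw_html):
    """Build a two-field document: stripped text to search, raw HTML to display."""
    return {
        "content": strip_html(raw_html),  # analyzed and searched
        "html_content": raw_html,         # kept only for presentation
    }


# Assumed mapping in the old "string" syntax; html_content is not indexed.
mapping = {
    "properties": {
        "content": {"type": "string"},
        "html_content": {"type": "string", "index": "no"},
    }
}

doc = build_document("<p>The <b>quick</b> &amp; brown fox</p>")
print(json.dumps(doc, indent=2))
```

With this layout, searches, excerpts, and highlighting can run against `content`, while the page is rendered from `html_content`.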

Cheers,

Ivan

On Wed, Aug 13, 2014 at 8:16 AM, IronMan2014 sabdalla80@gmail.com wrote:

Ivan,

A follow-up question. As I mentioned earlier, storing html and applying a
char_filter doesn't really work for me, especially since highlighted fields
come back with weird html display.
So I am thinking of stripping the html before indexing, so there is no html
in the index or the source, and adding an extra field like "html_content"
that is meant to store the html version and not be indexed.
Do you see any problems with my approach? One I can see is a bigger index
size. What do you recommend as an ideal solution? I am still confused, as I
thought this would be a common problem.

On Friday, August 8, 2014 8:16:09 PM UTC-4, IronMan wrote:

Thanks again. I wasn't expecting it to remove what's between the tags. I
believe I understand the behavior, and maybe it's a case where I was greedy
and expected Elasticsearch to do it all.
Here is the scenario I was looking for: assume I want to get an excerpt of
text (extracted text from a document). An Elasticsearch query will give me
the excerpt with html tags, but the tags are out of context, so I would
have liked to be able to display this excerpt with no html tags. I know I
can probably strip the tags after the fact, but that's what I was trying to
avoid. In other words, in a perfect world, I would have liked 2 versions
of the document, the original html one and another stripped one. When I
need to query things like excerpts, I would query the stripped one, and
when I need the html, I would query the source. Hopefully I didn't make
this more confusing.

On Friday, August 8, 2014 4:58:03 PM UTC-4, Ivan Brusic wrote:

The tokens that appear in the analyze API are the ones that are put into
the inverted index. When you search for one of the terms that is not an
HTML tag, there will be a match. What I don't understand, after reading
your original post in detail, is exactly what behavior you are expecting.

You indexed the phrase
trying out Elasticsearch, This is an html test
but you expected a query for the term "html" to not match. However, the
word "html" is clearly in the content. The html stripper will not remove
the content in between the tags, just the tags themselves. The analyze API
should show you the correct terms.
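
A rough Python approximation of that behavior, for illustration only. The regular expression is a stand-in for the real html_strip char filter, not its implementation, and the `<em>` tags in the sample are hypothetical, since the markup of the original message did not survive in this thread.

```python
import re

# Crude stand-in for a tag stripper: remove anything shaped like <...>.
TAG = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove the tags themselves; the text between them is kept."""
    return TAG.sub("", text)


# Hypothetical markup around the word "html".
sample = "trying out Elasticsearch, This is an <em>html</em> test"

stripped = strip_tags(sample)
terms = re.findall(r"\w+", stripped.lower())

print(stripped)
print(terms)
print("html" in terms)  # the word between the tags is still a term
print("em" in terms)    # the tag name is not
```

So a query for "html" matches because the word is part of the content, while a query for the tag name would not match.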

Lucene has more control over what information you can retrieve, but the
only way to get the analyzed token stream back from Elasticsearch is to use
the analyze API on the field. Most people do not want an analyzed token
stream, just the original field.

--
Ivan

On Fri, Aug 8, 2014 at 12:01 PM, IronMike sabda...@gmail.com wrote:

Also, here is a link to someone who had the same problem; I am not
sure if there was a final answer to that one:
http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip
I have to admit that I am a bit confused now about this topic. I
understand analyzers will tokenize the sentence and strip html in the case
of html_strip, and _analyze works fine using the analyzer. What I am
failing to understand is how I can get at the results of these tokens. Isn't
the whole idea to be able to search for those tokens eventually?

If not, what's the solution for what I would think is a common scenario:
having to index html documents, where html tags don't need to be indexed,
while keeping the original html for presentational purposes? Any ideas
(besides having to strip html tags manually before indexing)?


