Insert hidden metadata in text?

A_Zed · December 4, 2012, 6:16pm

Is it possible to insert some hidden metadata inside the text, so that it
is not indexed, but is still retrieved in response to a query?
I would like to annotate the text I'm indexing with some tags which I don't
want indexed; upon retrieval, these tags would help in rendering the
output nicely. Suggestions?

--

Ivan · December 4, 2012, 6:28pm

You can simply mark the field as not indexed ("index" : "no") in your
mapping. If you are using the _all field (that is, not explicitly searching
against specific fields), then you would also have to exclude it from the
_all field ("include_in_all" : "no").

http://www.elasticsearch.org/guide/reference/mapping/core-types.html

The field will be returned as part of the source. If source is disabled,
then you would need to ask for that field to be returned as well.
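
A minimal sketch of such a mapping, in the 2012-era syntax this thread uses (newer Elasticsearch releases have changed it). The index name `notes`, the type `doc` and the field `render_tags` are made up for illustration, and `include_in_all` is written as the boolean `false` rather than "no".

```shell
ES="http://localhost:9200"   # assumed local node

# Hypothetical index "notes" with type "doc": "text" is searchable, while
# "render_tags" is kept out of the index and out of _all.
INDEX_BODY=$(cat <<'EOF'
{
  "mappings": {
    "doc": {
      "properties": {
        "text":        { "type": "string" },
        "render_tags": { "type": "string", "index": "no", "include_in_all": false }
      }
    }
  }
}
EOF
)

curl -s -XPUT "$ES/notes/" -d "$INDEX_BODY" \
  || echo "request failed; is Elasticsearch running at $ES?"
```

Hits should still carry `render_tags` in `_source`, so the renderer can read it even though no query can match on it.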

Cheers,

Ivan

--

A_Zed · December 4, 2012, 6:36pm

It's not a different field; it's some tags that I want to add to the text,
but I don't want the indexer to index those tags.
I guess I could just prepend some random string in front of the tags and
hope that a query never matches that string;
but I was hoping for a cleaner solution. :-/

--

Ivan · December 4, 2012, 6:53pm

Why not use a new field then? When you say "some tags that I want to add to
the text", that use case says "new field" to me.

--
Ivan

--

A_Zed · December 4, 2012, 6:56pm

I have a large text document. I want to add annotations to it (inside the
text) so I can render the matches in a more interesting manner. An example
would be: I want to add hyperlinks to some words in the document, but I
don't want ES to index those hyperlinks; and ES should return the snippet
with hyperlinks if there's a match.

--

Ivan · December 4, 2012, 7:25pm

Got it. Maintaining local, context-specific tags is more difficult. If you
were using Lucene, you could use Payloads, but Elasticsearch does not
expose them (not even sure how you can).

You can implement your own token/char filters that know how to parse your
text at index time so that only the relevant tokens make it to the index.
Take a look at the HTML Strip Char Filter for some guidance on how to
create one. Hopefully you have control over how your text is generated so
that you can create a scheme that can be controlled on the Elasticsearch
side with a filter.
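
To make the idea concrete, here is a rough shell stand-in for what such a filter chain does; it is not Elasticsearch code. The `index_terms` function is hypothetical: it drops anything that looks like a tag, lowercases, and splits on non-alphanumeric characters (no stop-word removal). The untouched string is what would stay in `_source`.

```shell
# Rough stand-in for a char filter plus tokenizer (illustration only).
index_terms() {
  printf '%s\n' "$1" \
    | sed -E 's/<[^>]*>/ /g' \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs '[:alnum:]' '\n' \
    | sed '/^$/d'
}

# Annotated text: the tag carries the rendering hint.
doc='Search is hard. Search <div id=blah> should be easy.</div>'

echo "stored as-is:  $doc"
echo "indexed terms: $(index_terms "$doc" | tr '\n' ' ')"
```

The annotation survives in the stored text, but `blah` never shows up among the terms, so a query for it would have nothing to match.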

Cheers,

Ivan

--

A_Zed · December 4, 2012, 8:48pm

(Google groups is not showing me Ivan's latest response, so apologies).

I can insert this metadata as HTML tags and use the HTML Strip Char filter.
But I can't seem to get the filter to work.
I created an index:
curl -XPUT 'http://localhost:9200/htmltest/?pretty=1' -d '
{
  "mappings" : {
    "member" : {
      "properties" : {
        "text" : {
          "type" : "string",
          "analyzer" : "htest"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "htest" : {
            "filter" : [
              "standard",
              "lowercase",
              "stop",
              "asciifolding"
            ],
            "char_filter" : [
              "html_strip"
            ],
            "tokenizer" : "standard"
          }
        }
      }
    }
  }
}'

Then added a couple of docs:
curl -XPUT 'http://localhost:9200/htmltest/post/1' -d '{"title": "dilbert",
"text": "Search is hard. Search <div id=blah> should be easy."}'
curl -XPUT 'http://localhost:9200/htmltest/post/2' -d '{"title": "dilbert",
"text": "Search is hard. Search should very, very be easy."}'

Now, if I search for "blah", I should not get anything, right? But it fails:
curl -XGET 'http://localhost:9200/htmltest/post/_search?pretty=1' -d \
'{"fields":["title"],"query":{"bool":{"should":[{"text":{"text":{"query":"blah"}}}]}},"highlight":{"fields":{"text":{"fragment_size":200}}}}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.095891505,
    "hits" : [ {
      "_index" : "htmltest",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.095891505,
      "fields" : {
        "title" : "dilbert"
      },
      "highlight" : {
        "text" : [ "Search is hard. Search <div id=blah> should very, very be easy." ]
      }
    } ]
  }
}

--

A_Zed · December 4, 2012, 8:54pm

Minor typo: The returned snippet should be:

  "highlight" : {
    "text" : [ "Search is hard. Search <div id=<em>blah</em>> should be easy." ]
  }

(I pasted incorrectly; been trying various things, so there was some
confusion)

--

Kay_Ropke · December 4, 2012, 9:29pm

Really the only thing I can think of is writing a custom analyzer which strips the tags and does whatever tokenizing/filtering etc. you need for searching.
Then in the source field you would still have the tags.

cheers,
-k

--

Ivan · December 4, 2012, 10:35pm

I agree with Kay, which is what I was trying to suggest. I don't think the
HTML strip filter is the answer, but that code should help you implement
your own analyzer.

However, if you can use the HTML strip filter for your needs, then you
should use it. I am all about abusing APIs in ways that are not standard!
You can encode information in HTML tags even if you are not using HTML.
Never used the filter myself, but you should use the Analysis API to see
what tokens are getting created.
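
A sketch of that check, assuming a local node, an index named `htmltest` and an analyzer named `htest` as in the commands posted earlier in this thread. It passes the text as a `text` query parameter and lets curl handle the URL encoding.

```shell
ES="http://localhost:9200"   # assumed local node
TEXT='Search is hard. Search <div id=blah> should be easy.</div>'

# -G turns the --data-urlencode pairs into a query string on a GET request.
curl -s -G "$ES/htmltest/_analyze" \
  --data-urlencode 'analyzer=htest' \
  --data-urlencode "text=$TEXT" \
  --data-urlencode 'pretty=1' \
  || echo "request failed; is Elasticsearch running at $ES?"
```

If no `blah` token comes back, the char filter is doing its job at analysis time. Note that this exercises the analyzer by name, not the mapping of any particular field.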

Cheers,

Ivan

--

A_Zed · December 4, 2012, 10:57pm

Ivan and Kay: I can use HTML tags. But for some reason can't get even the
HTML Strip Char filter working :-((
I'm very new to this, and still bumbling around, so please bear with me.

I created an index:
curl -XPUT 'http://localhost:9200/htmltest/?pretty=1' -d '{"mappings" :
{"member" : {"properties" : {"body" : {"type" : "string", "analyzer" :
"htest" } } } }, "settings" : {"index" : {"analysis" : {"analyzer" :
{"htest" : {"filter" : [ "standard", "lowercase", "stop", "asciifolding" ],
"char_filter" : [ "html_strip" ], "tokenizer" : "standard" }}}}}}'

Added 2 docs:
curl -XPUT 'http://localhost:9200/htmltest/post/1' -d '{"title": "dilbert",
"body": "Search is hard. Search <div id=blah> should be easy.</div>"}'
curl -XPUT 'http://localhost:9200/htmltest/post/2' -d '{"title": "dilbert",
"body": "Text is hard. Text should be easy."}'

Ideally: if I query for "blah", I should not get a hit. But I do! Where am
I going wrong?
curl -XGET "http://localhost:9200/htmltest/post/_search?q=body:blah&_pretty=1"

{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.095891505,"hits":[{"_index":"htmltest","_type":"post","_id":"1","_score":0.095891505,
"_source" : {"title": "dilbert", "body": "Search is hard. Search <div id=blah> should be easy.</div>"}}]}}

--

Ivan · December 5, 2012, 12:06am

That is bizarre. The analysis API shows that the analyzer is working
correctly, but when I inspect the index in Luke, the char_filter is not
getting applied.

http://localhost:9200/htmltest/_analyze?analyzer=htest&text=Search%20is%20hard.%20Search%20<div%20id%3Dblah>%20should%20be%20easy.<%2Fdiv>

The terms are lowercased and stop words removed, but the terms that the
html strip filter should have removed are still in the index.

--

A_Zed · December 5, 2012, 12:22am

I am happy to hear that I'm not the only one who is baffled.

--

A_Zed · December 5, 2012, 8:12pm

I must apologize for wasting everyone's time. It was a pure case of
programmer error.
The reason it was not working is that I had made a mistake in the
"mappings": I used a different doc type (cut-n-paste error). With the
right doc type in mappings, it works fine.
Sorry for the confusion. Mea culpa.
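
For anyone landing here later, a sketch of what the corrected index creation might look like. The thread does not show the fixed command, so this assumes the fix was to map the `post` type (the type the documents were indexed under) instead of `member`; everything else is copied from the earlier commands. With the mapping on the wrong type, the `body` field of `post` documents presumably fell back to the default analyzer, which would explain why the Analysis API looked fine while the index still contained `blah`.

```shell
ES="http://localhost:9200"   # assumed local node

# Same settings as before; only the mapping type changes ("post", assumed fix).
# An existing htmltest index would have to be deleted and re-created first.
INDEX_BODY=$(cat <<'EOF'
{
  "mappings": {
    "post": {
      "properties": {
        "body": { "type": "string", "analyzer": "htest" }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "htest": {
            "tokenizer": "standard",
            "char_filter": ["html_strip"],
            "filter": ["standard", "lowercase", "stop", "asciifolding"]
          }
        }
      }
    }
  }
}
EOF
)

curl -s -XPUT "$ES/htmltest/?pretty=1" -d "$INDEX_BODY" \
  || echo "request failed; is Elasticsearch running at $ES?"

# Check which type the mapping actually landed on.
curl -s -XGET "$ES/htmltest/_mapping?pretty=1" \
  || echo "request failed; is Elasticsearch running at $ES?"
```

Comparing the type in the `_mapping` output with the type in the document URLs (`/htmltest/post/1`) is a quick way to catch this kind of mismatch.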

--