Ivan and Kay: I can use HTML tags. But for some reason I can't even get the
HTML Strip Char filter working :-((
I'm very new to this, and still bumbling around, so please bear with me.
I created an index:
curl -XPUT 'http://localhost:9200/htmltest/?pretty=1' -d '{"mappings" :
{"member" : {"properties" : {"body" : {"type" : "string", "analyzer" :
"htest" } } } }, "settings" : {"index" : {"analysis" : {"analyzer" :
{"htest" : {"filter" : [ "standard", "lowercase", "stop", "asciifolding" ],
"char_filter" : [ "html_strip" ], "tokenizer" : "standard" }}}}}}'
Added 2 docs:
curl -XPUT 'http://localhost:9200/htmltest/post/1' -d '{"title": "dilbert",
"body": "Search is hard. Search <div id=blah> should be easy."}'
curl -XPUT 'http://localhost:9200/htmltest/post/2' -d '{"title": "dilbert",
"body": "Text is hard. Text should be easy."}'
Ideally: if I query for "blah", I should not get a hit. But I do! Where am
I going wrong?
curl -XGET "http://localhost:9200/htmltest/post/_search?q=body:blah&_pretty=1"
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.095891505,"hits":[{"_index":"htmltest","_type":"post","_id":"1","_score":0.095891505,
"_source" : {"title": "dilbert", "body": "Search is hard. Search <div id=blah> should be easy."}}]}}
On Tuesday, December 4, 2012 2:35:37 PM UTC-8, Ivan Brusic wrote:
I agree with Kay, which is what I was trying to suggest. I don't think the
HTML strip filter is the answer, but that code should help you implement
your own analyzer.
However, if you can use the HTML strip filter for your needs, then you
should use it. I am all about abusing APIs in ways that are not standard!
You can encode information in HTML tags even if you are not using HTML.
Never used the filter myself, but you should use the Analysis API to see
what tokens are getting created.
Cheers,
Ivan
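A minimal sketch of the Analysis API check suggested above, written in Python rather than curl so the request can be built and inspected without a running node. It assumes the `_analyze` endpoint takes `analyzer` or `field` as query parameters and the text as the request body, as releases of this era did; the helper names `build_analyze_request` and `analyzed_tokens` are made up for illustration. One thing worth checking with it: the mapping in this thread is declared under type `member` while the documents are indexed as type `post`, so the `htest` analyzer may never be applied to the indexed field.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # same host and port as the curl commands in this thread


def build_analyze_request(index, text, analyzer=None, field=None):
    """Build (but do not send) an _analyze request for `text`.

    Assumption: `analyzer` and `field` are passed as query parameters and
    the text to analyze is sent as the raw request body.
    """
    params = {}
    if analyzer:
        params["analyzer"] = analyzer
    if field:
        params["field"] = field
    url = "%s/%s/_analyze?%s" % (ES_URL, index, urllib.parse.urlencode(params))
    return urllib.request.Request(url, data=text.encode("utf-8"), method="GET")


def analyzed_tokens(request):
    """Send the request and return only the token strings from the response."""
    with urllib.request.urlopen(request) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return [t["token"] for t in body.get("tokens", [])]
```

Against a live node, `analyzed_tokens(build_analyze_request("htmltest", "Search <div id=blah> should be easy.", analyzer="htest"))` should show whether `blah` survives the `htest` analyzer. Repeating the call with `field="body"` instead of `analyzer="htest"` would show what is applied to the field that was actually indexed.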
On Tue, Dec 4, 2012 at 1:29 PM, Kay Röpke <kro...@gmail.com> wrote:
Really the only thing I can think of is writing a custom analyzer which
strips the tags and does whatever tokenizing/filtering etc you need for
searching.
Then in the source field you would still have the tags.
cheers,
-k
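To make the idea above concrete, here is a rough Python sketch of what "strip the tags, then tokenize and filter" means for one of the sample documents. It is only an approximation for illustration, not Lucene's HTML strip char filter or standard tokenizer: the `strip_tags` and `analyze` helpers and the tiny stopword set are assumptions invented for this example. The point is that the tokens no longer contain anything from inside the tag, while the stored source string is untouched.

```python
import re
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text between tags; tag names and attributes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text):
    """Return `text` with all markup removed (char-filter stage)."""
    stripper = _TagStripper()
    stripper.feed(text)
    stripper.close()  # flushes any text buffered after the last tag
    return " ".join(stripper.parts)


# Small illustrative subset, not the real stop filter's word list.
STOPWORDS = {"is", "be", "the", "a", "an", "and", "or", "to"}


def analyze(text):
    """Strip tags, lowercase, split on non-word characters, drop stopwords."""
    words = re.findall(r"\w+", strip_tags(text).lower())
    return [w for w in words if w not in STOPWORDS]


source = "Search is hard. Search <div id=blah> should be easy."
tokens = analyze(source)
print(tokens)   # the searchable terms; "blah" is not among them
print(source)   # the stored source still carries the tag
```

With this kind of pipeline a query for `blah` has no term to match, yet the tag is still available when the source is returned.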
On Dec 4, 2012, at 9:54 PM, A Zed <exm...@gmail.com> wrote:
Minor typo: The returned snippet should be:
"highlight" : {
  "text" : [ "Search is hard. Search <div id=<em>blah</em>> should be easy." ]
}
(I pasted incorrectly; been trying various things, so there was some
confusion)
On Tuesday, December 4, 2012 12:48:59 PM UTC-8, A Zed wrote:
(Google groups is not showing me Ivan's latest response, so apologies).
I can insert this metadata as HTML tags and use the HTML Strip Char
filter.
But I can't seem to get the filter to work.
I created an index:
curl -XPUT 'http://localhost:9200/htmltest/?pretty=1' -d '
{
  "mappings" : {
    "member" : {
      "properties" : {
        "text" : {
          "type" : "string",
          "analyzer" : "htest"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "htest" : {
            "filter" : [
              "standard",
              "lowercase",
              "stop",
              "asciifolding"
            ],
            "char_filter" : [
              "html_strip"
            ],
            "tokenizer" : "standard"
          }
        }
      }
    }
  }
}'
Then added a couple of docs:
curl -XPUT 'http://localhost:9200/htmltest/post/1' -d '{"title": "dilbert", "text": "Search is hard. Search <div id=blah> should be easy."}'
curl -XPUT 'http://localhost:9200/htmltest/post/2' -d '{"title": "dilbert", "text": "Search is hard. Search should very, very be easy."}'
Now, if I search for "blah", I should not get anything, right? But it
fails:
curl -XGET 'http://localhost:9200/htmltest/post/_search?pretty=1' -d '{"fields":["title"],"query":{"bool":{"should":[{"text":{"text":{"query":"blah"}}}]}},"highlight":{"fields":{"text":{"fragment_size":200}}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.095891505,
    "hits" : [ {
      "_index" : "htmltest",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.095891505,
      "fields" : {
        "title" : "dilbert"
      },
      "highlight" : {
        "text" : [ "Search is hard. Search <div id=blah> should very, very be easy." ]
      }
    } ]
  }
}