I've configured a default custom analyzer as follows:
index :
  analysis :
    filter :
      snowball :
        type : snowball
        language : English
      wd_filter :
        type : word_delimiter
        generate_word_parts : true
        generate_number_parts : true
        catenate_words : true
        split_on_case_change : true
        preserve_original : true
        split_on_numerics : true
    analyzer :
      default :
        type : custom
        tokenizer : uax_url_email
        filter : [lowercase, snowball, wd_filter]
        char_filter : [html_strip]
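In case I transcribed the YAML wrong, here is what I believe is the equivalent configuration expressed as the JSON body of an index-creation call, using the same index name ('mgs') as below:

```shell
# Equivalent settings as JSON (untested transcription of the YAML above)
curl -XPUT 'http://localhost:9200/mgs' -d '{
  "index" : {
    "analysis" : {
      "filter" : {
        "snowball" : { "type" : "snowball", "language" : "English" },
        "wd_filter" : {
          "type" : "word_delimiter",
          "generate_word_parts" : true,
          "generate_number_parts" : true,
          "catenate_words" : true,
          "split_on_case_change" : true,
          "preserve_original" : true,
          "split_on_numerics" : true
        }
      },
      "analyzer" : {
        "default" : {
          "type" : "custom",
          "tokenizer" : "uax_url_email",
          "char_filter" : ["html_strip"],
          "filter" : ["lowercase", "snowball", "wd_filter"]
        }
      }
    }
  }
}'
```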
But when I index a document with a 'content' field and then examine the
stored field with:
curl -XGET 'http://localhost:9200/mgs/p2/385?fields=content&pretty'
I am still seeing all of the HTML tags in the text. Reading back the
_mapping for the "content" field, it is:
"content" : {
"store" : "yes",
"type" : "string"
},
'index' defaults to 'analyzed', so this appears correct.
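If it matters: I have not tried naming the analyzer explicitly on the field. If that turns out to be the fix, I assume it would look something like this (sketch only, same index/type/field names as above):

```shell
# Sketch: explicitly pinning the analyzer on the field (assumed syntax)
curl -XPUT 'http://localhost:9200/mgs/p2/_mapping' -d '{
  "p2" : {
    "properties" : {
      "content" : {
        "type" : "string",
        "store" : "yes",
        "analyzer" : "default"
      }
    }
  }
}'
```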
Further complicating things, if I run:
curl -XGET 'http://localhost:9200/mgs/_analyze' -d '<p>this <b>is</b>
a tests</p>'
Then I get:
{"tokens":[
  {"token":"this","start_offset":3,"end_offset":7,"type":"<ALPHANUM>","position":1},
  {"token":"is","start_offset":11,"end_offset":13,"type":"<ALPHANUM>","position":2},
  {"token":"a","start_offset":18,"end_offset":19,"type":"<ALPHANUM>","position":3},
  {"token":"test","start_offset":20,"end_offset":25,"type":"<ALPHANUM>","position":4}
]}
This appears correct: the default analyzer has stripped out the tags,
and the word 'tests' has been stemmed to 'test'.
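The offsets even line up with the markup. Assuming the body I sent was exactly `<p>this <b>is</b>` followed by `a tests</p>` on the next line, each reported start/end offset points back into the raw (unstripped) input, which I believe is how html_strip is supposed to behave — a quick bash check:

```shell
# Sanity check that the _analyze offsets index into the raw, unstripped
# input (assumes the request body was exactly this two-line string).
text='<p>this <b>is</b>
a tests</p>'

# bash ${var:offset:length} slices by character; the newline counts as one
for span in "3:4:this" "11:2:is" "18:1:a" "20:5:tests"; do
  IFS=: read -r off len word <<< "$span"
  if [ "${text:$off:$len}" = "$word" ]; then
    echo "$word: offsets $off-$((off + len)) match"
  else
    echo "$word: MISMATCH"
  fi
done
```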
So the analyzer seems to be working correctly, except when I actually
add a document. I am using Elastica to add documents, but I also tried
it directly:
curl -XPUT 'http://localhost:9200/mgs/p2/0' -d '{"content" : "<p>This is a tests</p>" }'
Then:
curl -XGET 'http://localhost:9200/mgs/p2/0?fields=content'
Gives:
{"_index":"mgs","_type":"p2","_id":"0","_version":2,"fields":
{"content":"<p>This is a tests</p>"}}
So adding the document does not seem to cause the analyzer to be run on
its content. Any ideas on what I am missing or doing wrong?
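In case it helps narrow things down, the only other check I could think of is searching for the stemmed form: if the analyzer did run at index time, a query for 'test' should match even though I indexed 'tests'. A sketch, with the same index/type names as above:

```shell
# If 'tests' was stemmed to 'test' at index time, this term query should hit
curl -XGET 'http://localhost:9200/mgs/p2/_search?pretty' -d '{
  "query" : { "term" : { "content" : "test" } }
}'
```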
Thanks
-Greg