Something puzzles me about how _analyze actually works.
Here's an example of a request I'm running:
GET /_analyze?char_filter=html_strip
{"I really can swim with my gill at the CafÃ! "}
What puzzles me about the output is that the ! character is missing from the result, even though ! is not an HTML character, so html_strip should have left it alone. What's more, I'm seeing separate tokens, which suggests a tokenizer is being applied.
My question: what are the default filters and tokenizer for the _analyze handler, and how can I turn them off?
Full output here:
{
  "tokens": [
    {
      "token": "i",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "really",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "can",
      "start_offset": 11,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "swim",
      "start_offset": 15,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "with",
      "start_offset": 20,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "my",
      "start_offset": 25,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "gill",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "at",
      "start_offset": 33,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "cafã",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
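To make the puzzle concrete: the output above looks like what a plain "split on non-word characters, then lowercase" pass would give, applied to the whole request body (the first offset is 2, so the leading {" seems to be counted as text). Below is a rough Python sketch of that behavior. It is only my approximation for comparison, not Elasticsearch's actual tokenizer; the function name approx_tokens is made up for illustration, and the "type" field is left out.

```python
import re


def approx_tokens(text):
    """Approximate the observed output: split on runs of word characters,
    lowercase each run, and record character offsets and 1-based positions.
    This is an illustrative stand-in, not Elasticsearch's real tokenizer."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append(
            {
                "token": match.group().lower(),
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            }
        )
    return tokens


if __name__ == "__main__":
    # The request body exactly as sent, including the surrounding {" and "}.
    body = '{"I really can swim with my gill at the Caf\u00c3! "}'
    for tok in approx_tokens(body):
        print(tok)
```

Under this approximation the ! is dropped simply because it is not a word character, which has nothing to do with html_strip. That is consistent with a tokenizer and a lowercase filter being applied in addition to the char filter I asked for.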