Hello,
While indexing some user-provided content, I have come across some Unicode characters that cannot be indexed without escaping them first. For example:
DELETE /my_index
PUT /my_index
{
  "mappings": {
    "thing": {
      "properties": {
        "content": {
          "type": "text"
        }
      }
    }
  }
}
POST /my_index/thing/1
{
"content": "hello "
}
This results in the following error on 5.6:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [content]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [content]",
    "caused_by": {
      "type": "json_parse_exception",
      "reason": "Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@6a67109; line: 2, column: 22]"
    }
  },
  "status": 400
}
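The same kind of failure can be reproduced outside Elasticsearch with a strict JSON parser. Below is a minimal sketch using Python's standard json module. This is not the Jackson parser that Elasticsearch uses, so it only illustrates the general behaviour, and the variable names are made up for the example. It suggests that a raw control character inside a string value is rejected, while a body built by a JSON serializer escapes it and parses fine.

```python
import json

GS = "\u001d"  # Information Separator Three (Group Separator)

# Hand-built request body with the raw control character inside the string.
raw_body = '{"content": "hello ' + GS + '"}'

try:
    json.loads(raw_body)  # strict parsing is the default
    raw_accepted = True
except json.JSONDecodeError:
    raw_accepted = False

# Body produced by a serializer: the control character is written as \u001d.
escaped_body = json.dumps({"content": "hello " + GS})

# Parsing the escaped body gives back the original text, control character included.
round_tripped = json.loads(escaped_body)["content"]
```

Building the request body with a serializer like this, instead of concatenating strings by hand, should avoid the need to escape anything manually.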
The example above uses the Unicode character Information Separator Three (U+001D). Is there a list anywhere of which characters must be escaped before indexing? It doesn't seem to include the entire Unicode 'Other, control' (Cc) category, as I can index U+0080 without issue.
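For reference, here is a sketch of the rule as the JSON grammar in RFC 8259 states it: inside a string, the quotation mark, the backslash, and the control characters U+0000 through U+001F must be escaped. It assumes the restriction comes from the JSON parser alone and that Elasticsearch adds nothing on top, which I have not confirmed. The function name is hypothetical.

```python
def must_escape_in_json_string(ch: str) -> bool:
    """Return True if RFC 8259 requires ch to be escaped inside a JSON string.

    Assumption: only the JSON grammar matters here, not anything
    specific to Elasticsearch.
    """
    return ord(ch) <= 0x1F or ch in ('"', "\\")


# C0 controls (U+0000..U+001F) are covered by the rule...
needs_gs = must_escape_in_json_string("\u001d")   # Information Separator Three
needs_tab = must_escape_in_json_string("\t")      # code 9, as in the error message

# ...but DEL and the C1 controls (U+007F..U+009F) are not.
needs_del = must_escape_in_json_string("\u007f")
needs_c1 = must_escape_in_json_string("\u0080")
```

Under that assumption, U+0080 being accepted is expected: it belongs to the same Unicode category but sits outside the range the JSON grammar restricts.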