List of characters that must be escaped?


(Matthew Allan) #1

Hello,

While indexing some user-provided content, I have come across some Unicode characters that cannot be indexed without escaping them first. For example:

DELETE /my_index

PUT /my_index
{
  "mappings": {
    "thing": {
      "properties": {
        "content": {
          "type": "text"
        }
      }
    }
  }
}

POST /my_index/thing/1
{
  "content": "hello 	  	 "
}

This results in the following error on Elasticsearch 5.6:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [content]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [content]",
    "caused_by": {
      "type": "json_parse_exception",
      "reason": "Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@6a67109; line: 2, column: 22]"
    }
  },
  "status": 400
}

The example above uses the Unicode character Information Separator Three. Is there a list anywhere of the characters that must be escaped before indexing? It doesn't seem to include the entire Unicode 'other, control character' category, as I can index U+0080 without issue.
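
For reference, the `json_parse_exception` above comes from the JSON parser rather than from the mapping. The JSON specification (RFC 8259) requires that the quotation mark, the backslash, and the control characters U+0000 through U+001F be escaped inside a string. U+0080 falls outside that range, which would be consistent with it indexing without issue. Below is a minimal sketch, assuming the request body is built in Python; `build_body` and `must_escape` are hypothetical helper names used only for illustration, and the sketch does not talk to Elasticsearch at all.

```python
import json


def must_escape(ch: str) -> bool:
    """Return True if RFC 8259 requires this character to be escaped in a JSON string."""
    return ch in ('"', '\\') or ord(ch) <= 0x1F


def build_body(content: str) -> str:
    """Serialize the document body; json.dumps escapes the required characters itself.

    ensure_ascii=False leaves other non-ASCII characters (such as U+0080) as-is.
    """
    return json.dumps({"content": content}, ensure_ascii=False)


# A tab (U+0009) and Information Separator Three (U+001D) both need escaping.
raw = "hello \t \x1d"
body = build_body(raw)
print(body)
```

Serializing the body with a JSON library, rather than concatenating strings by hand, avoids having to maintain the list of characters yourself.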