Removing '.' (DOT) character from field name using ES-Hadoop SerDe

(Gowtham Sadasivam) #1

I am inserting data from a Hive table into Elasticsearch using the ES-Hadoop SerDe. The data consists of multiple JSON files containing highly sparse fields, so I map every JSON root element with a type like "app map<string, string>". The problem is that a nested field name sometimes contains a '.' (DOT) character, which terminates the entire Hive job because Elasticsearch cannot accept a field name containing a dot.

For Example:

app {
  "": "efT3Fg5JnvJVs57IOnc"

^ Here the field name "" contains a '.' (DOT) character. The resulting error is:

Caused by: Found unrecoverable error [] returned Bad Request(400) - Field name [] cannot contain '.'; Bailing out..

I solved a similar problem with Logstash by using a piece of Ruby code to replace all DOT characters with underscores (as discussed here).

Is there any option or configuration in the ES-Hadoop SerDe to replace the DOT character with another character before pushing data into Elasticsearch, or to simply drop the fields whose names contain a DOT?


(Costin Leau) #2

No, not at the moment and it is unlikely there will be one.
The DOT restriction has affected a large number of folks and it is not a decision that was taken lightly. Do note that work is underway to improve the situation in ES 5.x as mentioned here.
Since this affects ES 2.x, the issue is how to convert the dot, and how to handle reading it back. It's not a clean situation, and it is one that ES-Hadoop tries to move away from, namely abstracting ES.
As a field name is used in various places, hiding the DOT would potentially work only for inserts; in scripts, for example, a user would still have to be aware of it, otherwise the field name will not be recognized.
Hence my reluctance to add some kind of 'translator' that removes the DOT.
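
For anyone who still wants the rename on the write side, it has to happen before the data reaches ES-Hadoop. Below is a minimal sketch of that idea in Python, not an ES-Hadoop feature: the function name `sanitize_keys`, its `replacement` and `drop` parameters, and the sample key `device.id` are all hypothetical and only for illustration. How it gets wired into the pipeline (for example, a preprocessing pass over the JSON files) is left as an assumption. The caveat above still applies: queries and scripts must use the renamed field.

```python
import json


def sanitize_keys(value, replacement="_", drop=False):
    """Recursively rename (or drop) dict keys that contain a '.'.

    Hypothetical helper: values are left untouched, only keys change.
    """
    if isinstance(value, dict):
        cleaned = {}
        for key, child in value.items():
            key = str(key)
            if "." in key:
                if drop:
                    # Skip the field entirely instead of renaming it.
                    continue
                key = key.replace(".", replacement)
            if key in cleaned:
                # e.g. "a.b" and "a_b" would both map to "a_b".
                raise ValueError(f"key collision after rename: {key!r}")
            cleaned[key] = sanitize_keys(child, replacement, drop)
        return cleaned
    if isinstance(value, list):
        return [sanitize_keys(item, replacement, drop) for item in value]
    return value


if __name__ == "__main__":
    # "device.id" is a made-up field name used only for this example.
    doc = json.loads('{"app": {"device.id": "efT3Fg5JnvJVs57IOnc"}}')
    print(json.dumps(sanitize_keys(doc)))
```

The collision check is deliberate: silently overwriting one field with another would be harder to notice than a failed job.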
