Serialization issue on arrays

Aurelien_3 · January 21, 2016, 10:01am

Hi all,

I'm trying to dump my ES data to Hadoop so that I can work on it without putting any more load on my cluster.

I wrote a simple MapReduce job to write the data to HDFS, but it fails on arrays. My ES documents contain array fields, and an array may sometimes be null or empty.

I get this error:
2016-01-20 18:33:22,679 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException
at org.apache.hadoop.io.ArrayWritable.write(ArrayWritable.java:105)
at org.elasticsearch.hadoop.mr.WritableArrayWritable.write(WritableArrayWritable.java:60)
at org.apache.hadoop.io.MapWritable.write(MapWritable.java:161)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:98)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:82)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1329)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:83)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:658)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)

This happens even though my mapper is a simple identity mapper.

Here is my configuration:
conf.set("es.read.metadata", "false");
conf.set("dfs.replication", "3");
conf.set("mapreduce.client.submit.file.replication", "3");
conf.set("es.nodes.data.only" , "false");
conf.set("es.nodes.discovery", "false");
conf.set("mapreduce.job.maps", "7");
conf.set("es.scroll.keepalive", "20m");
conf.set("es.mapping.include", "id, contributors, contributors_tags, url");
conf.set("es.field.read.empty.as.null", "true");

Do you have any clue about this?

Aurelien

costin · January 22, 2016, 8:25am

Looks like there's a bug in Hadoop with ArrayWritables that are empty - what version of ES are you using?

Aurelien_3 · January 22, 2016, 10:13am

I use ES 1.7.2 with the following libraries:

org.elasticsearch:elasticsearch:1.7.2

and

org.elasticsearch:elasticsearch-hadoop:2.1.2

I'm using the Hortonworks Hadoop distribution: 2.7.1.2.3.2.0-2950

Is it a version compatibility issue?

costin · January 22, 2016, 10:55am

It's not a compatibility issue, it's a bug in Hadoop: empty ArrayWritables, which are valid and can be constructed, cannot be serialized. I've pushed a fix for this in master; the related issue can be found here:

Aurelien_3 · February 8, 2016, 12:49pm

Actually, your fix does not seem to work for me. I get this error:

2016-02-08 12:04:34,101 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException
at org.apache.hadoop.io.ArrayWritable.write(ArrayWritable.java:105)
at org.elasticsearch.hadoop.mr.WritableArrayWritable.write(WritableArrayWritable.java:60)
at org.apache.hadoop.io.MapWritable.write(MapWritable.java:161)
at org.apache.hadoop.io.AbstractMapWritable.copy(AbstractMapWritable.java:115)
at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:55)

I'm using version 2.2.0 from the Maven repository.

Execution still ends up in this part of the code:

@Override
public void write(DataOutput out) throws IOException {
    out.writeInt(values.length); // write values
    for (int i = 0; i < values.length; i++) {
        values[i].write(out);
    }
}

I don't understand how to handle this. The only workaround I found was to clean up my ES index beforehand, but that is not easy to do and it is not reliable.

Aurelien_3 · February 8, 2016, 8:16pm

OK, I finally found the issue.

When a document has an array field whose array contains a "null" value, I get this error. Actually, this is a new error; it is the first time I have seen this one.
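
To make the failure mode concrete, here is a minimal, self-contained sketch of the loop in the write(...) method quoted earlier in the thread. It does not use Hadoop: the Writable interface and the Text class below are hypothetical stand-ins for the Hadoop types, and safeWrite is only an illustration of a guard, not the fix that went into ES-Hadoop. The guarded version writes a presence flag before each element, so it changes the byte layout and would need a matching reader.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class NullSafeArrayWrite {

    // Hypothetical stand-in for org.apache.hadoop.io.Writable (write side only).
    public interface Writable {
        void write(DataOutput out) throws IOException;
    }

    // Hypothetical stand-in for a text value: serialized as a UTF string.
    public static final class Text implements Writable {
        private final String value;

        public Text(String value) {
            this.value = value;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(value);
        }
    }

    // Same shape as the quoted write(...): throws NullPointerException
    // when the array itself is null or when any element is null.
    public static void unsafeWrite(Writable[] values, DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (int i = 0; i < values.length; i++) {
            values[i].write(out);
        }
    }

    // Guarded variant (assumption, not the library's format): a null array is
    // written as length 0, and each element is preceded by a presence flag.
    public static void safeWrite(Writable[] values, DataOutput out) throws IOException {
        if (values == null) {
            out.writeInt(0);
            return;
        }
        out.writeInt(values.length);
        for (Writable value : values) {
            out.writeBoolean(value != null);
            if (value != null) {
                value.write(out);
            }
        }
    }

    // Serializes into memory so the example needs no cluster or file system.
    public static byte[] serialize(Writable[] values, boolean safe) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (safe) {
                safeWrite(values, out);
            } else {
                unsafeWrite(values, out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // An array field with a null element, like the documents described above.
        Writable[] contributors = { new Text("alice"), null, new Text("bob") };
        try {
            serialize(contributors, false);
        } catch (NullPointerException e) {
            System.out.println("unguarded write failed with NullPointerException");
        }
        System.out.println("guarded write produced "
                + serialize(contributors, true).length + " bytes");
    }
}
```

In a real job the same idea would have to live in a mapper that replaces or drops null elements before calling context.write, since the serialization code inside Hadoop itself cannot be changed from the job.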

costin · February 21, 2016, 6:26pm

Can you indicate what causes the error in the first place? How do you get a null into the array and what is the actual exception?

You could raise an issue over at github to track the code fix as well.

Thanks,

Aurelien_3 · April 12, 2016, 10:36am

Could not reproduce bug!

Thanks for help.

Aurelien_3 · May 23, 2016, 3:39pm

Sorry, I haven't been able to identify the source of the actual issue. I've managed to work around it using input.json : true.

Using the JSON format directly turns out to be easier and avoids the serialization issue I encountered in Hadoop with ES.

Thanks
