Serialization issue on arrays


(Aurélien-3) #1

Hi all,

I'm trying to dump my ES data to Hadoop so that I can work on it without putting load on my cluster.

I wrote a simple MapReduce job to write the data to HDFS, but it fails on arrays. My ES data contains arrays, and an array may sometimes be null or empty.

I get this error:
2016-01-20 18:33:22,679 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException
at org.apache.hadoop.io.ArrayWritable.write(ArrayWritable.java:105)
at org.elasticsearch.hadoop.mr.WritableArrayWritable.write(WritableArrayWritable.java:60)
at org.apache.hadoop.io.MapWritable.write(MapWritable.java:161)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:98)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:82)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1329)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:83)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:658)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)

My mapper is a simple identity mapper.

And my configuration:
conf.set("es.read.metadata", "false");
conf.set("dfs.replication", "3");
conf.set("mapreduce.client.submit.file.replication", "3");
conf.set("es.nodes.data.only", "false");
conf.set("es.nodes.discovery", "false");
conf.set("mapreduce.job.maps", "7");
conf.set("es.scroll.keepalive", "20m");
conf.set("es.mapping.include", "id, contributors, contributors_tags, url");
conf.set("es.field.read.empty.as.null", "true");

Do you have any clue about this?

Aurelien


(Costin Leau) #2

Looks like there's a bug in Hadoop with ArrayWritables that are empty - what version of ES are you using?


(Aurélien-3) #3

I use ES 1.7.2 with these libraries:

org.elasticsearch:elasticsearch:1.7.2

and

org.elasticsearch:elasticsearch-hadoop:2.1.2

I'm using the Hortonworks Hadoop distribution, version 2.7.1.2.3.2.0-2950.

Is this a version compatibility issue?


(Costin Leau) #4

It's not a compatibility issue, it's a bug in Hadoop (empty ArrayWritables, which are valid and can be constructed, cannot be serialized). I've pushed a fix for this in master; the related issue can be found here:


(Aurélien-3) #5

Actually, your fix does not seem to work for me. I still get this error:

2016-02-08 12:04:34,101 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException
at org.apache.hadoop.io.ArrayWritable.write(ArrayWritable.java:105)
at org.elasticsearch.hadoop.mr.WritableArrayWritable.write(WritableArrayWritable.java:60)
at org.apache.hadoop.io.MapWritable.write(MapWritable.java:161)
at org.apache.hadoop.io.AbstractMapWritable.copy(AbstractMapWritable.java:115)
at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:55)

I'm using version 2.2.0 from the Maven repository.

It still reaches this part:

@Override
public void write(DataOutput out) throws IOException {
    out.writeInt(values.length); // write values
    for (int i = 0; i < values.length; i++) {
        values[i].write(out);
    }
}

I don't understand how to handle this. The only way I found was to clean my ES index beforehand, but that is not easy to do and it is not reliable.
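
For reference, here is a minimal sketch of how a loop shaped like the one quoted above can throw a NullPointerException. It uses plain java.io classes rather than Hadoop's, so Item, writeAll, failsWithNpe and NullElementRepro are hypothetical stand-ins; only the loop structure is taken from the snippet. It does not establish which of the two cases occurred in this job.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class NullElementRepro {
    // Hypothetical stand-in for a Hadoop Writable: anything that can write itself.
    interface Item {
        void write(DataOutput out) throws IOException;
    }

    // Same shape as the quoted loop: the length first, then each element.
    static void writeAll(Item[] values, DataOutput out) throws IOException {
        out.writeInt(values.length);      // NPE here if the array itself is null
        for (int i = 0; i < values.length; i++) {
            values[i].write(out);         // NPE here if one element is null
        }
    }

    // Returns true when serializing the array throws a NullPointerException.
    static boolean failsWithNpe(Item[] values) {
        try {
            writeAll(values, new DataOutputStream(new ByteArrayOutputStream()));
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Item text = out -> out.writeUTF("alice");
        System.out.println(failsWithNpe(new Item[] { text }));       // prints false
        System.out.println(failsWithNpe(new Item[] { text, null })); // prints true
        System.out.println(failsWithNpe(null));                      // prints true
    }
}
```

An empty but non-null array goes through this loop without error, so in this sketch only a null array reference or a null element triggers the exception.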


(Aurélien-3) #6

OK, I finally found the issue.

I get this error when a document has an array field that contains a null value. Actually, this is a new error; it is the first time I have seen this one.
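
One possible workaround, short of cleaning the index, would be to strip null elements from each array in the mapper before the record is written. The sketch below shows only that filtering step on a plain Java array; ArrayCleaner and dropNulls are hypothetical names, the sample values are made up, and wiring this into a real mapper that rebuilds the Hadoop writables is left out. Note that it discards the nulls, which may or may not be acceptable for the data.

```java
import java.util.ArrayList;
import java.util.List;

class ArrayCleaner {
    // Returns the non-null elements in their original order.
    // A null array is treated like an empty one.
    static <T> List<T> dropNulls(T[] values) {
        List<T> kept = new ArrayList<>();
        if (values == null) {
            return kept;
        }
        for (T value : values) {
            if (value != null) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample for a field such as "contributors".
        String[] contributors = { "alice", null, "bob" };
        System.out.println(dropNulls(contributors)); // prints [alice, bob]
    }
}
```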


(Costin Leau) #7

Can you indicate what causes the error in the first place? How do you get a null into the array and what is the actual exception?

You could raise an issue over at GitHub to track the code fix as well.

Thanks,


(Aurélien-3) #8

Could not reproduce bug!

Thanks for help.


(Aurélien-3) #9

Sorry, I haven't been able to identify the source of the actual issue. I managed to work around it using input.json : true.

Using the JSON format directly turned out to be easier, and it avoids the serialization issue I encountered in Hadoop with ES.

Thanks

