Writing to Elasticsearch from HDFS using Map/Reduce

Mahla · May 25, 2016, 12:07am

I'm trying to write some files stored on HDFS to Elasticsearch using Hadoop MapReduce. I have one mapper and no reducers, and the files are in JSON format.

When I run my code, 800 reducers start running, and when they reach 84% the job fails with this error:
Error: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried ...

However, when I use "conf.setNumReduceTasks(0)" in my Java code, the map phase does not proceed at all; it gets stuck at 0% with this error:
Error: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [...] returned Internal Server Error(500) - [RemoteTransportException[[...][inet[]][indices:data/write/bulk[s]]]; nested: NullPointerException; ]; Bailing out..

I used the following settings for my JobConf:

    conf.addResource(...);
    conf.setMapperClass(Map.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setSpeculativeExecution(false);
    conf.setNumMapTasks(1);
    conf.set("es.nodes", ES_NODES);
    conf.set("es.resource.write", "...");
    conf.set("mapred.output.compress", "true");
    //conf.setNumReduceTasks(0);

    //es
    conf.set("es.input.json", "yes");
    conf.set("es.write.operation", "index");
    conf.set("es.index.auto.create", "yes");
    conf.set("es.field.read.validate.presence", "warn");
    conf.set("es.batch.write.retry.count", "10");
    conf.setOutputFormat(EsOutputFormat.class);

Can somebody please tell me what other things I should set for my configuration to avoid these errors?

Thanks!

costin · June 1, 2016, 11:33am

The RemoteTransportException followed by the NPE stands out; it looks like a bug inside ES itself. Can you please provide more information about that message/log and the version of ES used?

As for the initial problem, it is highly likely that you are overloading ES, and the job eventually fails because the nodes become unresponsive or because there are too many retries.
You could try either limiting the number of reducers or using something like CombineTextInputFormat to combine the inputs/splits in Hadoop, which reduces the number of mappers (and with them the resulting number of reducers) in your job.
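
As a rough sketch of that suggestion, the properties involved could be collected like this. The class name EsThrottleSettings and all values are hypothetical and untuned; the property keys are the Hadoop split-size setting and the ES-Hadoop batch settings as I understand them, so please check them against the documentation for your versions. The block uses only the JDK so it can be run on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Hadoop or ES-Hadoop): gathers the
// load-related properties in one place so they can be copied onto a JobConf.
final class EsThrottleSettings {

    // maxSplitBytes caps the size of each combined input split; larger
    // splits mean fewer tasks writing to ES at the same time.
    static Map<String, String> build(long maxSplitBytes) {
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        // Upper bound (in bytes) for splits produced by a combining input format.
        settings.put("mapreduce.input.fileinputformat.split.maxsize",
                Long.toString(maxSplitBytes));
        // Smaller bulk requests per task (illustrative values, not recommendations).
        settings.put("es.batch.size.entries", "500");
        settings.put("es.batch.size.bytes", "1mb");
        // Retry rejected bulk requests, waiting longer between attempts.
        settings.put("es.batch.write.retry.count", "10");
        settings.put("es.batch.write.retry.wait", "30s");
        return settings;
    }

    public static void main(String[] args) {
        Map<String, String> settings = build(256L * 1024 * 1024);
        // In the job driver each entry would be applied with conf.set(key, value).
        settings.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In the driver, each entry would be applied with conf.set(key, value), and the input format switched with conf.setInputFormat(CombineTextInputFormat.class) (for a JobConf, the class in org.apache.hadoop.mapred.lib). Whether this is enough depends on the cluster size and on how many tasks run concurrently.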
