Elasticsearch Spark MLlib connectivity


#1

Hello, I want to use the MLlib APIs on Elasticsearch data. I have read the data from Elasticsearch using the ES-Hadoop library.
The data is in JavaPairRDD<String, Map<String, Object>> format, but the MLlib APIs need it as JavaRDD<LabeledPoint>.
I managed to convert the data into an RDD<LabeledPoint>, but it seems that the converted data is not in the correct format. The conversion code is given below.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.regression.LabeledPoint;

import scala.Tuple2;

static class Transformer implements
		Function<Tuple2<String, Map<String, Object>>, LabeledPoint> {

	@Override
	public LabeledPoint call(Tuple2<String, Map<String, Object>> arg0) throws Exception {
		HashingTF tf = new HashingTF();
		Map<String, Object> map = arg0._2();
		// Collect the field values of the document; the keys are dropped.
		List<Object> valuesList = new ArrayList<Object>(map.values());
		// Every document gets the label 1.0; the values are hashed into a term-frequency vector.
		return new LabeledPoint(1d, tf.transform(valuesList));
	}
}
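
To make clear what I think happens to the values, here is a simplified plain-Java sketch of the hashing trick that HashingTF is based on: each value is treated as a "term", hashed to a bucket, and counted. HashingSketch and termFrequencies are names made up for this illustration, and the hash used here (Java's hashCode) is an assumption, not Spark's actual hash function. If this picture is right, a numeric field such as 30 only contributes a count of 1 to some bucket, not its magnitude.

```java
import java.util.List;
import java.util.Objects;

final class HashingSketch {
    // Simplified hashing-trick term frequency (illustration only, not Spark's code):
    // every element of terms is hashed into one of numFeatures buckets and the
    // count in that bucket is incremented by one.
    static double[] termFrequencies(List<?> terms, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (Object term : terms) {
            // Objects.hashCode returns 0 for null, so null values do not throw.
            int idx = Math.floorMod(Objects.hashCode(term), numFeatures);
            tf[idx] += 1.0;
        }
        return tf;
    }
}
```

With 8 buckets, the values 30 and 1000 each just add 1.0 to one bucket, so the resulting vector records which values occurred, not how large they were.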

Using the above function I managed to create a Random Forest model, but while loading the saved model I get the following exception:

org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

Also, all the Spark MLlib examples use LIBSVM files as input for machine learning, but in our case we are reading the data from Elasticsearch. So I have two questions:

  1. Is there any way to convert a JavaPairRDD to the LIBSVM format used by the MLlib APIs?
  2. Is there any problem with the above Transformer function?
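
For reference, the LIBSVM text format I mean has one record per line, written as "<label> <index>:<value> ...", with 1-based indices in ascending order and zero-valued features left out. Below is a minimal plain-Java sketch of turning one document map into such a line. LibSvmLine, toLine and the fieldOrder list are hypothetical names for illustration, and the sketch assumes the features are the numeric fields of the document taken in a fixed order. As far as I can tell, Spark's MLUtils.saveAsLibSVMFile writes this format from an RDD<LabeledPoint>, so a hand-written converter may not be needed at all.

```java
import java.util.List;
import java.util.Map;

final class LibSvmLine {
    // Builds one LIBSVM line: "<label> <index>:<value> ..." with 1-based indices.
    // fieldOrder (an assumed, fixed schema) decides which field maps to which index.
    static String toLine(double label, Map<String, Object> doc, List<String> fieldOrder) {
        StringBuilder sb = new StringBuilder();
        sb.append(label);
        for (int i = 0; i < fieldOrder.size(); i++) {
            Object v = doc.get(fieldOrder.get(i));
            if (!(v instanceof Number)) {
                continue; // skip missing or non-numeric fields
            }
            double d = ((Number) v).doubleValue();
            if (d == 0.0) {
                continue; // the format is sparse, zeros are omitted
            }
            sb.append(' ').append(i + 1).append(':').append(d);
        }
        return sb.toString();
    }
}
```

A non-numeric field such as a name keeps its index slot but produces no entry, so the indices of the remaining fields stay stable across documents.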
