Can you help check this error, please?


(Kramer Li) #1

I'm trying to read data from Elasticsearch into Spark.

conf = {"es.resource": "sflow_*/sflow", "es.nodes": "ES01", "es.query": "some query"}

rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)

rdd.take(2)

After calling rdd.take(2), the process gets stuck and logs a warning like the one below:

16/03/14 20:52:07 WARN httpclient.SimpleHttpConnectionManager: SimpleHttpConnectionManager being used
incorrectly.  Be sure that HttpMethod.releaseConnection() is always called and that only one thread and/or 
method is using this connection manager at a time.

However, rdd.first() always returns a result successfully. Do you know why?
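
For reference, the read above can be sketched as two small helpers. The names build_es_conf and read_es_rdd are hypothetical (they are not part of PySpark or elasticsearch-hadoop); the class names and settings are the ones from the snippet above. It assumes a running SparkContext `sc` with the elasticsearch-hadoop jar on the classpath, and it only restates the setup; it does not address the warning.

```python
def build_es_conf(resource, nodes, query=None):
    """Build the string-to-string conf dict passed to newAPIHadoopRDD.

    Hypothetical helper: the keys are the es.* settings used in the post.
    """
    conf = {"es.resource": resource, "es.nodes": nodes}
    if query is not None:
        conf["es.query"] = query
    return conf


def read_es_rdd(sc, conf):
    """Create an RDD from Elasticsearch via EsInputFormat.

    Hypothetical helper: `sc` is assumed to be an existing SparkContext.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )
```

Usage would then look like rdd = read_es_rdd(sc, build_es_conf("sflow_*/sflow", "ES01", "some query")), followed by rdd.first() or rdd.take(2) to compare the two calls.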


(Costin Leau) #2

Looks like a bug. EsInputFormat, or any InputFormat for that matter, should only be used from a single thread, yet that does not seem to be the case here.
Are you using Python or Scala?


(Kramer Li) #3

Hi Costin,

I am using Python. So this is a bug, and not because I'm doing something the wrong way, right?

Thanks very much.


(Costin Leau) #4

Sorry, I don't know enough Python to be able to help. It's likely that the Hadoop Input/Output formats are being used incorrectly, but looking at your code, it doesn't seem that you are doing much, so maybe there's something else going on in Spark/Python.

Sorry...

