Elasticsearch - JavaEsSpark read is slow


(Sukumar) #1

I am using Spark 1.4.1 and trying to read from Elasticsearch. The read takes 1600 ms, but when I run the same query in Sense it takes just 3 ms. Can anyone help me improve the query performance?

JavaPairRDD<String, String> esResultPair = JavaEsSpark.esJsonRDD(ctx, "index/02", string);


(Thomas Decaux) #2

Why are you using Spark?

Latency in "big data" tooling is usually anywhere from one second to one minute, so I imagine this is quite normal and not related to Elasticsearch (running the job, scheduling resources, etc.).


(James Baiera) #3

Could you elaborate on what you're trying to read? Also could you elaborate on the query you are running in Sense?


(Sukumar) #4

I replaced JavaEsSpark with JestClient and it's working now. One request takes around 120 ms. Thanks.


(Sukumar) #5

I am running a job on Spark 1.4.1 and tried to replace the existing job with one that uses Spark and Elasticsearch. Writing to Elasticsearch with JavaEsSpark is really fast, but reading is not as fast as expected. Here is the simple query I ran in both Spark and Sense.

{
  "query" : {
    "bool" : {
      "must" : [ {
        "match" : {
          "full_name" : {
            "query" : "JACQUELINE"
          }
        }
      }, {
        "match" : {
          "full_name" : {
            "query" : "COLWILL"
          }
        }
      }, {
        "bool" : {
          "should" : [ {
            "match" : {
              "acct" : {
                "query" : 0
              }
            }
          }, {
            "match" : {
              "ids" : {
                "query" : ""
              }
            }
          } ]
        }
      } ]
    }
  }
}


(Thomas Decaux) #6

How did you replace it like that? I am very curious to see the source code and how you submit your Spark job.


(James Baiera) #7

Your query looks very specific, and thus will probably only retrieve a handful of results. It's important to note the difference between Sense and Spark. Sense is a GUI client for Elasticsearch queries. By default, Sense returns only the top ten results that match your query. It takes advantage of Elasticsearch's search features to do this incredibly fast (on the order of milliseconds). Spark is meant for heavy-duty data processing. EsSpark targets a different search type, one that streams all of the matching data out of Elasticsearch for analysis in Spark. This tends to be a heavier request mechanism (operating over the course of multiple seconds).

If you are using this same query in Spark for reading, Spark will end up wasting a lot of time standing up multiple tasks to read the data from Elasticsearch, only to get a few records and then go through a costly job teardown process. EsSpark is not meant to be a fast client for retrieving very few records. It is meant to be a connector for data processing at scale. A query this specific would probably be better served by a regular application client.
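
As a rough illustration of the "regular application client" route, here is a minimal sketch that sends a search straight to the Elasticsearch REST `_search` endpoint using the JDK's built-in HTTP client. Everything specific in it is an assumption rather than something taken from this thread: the class and method names (`DirectSearch`, `buildSearchRequest`, `search`) are hypothetical, the cluster address `http://localhost:9200` is a placeholder, Java 11 or later is assumed for `java.net.http`, and the query is a shortened version of the one posted above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DirectSearch {

    // Builds a POST to <baseUrl>/<resource>/_search with the query JSON as the body.
    // The resource is the same "index/type" string that was passed to JavaEsSpark.
    static HttpRequest buildSearchRequest(String baseUrl, String resource, String queryJson) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + resource + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(queryJson))
                .build();
    }

    // Sends the request and returns the raw JSON response body.
    static String search(HttpClient client, HttpRequest request) throws Exception {
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // "size" caps the number of hits returned, so only the top matches come back.
        String queryJson = "{\"size\": 10, \"query\": {\"bool\": {\"must\": ["
                + "{\"match\": {\"full_name\": \"JACQUELINE\"}},"
                + "{\"match\": {\"full_name\": \"COLWILL\"}}]}}}";

        // Assumed address; replace with the real cluster host and port.
        HttpRequest request = buildSearchRequest("http://localhost:9200", "index/02", queryJson);
        System.out.println(request.method() + " " + request.uri());

        // Uncomment once a cluster is reachable at the assumed address:
        // System.out.println(search(HttpClient.newHttpClient(), request));
    }
}
```

A single request like this involves no job submission, task scheduling, or teardown, which is why a small, specific lookup tends to come back much faster this way than through EsSpark.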


(Sukumar) #8

Thanks for the clarification. Let me try with a larger query on a large data set.

