I have a small 3-node Elasticsearch cluster (each VM has 8 cores, 28 GB RAM, 56 GB SSD). I am using Spark Streaming (DStreams) to stream data from an MQTT broker into ES. The index has 50M documents. When I use Spark SQL to create a DataFrame from this index and run a count on it, it takes about 1 hour to return the result. If I first write the DataFrame out as a Parquet file in HDFS, the count on that Parquet data returns in under a second. So I suspect there is some issue with the way I am using Elasticsearch for Hadoop (ES-Hadoop), but I am not sure how to resolve it.
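For reference, here is a minimal sketch of the read path I am describing; the node and index names (`es-node-1`, `sensor-data`) are placeholders, not the real ones:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder host and index names for illustration only.
val spark = SparkSession.builder()
  .appName("es-count-test")
  .config("es.nodes", "es-node-1")   // Elasticsearch node(s)
  .config("es.port", "9200")
  .getOrCreate()

// Read the index through the elasticsearch-hadoop connector.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("sensor-data")               // index name ("index/type" on older ES versions)

// This is the slow operation: the connector scrolls the documents out of ES
// into Spark rather than asking ES itself for a count.
println(df.count())
```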
How long does it take to run the count directly against ES? That test will show whether it's the connection between ES and Spark Streaming (DStreams) that is slow.
It takes less than a second. There is no issue writing to the index with DStreams; it is only reading the index back that is very slow. I even installed Spark on one of the ES nodes and ran the same query, and the response was still very slow.
Interesting, so reading / running the count directly against ES is fine, but reading through Spark takes a long time. That sounds to me like a Spark pagination or network problem. Does the slow read happen on only one index, or on all of your indices? See the sketch below for one way to check the pagination side.
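Continuing the earlier sketch (index name still a placeholder), two things worth checking are how many read tasks the connector creates and the scroll batch size:

```scala
// Each ES shard normally maps to one Spark partition, so a small shard
// count means few parallel read tasks for the full scan.
println(s"read partitions = ${df.rdd.getNumPartitions}")

// The scroll batch size ("es.scroll.size") is a real ES-Hadoop setting with
// a conservative default; the value below is just an example to try.
val biggerBatches = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.scroll.size", "10000")
  .load("sensor-data")
```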
I have only two indices in the cluster. The other index is small, with about 2M documents. The count on that one is also slow; it takes about 23 seconds.
I spoke with Elastic yesterday and learned that count(*), like any other form of aggregation, is not pushed down to ES by the Spark connector, so the count ends up scanning every document through Spark. They recommended using the Java REST API within the Spark code for aggregations.
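For anyone hitting the same thing, here is a rough sketch of that approach using the low-level Java REST client from Scala driver code; the host, index, and field names are placeholders:

```scala
import org.apache.http.HttpHost
import org.apache.http.util.EntityUtils
import org.elasticsearch.client.{Request, RestClient}

// Low-level Elasticsearch REST client; "es-node-1" and "sensor-data" are placeholders.
val client = RestClient.builder(new HttpHost("es-node-1", 9200, "http")).build()

try {
  // The count is computed by Elasticsearch itself, so only a small JSON
  // response crosses the wire instead of 50M documents.
  val countResponse = client.performRequest(new Request("GET", "/sensor-data/_count"))
  println(EntityUtils.toString(countResponse.getEntity))

  // Other aggregations go through a size-0 search; "temperature" is illustrative.
  val aggRequest = new Request("GET", "/sensor-data/_search")
  aggRequest.setJsonEntity(
    """{ "size": 0, "aggs": { "avg_temp": { "avg": { "field": "temperature" } } } }""")
  println(EntityUtils.toString(client.performRequest(aggRequest).getEntity))
} finally {
  client.close()
}
```

The key point is that the aggregation runs inside Elasticsearch, and Spark only receives the aggregated result.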