Having trouble executing isin query from Pyspark to elastic

Ankit_gupta1 · March 26, 2018, 11:08am

Hi,

I'm trying to implement a simple scenario to get data from Elasticsearch into my Spark program.

I have two indexes.
The first one is a search index. I use it to fetch message IDs based on a matching value in one attribute.

#Fetching messageIds from the search Index matching search term

q ="""{
"query": {
"match": {"column":"valueXXXX"}
}
}"""

reader = sqlContext.read.format("org.elasticsearch.spark.sql").option("es.nodes",connection_url).option("es.query", q)
s_df = reader.load("seachIndex1.0/search")
df_messageIds = s_df.select('message_id').distinct().rdd.map(lambda r: r[0].encode("utf-8")).take(5000)

This index contains only message IDs. My scenario is to use these message IDs to fetch the messages from the other index.

#Fetching messages from the messageIndex
q ="""{
"query": {
"match_all": {}
}
}"""

from pyspark.sql.functions import col

reader = sqlContext.read.format("org.elasticsearch.spark.sql").option("es.nodes",connection_url).option("es.query", q).option("pushdown","true")
m_df = reader.load("messageIndex1.0/message")
# isin() already returns a boolean column, so no comparison with True is needed
filtered_df = m_df.filter(col('system_id').isin(['513ed057-f40e-43f8-b675-8fb3886c7640', '68d79d86-a7e2-46de-9aaf-f89341e66fe4']))
filtered_df.show(5000)

The issue is that, in the current scenario, filtering the messages by ID takes too much time.

I tried joining the two data frames.
I tried using an isin clause.
I am also not able to send the list of message IDs to Elasticsearch as part of the query.

I also tried the pushdown feature, but it does not seem to work with the IN clause.
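
For reference, this is roughly what I mean by sending the list of IDs with the query: a minimal sketch that builds a "terms" query string from a Python list. The helper name build_terms_query is my own (hypothetical), and I am assuming that system_id is mapped as an exact-match (keyword / not analyzed) field, which I have not verified. I have not confirmed that this performs better in my setup.

```python
import json

def build_terms_query(field, values):
    """Return a query-DSL string matching documents whose `field` equals any of `values`."""
    # Values collected with .encode("utf-8") are bytes under Python 3;
    # json.dumps only accepts str, so decode them first.
    cleaned = [v.decode("utf-8") if isinstance(v, bytes) else v for v in values]
    return json.dumps({"query": {"terms": {field: cleaned}}})

ids = ['513ed057-f40e-43f8-b675-8fb3886c7640', '68d79d86-a7e2-46de-9aaf-f89341e66fe4']
q = build_terms_query("system_id", ids)
print(q)

# Intended usage with the reader from above (not run here):
# reader = sqlContext.read.format("org.elasticsearch.spark.sql") \
#     .option("es.nodes", connection_url).option("es.query", q)
# m_df = reader.load("messageIndex1.0/message")
```

The idea is that Elasticsearch would do the filtering itself instead of Spark reading every document through match_all and filtering afterwards.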

Please suggest how I can optimize this scenario.

Thanks

system · April 23, 2018, 11:08am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
