I'm pulling data from Elasticsearch using the Python client's scroll API (via the scroll id) and appending it to a DataFrame as follows:
import pandas as pd
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
index_columns = ['a','b','c',...............]
message_body = {"size": 1000, "_source": index_columns, "query": {"match_all": {}}}
elastic_data = es.search(index="data", body=message_body, scroll='1m')
at_data = pd.DataFrame([a['_source'] for a in elastic_data['hits']['hits']])
sid = elastic_data['_scroll_id']
scroll_size = len(elastic_data['hits']['hits'])
while scroll_size > 0:
    elastic_data_rest = es.scroll(scroll_id=sid, scroll='1m')
    at_data_rest = pd.DataFrame([a['_source'] for a in elastic_data_rest['hits']['hits']])
    sid = elastic_data_rest['_scroll_id']
    scroll_size = len(elastic_data_rest['hits']['hits'])
    at_data = at_data.append(at_data_rest, ignore_index=True, sort=False)
The code above works, but it takes a long time on large datasets.
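Part of that time may come from the client-side loop rather than from Elasticsearch: `DataFrame.append` copies the whole frame on every page (and was removed in pandas 2.0). A minimal sketch of an alternative is to collect the `_source` dicts first and build the DataFrame once at the end. The helper name `iter_scroll_rows` is hypothetical, and `es` is assumed to be any client object exposing the same `search` and `scroll` calls used above.

```python
def iter_scroll_rows(es, index, body, scroll="1m"):
    """Yield the _source dict of every hit, one scroll page at a time."""
    page = es.search(index=index, body=body, scroll=scroll)
    # An empty page means the scroll is exhausted.
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit["_source"]
        page = es.scroll(scroll_id=page["_scroll_id"], scroll=scroll)
```

With this, `at_data = pd.DataFrame(list(iter_scroll_rows(es, "data", message_body)))` builds the frame in a single step. The client's `elasticsearch.helpers.scan` helper wraps a similar search/scroll loop and may be worth comparing.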
Would a sliced scroll combined with a pool help pull the data faster, or is there another way to speed this up?
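For reference, the sliced-scroll idea could look roughly like the sketch below: each worker thread runs its own scroll with a `slice` clause (`id` and `max`) in the request body, and the results are merged at the end. This is an assumption about how the pieces fit together, not a confirmed fix. `fetch_slice`, `fetch_all`, and `max_slices` are hypothetical names, `max_slices` must be at least 2, and whether it is actually faster will depend on the cluster (for example, on the number of shards).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_slice(es, index, slice_id, max_slices, columns, page_size=1000):
    """Scroll through one slice of the index and return its _source dicts."""
    body = {
        "size": page_size,
        "_source": columns,
        "query": {"match_all": {}},
        # Sliced scroll: each slice id covers a disjoint part of the hits.
        "slice": {"id": slice_id, "max": max_slices},
    }
    page = es.search(index=index, body=body, scroll="1m")
    rows = []
    while page["hits"]["hits"]:
        rows.extend(hit["_source"] for hit in page["hits"]["hits"])
        page = es.scroll(scroll_id=page["_scroll_id"], scroll="1m")
    # Release the server-side scroll context once the slice is exhausted.
    es.clear_scroll(scroll_id=page["_scroll_id"])
    return rows


def fetch_all(es, index, columns, max_slices=4):
    """Run one scroll per slice in a thread pool and merge the rows."""
    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        futures = [
            pool.submit(fetch_slice, es, index, i, max_slices, columns)
            for i in range(max_slices)
        ]
        rows = []
        for future in futures:
            rows.extend(future.result())
    return rows
```

Usage would be along the lines of `at_data = pd.DataFrame(fetch_all(es, "data", index_columns))`. Threads are used here because the work is mostly waiting on network I/O; row order across slices is not preserved.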
I went through this:
#817
and
https://www.codestudyblog.com/cnb2010/1006124017.html
and tried some of the suggestions, but had no luck.
How to use elasticsearch sliced scroll with multithreading in python? · Issue #1527 · elastic/elasticsearch-dsl-py (github.com)
Thanks