Hi @carly.richmond,
Thanks for the other recommendations.
I tried changing xpack.reporting.csv.maxSizeBytes to the cap (shown to be 2147483647), then generated the CSV again and hit this error: Encountered an error with the number of CSV rows generated from the search: expected 158012792, received 91500.
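For reference, this is the line I set in kibana.yml:
xpack.reporting.csv.maxSizeBytes: 2147483647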
On the Python side, I tried a direct search and encountered this error, probably due to the large volume: BadRequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [800000000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
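For context, the request that triggered it was essentially one plain search asking for everything at once:
# es is the Elasticsearch Python client connected to my cluster;
# this one-shot request is what blows past index.max_result_window
resp = es.search(index='logs*', size=800000000)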
I looked into the Point in Time API (see my rough sketch below), but I have no idea how to use it to get the .csv. I did, however, try the Scan API from a Jupyter Notebook to write the data into a .csv, but after a few hours (and a 3 GB .csv file) it failed with this error: ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x00000191EDF72710>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions))
Unfortunately, the output in the .csv is also gibberish, and I'm not sure what went wrong.
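For the Point in Time API, this untested sketch is my best guess at the flow from reading the docs (assuming the 8.x Python client and my logs* index pattern):
# open a PIT, then page through it with search_after
pit = es.open_point_in_time(index='logs*', keep_alive='5m')
pit_id = pit['id']
search_after = None
while True:
    resp = es.search(
        size=5000,
        pit={'id': pit_id, 'keep_alive': '5m'},
        sort=['_shard_doc'],        # tie-breaker sort required for paging
        search_after=search_after,  # None on the first page
    )
    hits = resp['hits']['hits']
    if not hits:
        break
    # ... write each hit['_source'] out as a CSV row here ...
    search_after = hits[-1]['sort']  # resume from the last hit
es.close_point_in_time(id=pit_id)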
The Scan API Python Code:
import time
import elasticsearch.helpers

# es is the Elasticsearch client created earlier in the notebook
with open('scandocs.csv', 'w') as f_out:
    # time the scan() call itself (it only sets up a generator)
    scan_start_time = time.time()
    hits = elasticsearch.helpers.scan(es, index='logs*', query=None,
                                      scroll='20h', clear_scroll=True, size=5000)
    scan_duration = time.time() - scan_start_time
    print(scan_duration)

    # time the loop that writes each hit out as one CSV line
    loop_start_time = time.time()
    for hit in hits:
        scan_source_data = hit["_source"]
        scan_id = hit["_id"]
        output_line = '{},'.format(scan_id)
        # NOTE: joining the _source dict joins its keys (field names), not the values
        output_line += ','.join(scan_source_data)
        f_out.write(output_line + '\n')
    loop_duration = time.time() - loop_start_time
    print(loop_duration)
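Looking at it again, I wonder whether the ','.join over the _source dict is the culprit, since iterating a dict yields only the field names, never the values. If so, maybe the stdlib csv module is what I should be using instead; a rough sketch of what I have in mind:
import csv
import elasticsearch.helpers

# es is the same client as above
hits = elasticsearch.helpers.scan(es, index='logs*', size=5000)
with open('scandocs.csv', 'w', newline='') as f_out:
    writer = None
    for hit in hits:
        row = {'_id': hit['_id'], **hit['_source']}
        if writer is None:
            # use the first document's fields as the CSV header
            writer = csv.DictWriter(f_out, fieldnames=list(row), extrasaction='ignore')
            writer.writeheader()
        writer.writerow(row)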
One important thing to note: I need these logs for machine learning purposes, so as long as I can extract them as a .csv or as a pandas DataFrame, that will suffice (extracting to an Eland DataFrame works, but Eland is very limited in its ML capabilities, hence the need for pandas). However, both routes I have tried, extracting a .csv (via Kibana, and via Python with to_csv()) and extracting a pandas DataFrame (using to_pandas() on the Eland DataFrame), hit the same issue: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [id_num]'). I have been trying to find a solution to this for the past two weeks without success.
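In other words, what I am ultimately after is something like this, loading the scan results straight into pandas (assuming all documents share the same fields):
import pandas as pd
import elasticsearch.helpers

# es is the same client as above; pandas can build the frame
# directly from a generator of _source dicts
docs = (hit['_source'] for hit in
        elasticsearch.helpers.scan(es, index='logs*', size=5000))
df = pd.DataFrame(docs)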
So I would appreciate any help from you or any of your colleagues who are familiar with this.