Downloading a large amount of logs as CSV using Kibana/Eland

Hi, I am using Elastic Cloud and am trying to download a large amount of logs (over 800 million lines, spanning a few months) as CSV. However, when I try downloading from Discover > Share > Generate CSV, the export does not contain everything and the file size is always 10MB. Is there a limitation/cap in place that prevents exporting large amounts of logs from Kibana?

I tried converting the data to CSV using the Eland DataFrame to_csv() function, but every time I do so I get this error: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [id_num]'). So I am trying a different approach, but either way the large number of logs seems to cause an issue.

Please help.

Hi @xynobob,

There are a couple of alternatives for downloading large volumes of logs. I would take a look at this related thread to see if either of those approaches works for you.

Let us know how you get on!

Hi @carly.richmond ,

I went to read up on the thread you linked. With the first option of using Python, I am still encountering the search context error mentioned in my initial post. As for the second option, I don't think I can use Logstash since I am not running my own ELK stack, only Elastic Cloud itself. Is there any other, simpler way you can suggest to just download the 800 million rows of logs as CSV? The main issue I run into with either the Python (Eland) approach or Kibana seems to come down to the large volume of data. Any help would be appreciated, as I urgently need to extract these logs.

I tried contacting Elastic Support for help (as I am an Enterprise-tier customer) but was directed to ask technical questions on this forum instead, which is honestly disappointing. So I am really in need of any help, if possible.

Hi @xynobob,

Sorry to hear that Elastic Support were not much help. We're happy to do our best here!

There is a 10MB limit by default, which can be configured via the xpack.reporting.csv.maxSizeBytes setting in kibana.yml for your cluster, as per this thread. You'll see in the docs that we recommend exporting in smaller batches if you need more than 250MB, which you could do by splitting your requests across multiple timeframes.

For the first option of Python, are you using the Eland ML client or the Elasticsearch Python client? Your original message mentioned the Eland client, but the thread references the latter.
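For reference, the two clients look quite different in code; here is a rough illustration (the connection details are placeholders):

import eland as ed
from elasticsearch import Elasticsearch

# Placeholder connection details for an Elastic Cloud deployment
es = Elasticsearch(cloud_id="...", api_key="...")

ed_df = ed.DataFrame(es, es_index_pattern="logs*")   # Eland: pandas-like DataFrame backed by the index
raw_hits = es.search(index="logs*", size=10)         # Elasticsearch Python client: raw search API calls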

That aside, having a look at the docs I see a couple of other options you could try:

  1. Export via the Point in time API (see the first sketch below).
  2. SQL with the CSV response format (see the second sketch below).
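
For the first option, here is a rough sketch of how a point in time could be combined with search_after to page results out to a CSV. This assumes the 8.x Python client, and the index pattern, field names and sort field are placeholders you would need to adapt:

import csv
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="...", api_key="...")  # placeholder connection details

fields = ["@timestamp", "message"]  # placeholder field names

# Open a point in time so paging sees a consistent view of the indices
pit = es.open_point_in_time(index="logs*", keep_alive="5m")
pit_id = pit["id"]
search_after = None

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["_id"] + fields)
    while True:
        resp = es.search(
            size=10000,
            pit={"id": pit_id, "keep_alive": "5m"},
            sort=[{"@timestamp": "asc"}, {"_shard_doc": "asc"}],
            search_after=search_after,
        )
        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            src = hit["_source"]
            writer.writerow([hit["_id"]] + [src.get(field, "") for field in fields])
        # Resume the next page from the sort values of the last hit
        search_after = hits[-1]["sort"]
        pit_id = resp.get("pit_id", pit_id)

es.close_point_in_time(id=pit_id)

For the second option, the SQL API can return CSV directly; a minimal sketch only (the query is a placeholder, and at your volume you would still need to page with the returned cursor or batch by timeframe):

# Request the SQL result in CSV format and write the text response to disk
resp = es.sql.query(
    query='SELECT "@timestamp", message FROM "logs*"',
    format="csv",
)
with open("export_sql.csv", "w") as f:
    f.write(resp.body)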

Let us know if any of those options, potentially in batches, help you get the data you need.

Hi @carly.richmond ,

Thanks for the other recommendations.

I tried changing xpack.reporting.csv.maxSizeBytes to the cap, which is shown to be 2147483647, then tried to generate the CSV again and got this error: Encountered an error with the number of CSV rows generated from the search: expected 158012792, received 91500.

For the Python approach, I tried it and encountered this error: BadRequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [800000000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
Probably due to the large volume.

I looked into the Point in time API, but have no idea how to use it to get the .csv. However, I did try to use the Scan API in a Jupyter Notebook to write the data into a .csv, but after a few hours (and a 3GB .csv file) it showed this error - ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x00000191EDF72710>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions)) - and unfortunately the output in the csv is gibberish; I am not sure what went wrong.

The Scan API Python Code:

import csv
import time
import elasticsearch.helpers

# es is an existing Elasticsearch client instance
with open('scandocs.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)

    scan_start_time = time.time()
    # scan() wraps the scroll API and yields each hit lazily
    hits = elasticsearch.helpers.scan(es, index='logs*', query=None, scroll='20h', clear_scroll=True, size=5000)
    scan_duration = time.time() - scan_start_time
    print(scan_duration)

    loop_start_time = time.time()
    for hit in hits:
        scan_source_data = hit["_source"]
        scan_id = hit["_id"]
        # Write the document id followed by the field values
        # (joining the _source dict directly would emit the field names, not the values)
        writer.writerow([scan_id] + [str(v) for v in scan_source_data.values()])

loop_duration = time.time() - loop_start_time
print(loop_duration)


One important thing to note is that I need these logs for machine learning purposes, so as long as I can extract them as a .csv or as a pandas DataFrame it will suffice (extracting to an Eland DataFrame works, but it is very limited in its ML capabilities, hence the need for the data in pandas). But both approaches to extracting a .csv [via Kibana, or Python through to_csv()] or a pandas DataFrame [using to_pandas() on the Eland DataFrame] hit the same issue: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [id_num]'). I have been trying to find a solution to this for the past 2 weeks but have been unable to do so.

So I would appreciate any help from you or any of your colleagues who are familiar with this.

I encountered a similar issue recently with a Go snippet and found there was an issue with my query that wasn't being reported in the error message by the client. Have you checked your query is working in the Elasticsearch DevTools console?
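
Even something as small as this, via the Python client, can confirm the index pattern resolves and returns hits (a sanity check only; the index pattern is an assumption):

# Quick check that the index pattern matches documents at all
resp = es.search(index="logs*", size=1)
print(resp["hits"]["total"])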

Hi @carly.richmond

I believe there isn't any issue with the "query", because it was just a df.to_pandas() command (not too sure how to try that in DevTools). Below is what the code looks like in Jupyter -
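
Roughly, the cell amounts to nothing more than this (the index pattern and connection here are simplified stand-ins):

import eland as ed

# df is an Eland DataFrame backed by the logs indices
df = ed.DataFrame(es, es_index_pattern="logs*")

pandas_df = df.to_pandas()  # fails with the 'No search context found' error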

It's the same for to_csv(), where it is just df.to_csv().

Ok, thanks for sharing. I wonder if you should raise a GitHub issue for the Eland client to rule out a potential bug, but I think you would need additional error information to figure out what the issue is there.

So I think with the volume you're trying to export you have 2 options left:

  1. Try the scroll API as suggested by the above error, via Python scroll.
  2. Split your export into smaller batches and append each batch result to a CSV (see the sketch below).
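
For the second option, here is a rough sketch of what batching by timeframe could look like, reusing the scan helper you already have. The timestamp field, index pattern and date ranges are assumptions, and opening the file in append mode keeps adding rows from each batch:

import csv
import elasticsearch.helpers

# Hypothetical month-long batches; adjust to the range your logs actually cover
batches = [
    ("2023-01-01", "2023-02-01"),
    ("2023-02-01", "2023-03-01"),
    ("2023-03-01", "2023-04-01"),
]

with open("logs_batched.csv", "a", newline="") as f_out:
    writer = csv.writer(f_out)
    for start, end in batches:
        query = {"query": {"range": {"@timestamp": {"gte": start, "lt": end}}}}
        for hit in elasticsearch.helpers.scan(es, index="logs*", query=query, size=5000):
            src = hit["_source"]
            writer.writerow([hit["_id"]] + [str(v) for v in src.values()])

Each batch opens its own scroll, so if the connection drops you only lose the batch in progress rather than the whole export.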

Thanks for the suggestion. Will try to raise an issue on GitHub to see if it's a bug, despite the lack of error information.

I am looking into the scan API, which abstracts the scroll API, as I can't seem to find many resources online on writing scroll API results to CSV. Splitting my export will take way too long, as I have far too many logs.

Also, may I ask whether this is a normal use case in Elastic, for users to export millions to billions of rows of logs? It seems that the existing methods for extracting large volumes of logs are a little too complicated for those new to Elastic. I was hoping for a simpler and more straightforward method.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.