How to export a large data set and write it to CSV using Elasticsearch

We have an application that allows users to export records based on search/filter criteria. I am looking for a way to implement an "export all" feature that writes the results to a CSV file.

I ran a test using the _scroll API. The initial test was successful: we exported more than 80,000 records and wrote them to a CSV file.

However, when attempting to export more than 500,000 (5 lakh) records, the export failed. I'm also wondering whether this could affect our pricing, or whether we need to increase memory resources.

Here is my code for your review. Please let me know if any refactoring is needed.



// Function to export data to a CSV file in batches
function exportToCsvInBatches($data, $csvFilename, $batchSize)
{
    $handle = fopen($csvFilename, 'w');

    // Write the header row from the keys of the first record
    // (guard against an empty result set)
    if (!empty($data)) {
        fputcsv($handle, array_keys($data[0]));
    }

    // Write data in slices of $batchSize rows (note: since $data is already
    // fully in memory, this batching does not reduce peak memory usage)
    $batchCount = ceil(count($data) / $batchSize);
    for ($i = 0; $i < $batchCount; $i++) {
        $batch = array_slice($data, $i * $batchSize, $batchSize);
        foreach ($batch as $row) {
            fputcsv($handle, $row);
        }
    }

    fclose($handle);
}

// Elasticsearch index and query
$index = 'list';
$query = [
    'bool' => [
        'must' => [
            ['term' => ['user_id' => 2431]],
            ['match' => ['archive' => 'N']],
            ['match' => ['list_source' => 'LO Portal']],  
        ],
    ],
];

// Parameters for Scroll API
$pageSize = 1000; // Number of documents per page
$scrollTime = '10m'; // Scroll time

// Batch size for exporting to CSV
$batchSize = 10000; // Adjust based on your requirements
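// Note: $batchSize only chunks the CSV writes; how many documents are
// fetched per request from Elasticsearch is controlled by $pageSize above.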

try {
    // Initial search request using the Scroll API
    $response = executeScrollSearch($index, $query, $pageSize, $scrollTime, $client);

    // Array to store all data
    $allData = [];
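    // Note: every page of hits is appended to this array, so peak memory
    // usage grows with the total result set, not with $pageSize. See the
    // streaming sketch after the code for an alternative.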

    // Process the initial search results
    foreach ($response['hits']['hits'] as $hit) {
        $allData[] = $hit['_source'];
    }

    // Keep the scroll ID from the initial response so clearScroll works
    // even if the first page is already empty
    $scrollId = $response['_scroll_id'];

    // Continue scrolling until a page comes back with no hits
    while (isset($response['hits']['hits']) && count($response['hits']['hits']) > 0) {
        $scrollId = $response['_scroll_id'];

        // Perform the scroll request to get the next page
        $response = $client->scroll([
            'scroll_id' => $scrollId,
            'scroll' => $scrollTime,
        ]);

        // Process the next page of results
        foreach ($response['hits']['hits'] as $hit) {
            $allData[] = $hit['_source'];
        }
    }

    // Clear the scroll to release resources
    $client->clearScroll(['scroll_id' => $scrollId]);

    // Define the CSV filename
    $csvFilename = 'exported_data.csv';

    // Export all data to the CSV file in batches
    exportToCsvInBatches($allData, $csvFilename, $batchSize);

    echo 'Data exported to ' . $csvFilename . PHP_EOL;
} catch (BadRequest400Exception $e) {
    // Handle Elasticsearch-specific errors
    // (requires: use Elasticsearch\Common\Exceptions\BadRequest400Exception;)
    echo 'Elasticsearch Error: ' . $e->getMessage() . PHP_EOL;
} catch (Exception $e) {
    // Handle general PHP exceptions
    echo 'Error: ' . $e->getMessage() . PHP_EOL;
}
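
One refactoring I am considering is to stream each scroll page straight into the CSV file instead of collecting everything in $allData first, so peak memory stays at roughly one page of results. Here is a rough sketch (untested at scale); exportScrollToCsv is just a name I made up, and it assumes the same $client, $index, $query, $pageSize, and $scrollTime as above:

// A streaming variant: write each scroll page to the CSV as it arrives
function exportScrollToCsv($client, $index, $query, $csvFilename, $pageSize, $scrollTime)
{
    $handle = fopen($csvFilename, 'w');
    $headerWritten = false;

    // Initial search request that opens the scroll context
    $response = $client->search([
        'index' => $index,
        'scroll' => $scrollTime,
        'size' => $pageSize,
        'body' => ['query' => $query],
    ]);
    $scrollId = $response['_scroll_id'];

    while (!empty($response['hits']['hits'])) {
        foreach ($response['hits']['hits'] as $hit) {
            $row = $hit['_source'];
            // Write the header once, from the first document's keys
            if (!$headerWritten) {
                fputcsv($handle, array_keys($row));
                $headerWritten = true;
            }
            fputcsv($handle, $row);
        }

        // Fetch the next page; each page replaces the previous one in memory
        $response = $client->scroll([
            'scroll_id' => $scrollId,
            'scroll' => $scrollTime,
        ]);
        $scrollId = $response['_scroll_id'];
    }

    $client->clearScroll(['scroll_id' => $scrollId]);
    fclose($handle);
}

// Usage with the same parameters as the original code
exportScrollToCsv($client, $index, $query, 'exported_data.csv', $pageSize, $scrollTime);

One caveat (which my original code shares): this assumes every document's _source has the same fields in the same order. Also, from what I have read, for Elasticsearch 7.10 and later the documentation recommends search_after with a point-in-time (PIT) over the scroll API for deep pagination, so that may be worth trying as well.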
