Bulkload performance issue

We are using Java to bulk load documents into Elasticsearch. We plan to import 10M documents, each almost 8 MB in size. At the moment we can only import about 400K documents per day (roughly 5 documents per second). Our ES infrastructure is 3 master nodes with 4G ES_JAVA_OPTS (heap size), plus 2 data nodes and 2 client nodes with 2G of memory. Whenever I try to increase the bulk-load speed, we run into heap-size (out-of-memory) issues. Any advice for improvement?
The disk I/O of the node is shown below. We set up the ES cluster on Kubernetes.
dd if=/dev/zero of=/data/tmp/test1.img bs=1G count=10 oflag=dsync
10737418240 bytes (11 GB) copied, 50.7528 s, 212 MB/s

dd if=/dev/zero of=/data/tmp/test2.img bs=512 count=100000 oflag=dsync
51200000 bytes (51 MB) copied, 336.107 s, 152 kB/s

    for (int x = 0; x < 200000; x++) {
        BulkRequest bulkRequest = new BulkRequest();
        for (int k = 0; k < 50; k++) {
            Order order = generateOrder();
            IndexRequest indexRequest = new IndexRequest("orderpot", "orderpot");
            Object esDataMap = objectToMap(order);
            String source = JSONObject.valueToString(esDataMap);
            indexRequest.source(source, XContentType.JSON);
            bulkRequest.add(indexRequest);
        }
        rhlclient.bulk(bulkRequest, RequestOptions.DEFAULT);
    }
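One way to keep heap usage bounded on the client side is to cap how many bulk payloads exist in memory at once (the High Level REST Client also ships a BulkProcessor helper for exactly this). Below is a minimal, hypothetical plain-Java sketch of that throttling idea, independent of the Elasticsearch client; the `sendBulk` callback is a stand-in for something like `rhlclient.bulk(...)`, and `ThrottledBulkLoader` is an illustrative name, not a library class:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Collects items into fixed-size batches and hands each batch to a worker,
// but never lets more than maxInFlight batches be pending at once. This
// bounds the memory held by bulk payloads waiting to be sent.
class ThrottledBulkLoader<T> {
    private final int batchSize;
    private final Semaphore inFlight;
    private final ExecutorService pool;
    private final Consumer<List<T>> sendBulk; // e.g. wraps rhlclient.bulk(...)
    private final List<T> buffer = new ArrayList<>();

    ThrottledBulkLoader(int batchSize, int maxInFlight, Consumer<List<T>> sendBulk) {
        this.batchSize = batchSize;
        this.inFlight = new Semaphore(maxInFlight);
        this.pool = Executors.newFixedThreadPool(maxInFlight);
        this.sendBulk = sendBulk;
    }

    void add(T item) throws InterruptedException {
        buffer.add(item);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() throws InterruptedException {
        if (buffer.isEmpty()) return;
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        inFlight.acquire(); // block the producer while too many batches are pending
        pool.submit(() -> {
            try {
                sendBulk.accept(batch);
            } finally {
                inFlight.release();
            }
        });
    }

    void close() throws InterruptedException {
        flush(); // send any trailing partial batch
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

Note also that with documents this large, a batch of 50 × 8 MB is roughly 400 MB per bulk request, far beyond the commonly recommended few-megabytes-to-low-tens-of-megabytes range; a much smaller batch size together with a small in-flight cap keeps both the client and the coordinating node's heap under control.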

Indexing performance will depend on hardware as well as the size and complexity of your documents, and 8 MB per document is massive. Why are they so large? How are you going to query these huge documents?

Disk speed can also play a part, but I am not sure to what extent that is the case here as I have never indexed massive documents like that.

Given the size and volume of your documents, I would not be surprised if you needed more heap on your data nodes.
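To put some numbers on the figures from the question, a quick back-of-the-envelope calculation (all values derived from the original post, nothing measured here):

```java
// Back-of-the-envelope from the figures in the question.
long totalDocs = 10_000_000L;        // 10M documents planned
long docBytes = 8L * 1024 * 1024;    // ~8 MB each
long docsPerDay = 400_000L;          // current observed rate

double docsPerSec = docsPerDay / 86_400.0;             // ~4.6 docs/s
double ingestMBps = docsPerSec * 8;                    // ~37 MB/s of raw JSON
double daysToFinish = (double) totalDocs / docsPerDay; // 25 days at this rate
double totalTB = totalDocs * (double) docBytes / 1e12; // ~84 TB of raw source

System.out.printf("%.1f docs/s, %.0f MB/s, %.0f days, %.0f TB total%n",
        docsPerSec, ingestMBps, daysToFinish, totalTB);
```

Two things stand out: ~37 MB/s of raw source is well under the 212 MB/s sequential figure from the dd test, so raw disk bandwidth is probably not the limit, but the 152 kB/s result for small dsync writes suggests fsync latency (which the translog depends on) could be, and ~84 TB of raw source is far more data than 2 data nodes of that size can reasonably hold.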

