BulkProcessor performance problem

Hi all,

I'm using Elasticsearch for my thesis to develop a new index pruning algorithm and to benchmark different index pruning methodologies.

Hardware:
I'm using Elasticsearch 6.0 on a MacBook Pro (2.6 GHz Intel Core i7 CPU, 16 GB 2133 MHz LPDDR3), and for data storage I'm using an external HDD (Seagate Backup Plus Slim External Hard Drive, 2 TB, 2.5", USB 3.0, Black), since according to my calculations I need 1.5 TB of disk space.

Config:
All I changed is path.data: /Volumes/THESIS/elasticsearch/

jvm.options:

-Xms8g
-Xmx8g

Index Definitions:
I have 15 different indices; here is one example definition (the others are very similar).

PUT term_stop_words
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_stopwords": {
            "type": "english",
            "tokenizer": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "document": {
      "_source": {
        "excludes": [
          "url",
          "content",
          "title"
        ]
      },
      "properties": {
        "content": {
          "type": "text"
        },
        "url": {
          "type": "text"
        },
        "title": {
          "type": "text"
        },
        "id": {
          "type": "long"
        }
      }
    }
  }
}

// and for each index, I set the refresh interval to -1

PUT term_stop_words/_settings
{
    "index" : {
        "refresh_interval" : "-1"
    }
}
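
(For completeness: once bulk indexing is finished, the plan is to set the refresh interval back to its default of 1s. With the transport client that would look roughly like this; untested sketch:)

    // Hypothetical: restore the default refresh interval after bulk loading,
    // so newly indexed documents become searchable again.
    client.admin().indices().prepareUpdateSettings("term_stop_words")
            .setSettings(Settings.builder().put("index.refresh_interval", "1s"))
            .get();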

Java Class:
I'm using the class below for indexing:

public class Document {
   // getters & setters
   private long id;
   private String url;
   private String content;
   private String title;
}

BulkProcessor Setup:

BulkProcessor bulkProcessor = BulkProcessor.builder(
                client,
                new BulkProcessor.Listener() {
                    @Override
                    public void beforeBulk(long executionId,
                                           BulkRequest request) {  }

                    @Override
                    public void afterBulk(long executionId,
                                          BulkRequest request,
                                          BulkResponse response) {
                        if (response.hasFailures()) {
                            logger.error(response.buildFailureMessage());
                        }
                    }

                    @Override
                    public void afterBulk(long executionId,
                                          BulkRequest request,
                                          Throwable failure) {
                        logger.info("Flushed (bytes)" + request.estimatedSizeInBytes() + " for size "
                                + request.numberOfActions());
                    }
                })
                .setBulkActions(500)
                .setConcurrentRequests(4)
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
                .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
                .build();

Feeding BulkProcessor:

for (Document document : DocArrayWithSize15k) {
   IndexRequest indexRequest = new IndexRequest("stop_words", "document");
   indexRequest.source(gson.toJson(document), XContentType.JSON);
   indexRequest.id(MD5Util.toMD5(document.getUrl()));
   bulkProcessor.add(indexRequest);
}
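
After the last document is added, the processor still has to be flushed and closed explicitly, otherwise the final partial batch just stays buffered. A minimal sketch (error handling omitted):

    // Push out any remaining buffered actions and wait for in-flight
    // bulk requests to complete before shutting down.
    bulkProcessor.flush();
    bulkProcessor.awaitClose(10, TimeUnit.MINUTES); // throws InterruptedException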

After running with this setup and indexing about 100k documents per index, bulkProcessor.add starts taking longer and performance slows down dramatically. At first this loop takes 3 seconds, but over time it grows to 20 seconds, and then to 200 seconds once each index holds 500k documents.

I checked the logs and couldn't see any exception, but they contain a lot of lines like [2017-12-03T19:23:17,362][INFO ][o.e.m.j.JvmGcMonitorService] [quKHxho] [gc][2676] overhead, spent [267ms] collecting in the last [1s].

I think there is something really wrong with my configuration; otherwise, taking 200 seconds to bulk-index 15k documents shouldn't be possible, should it?

PS: The average document size is 10 KB.

10 KB per document is somewhat "big".
A few things to try (see the sketch after this list):

  • Start with .setConcurrentRequests(1).
  • Reduce the batch size: .setBulkActions(100).
  • Remove the .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) call.
  • Increase the backoff a bit: BackoffPolicy.exponentialBackoff(TimeValue.timeValueSeconds(1), 3).
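
Roughly, the builder would then look like this (untested sketch; listener stands for the same BulkProcessor.Listener as in your original setup):

    // Sketch of the suggestions above: one concurrent request, smaller bulks,
    // no explicit byte limit, and a slightly longer initial backoff.
    BulkProcessor bulkProcessor = BulkProcessor.builder(client, listener)
            .setBulkActions(100)
            .setConcurrentRequests(1)
            .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueSeconds(1), 3))
            .build();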

But one of the problems I can think of is:

for data storage I'm using external HDD (Seagate Backup Plus Slim External Hard Drive 2TB 2.5 "USB 3.0 Black)

Could you try with the local SSD drive first?

David, thanks a lot for the suggestions. I think you are right: the bottleneck is definitely the external spinning disk. I tried the same code using the internal SSD and got results about 100 times faster. So it makes sense to build the indices in smaller parts on the local disk and then move them to the external disk.
