Hi all,
I'm using Elasticsearch for my thesis: I'm developing a new index pruning algorithm and benchmarking it against other index pruning methodologies.
Hardware:
I'm using Elasticsearch 6.0 on a MacBook Pro (2.6 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3). For data storage I'm using an external HDD (Seagate Backup Plus Slim 2TB, 2.5" USB 3.0, Black), since according to my calculations I need about 1.5 TB of disk space.
Config:
All I changed is path.data: /Volumes/THESIS/elasticsearch/
jvm.options:
-Xms8g
-Xmx8g
Index Definitions:
I have 15 different indices; here is one example definition (the others are very similar):
PUT term_stop_words
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_stopwords": {
            "type": "english",
            "tokenizer": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "document": {
      "_source": {
        "excludes": ["url", "content", "title"]
      },
      "properties": {
        "content": { "type": "text" },
        "url":     { "type": "text" },
        "title":   { "type": "text" },
        "id":      { "type": "long" }
      }
    }
  }
}
For each index, I set the refresh interval to -1:
PUT term_stop_words/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}
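(For completeness: once a bulk load finishes, the refresh interval can be set back to the default, e.g.:)

```
PUT term_stop_words/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}
```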
Java Class:
I'm using the class below for indexing:

public class Document {
    // getters & setters omitted
    private long id;
    private String url;
    private String content;
    private String title;
}
BulkProcessor Setup:
BulkProcessor bulkProcessor = BulkProcessor.builder(
        client,
        new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request,
                                  BulkResponse response) {
                if (response.hasFailures()) {
                    logger.error(response.buildFailureMessage());
                }
                logger.info("Flushed " + request.estimatedSizeInBytes()
                        + " bytes for " + request.numberOfActions() + " actions");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request,
                                  Throwable failure) {
                // this overload is called when the whole bulk request failed
                logger.error("Bulk request failed", failure);
            }
        })
        .setBulkActions(500)
        .setConcurrentRequests(4)
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
        .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
        .build();
Feeding BulkProcessor:
for (Document document : DocArrayWithSize15k) {
    IndexRequest indexRequest = new IndexRequest("term_stop_words", "document");
    indexRequest.id(MD5Util.toMD5(document.getUrl()));
    indexRequest.source(gson.toJson(document), XContentType.JSON);
    bulkProcessor.add(indexRequest);
}
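MD5Util is a small helper class of mine; in case it matters, a minimal sketch of what toMD5 does (assuming the JDK's MessageDigest, hex-encoded output):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the MD5Util helper used in the indexing loop above.
public class MD5Util {
    public static String toMD5(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            // %032x pads with leading zeros so the hash is always 32 hex chars
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is guaranteed to be available in every JDK
            throw new IllegalStateException(e);
        }
    }
}
```

So each document's id is a deterministic 32-character hex string derived from its URL.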
After indexing about 100k documents per index with this setup, bulkProcessor.add starts taking longer and performance degrades dramatically: at first this loop over 15k documents takes 3 seconds, but after a while it takes 20 seconds, and about 200 seconds once each index holds 500k documents.
I checked the logs and couldn't see any exceptions, but they contain a lot of lines like:
[2017-12-03T19:23:17,362][INFO ][o.e.m.j.JvmGcMonitorService] [quKHxho] [gc][2676] overhead, spent [267ms] collecting in the last [1s]
I think there is something really wrong with my configuration; otherwise, bulk-indexing 15k documents shouldn't take 200 seconds, should it?
PS: The average document size is 10 KB.