Indexing large HTML fields / cluster instability

Hi Folks!

We're having some performance/stability problems in our cluster while indexing data. In particular, there are two fields with pretty large HTML content that use the custom analyzer below.

The contents of those fields are around 1 MB to 3 MB each.

What we're seeing is nodes dropping out of the cluster frequently while adding docs. The logs show longish garbage collections. The cluster has 5 nodes, each with a 31 GB heap.

Any suggestions to make this easier on the cluster? I don't mind it being slow, but I want to avoid instability.

"html_standard": {
  "filter": [
    "lowercase"
  ],
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "standard"
}

What version of Elasticsearch are you using? It's possible that the indexing is causing memory problems, but far more likely that it's your queries/aggregations.

  • How many documents are you sending per-bulk? Do you constrain the bulk size so that it doesn't go over n mb-per-bulk?
  • How many concurrent processes/threads are sending bulk requests?
  • What kind of query/aggregations are you running?
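
For anyone sending bulk requests from their own client, capping a bulk by size rather than by document count could look roughly like the sketch below. It only groups documents into batches and does not talk to Elasticsearch; the function name `batches_by_bytes` and the use of JSON length as the size estimate are assumptions for illustration, not part of any Elasticsearch client.

```python
import json
from typing import Iterable, Iterator, List


def batches_by_bytes(docs: Iterable[dict], max_bytes: int) -> Iterator[List[dict]]:
    """Group docs so each batch's serialized size stays at or under max_bytes.

    Hypothetical helper: size is estimated from the UTF-8 JSON encoding of
    each document. A single document larger than max_bytes is emitted alone.
    """
    batch: List[dict] = []
    batch_bytes = 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch if adding this doc would exceed the cap.
        if batch and batch_bytes + doc_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch
```

Each yielded batch would then be sent as one bulk request, so a handful of 3 MB documents never ends up in the same request as hundreds of others.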

Hi, thanks for the reply.

Elasticsearch version 5.1.1

I'm using the _reindex API right now, so I'm not sure about parallelism and bulk size, actually.

Hardly any queries today, actually. I quite consistently see these problems when indexing (or reindexing) or updating those documents.

Ah, I see. I'd try lowering the batch size of Reindex; the default is 1000. If your docs are 1-3 MB each, you could be hitting your cluster with 1-3 GB bulk requests, which will definitely make the heap unhappy (it has to buffer up that entire request in new-gen memory before parsing it and sending it to the various shards).

Try setting it to something like 50 to start, and work up from there:

POST _reindex
{
  "source": {
    "index": "source",
    "size": 50
  },
  "dest": {
    "index": "dest"
  }
}
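
The arithmetic behind that suggestion can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate that assumes every document is the same size; the helper names and the 100 MB target in the usage note are hypothetical, not Elasticsearch settings.

```python
MB = 1024 * 1024


def bulk_request_bytes(batch_size: int, doc_bytes: int) -> int:
    """Approximate payload of one bulk request if every doc has doc_bytes."""
    return batch_size * doc_bytes


def max_batch_size(target_bytes: int, doc_bytes: int) -> int:
    """Largest batch size that keeps the payload at or under target_bytes.

    Always returns at least 1, since a batch cannot be empty.
    """
    return max(1, target_bytes // doc_bytes)


# Compare the default batch size (1000) with the suggested starting point (50)
# for documents between 1 MB and 3 MB.
for size in (1000, 50):
    low = bulk_request_bytes(size, 1 * MB) / MB
    high = bulk_request_bytes(size, 3 * MB) / MB
    print(f"batch size {size}: roughly {low:.0f} to {high:.0f} MB per bulk request")
```

Working the other way, `max_batch_size(100 * MB, 3 * MB)` gives 33, so a budget of about 100 MB per request (an arbitrary figure chosen for the example) would put the batch size in the same range as the 50 suggested above.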

Ah, yes of course. Thanks.
