Debugging extremely slow indexing

Fair point indeed. Which is why I asked the OP if they could fire a bunch of simple curl indexing requests in a loop to see how it fares. If the curl requests complete quickly, then chances are the issue is on the client side.

Thank you for all the info, everyone. As @DavidTurner mentioned, the focus is on indexing speed. All of the suggestions are informative about what to do in the future, though, and I am grateful for those. There are a ton of shards out there, but I have (theoretically) enough nodes/resources for all of them according to spec/recommendations. It makes me think I am missing something fundamental that could recur if we set up a new cluster too, especially if we do not fully understand it.

In response to:

Also, the r4 instances are not optimised for I/O operations but rather for memory. Do you run heavy queries in this cluster?

We do not run a lot of heavy or complex querying in this application; we are much more write-heavy.

I will work on getting those request numbers for you @sandeepkanabar and try to get more info in general for everyone shortly.

I have listed all the reasons I have been able to come up with for such poor performance based on what I have seen previously. Most of them are related to the data and cluster state rather than the cluster itself, e.g. use of dynamic mappings causing frequent slow cluster state updates, use of large nested structures in documents, etc.

As most nodes, as far as I understand, appear almost idle, it might be interesting to try loading some simple, uniform data (no dynamic mappings) into a single test index with just a few primary shards to see whether this shows the same behaviour or is significantly faster. It is probably worth indexing documents individually as well as in bulk. This might help pinpoint whether the problem is with the data, the cluster or the client.
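To make that experiment concrete, here is a minimal sketch (in JavaScript, with made-up index and field names) of the request bodies such a test index might use. Depending on your Elasticsearch version you may need to nest the mapping under a type name.

```javascript
// Sketch of the suggested experiment: a small index with just a couple of
// primary shards, strict (non-dynamic) mappings, and simple uniform docs.
// Field names and values are illustrative placeholders.

function testIndexBody() {
  return {
    settings: {
      number_of_shards: 2,    // just a few primaries, per the suggestion
      number_of_replicas: 0,  // keep the test minimal
    },
    mappings: {
      dynamic: 'strict',      // rules out dynamic-mapping cluster updates
      properties: {
        id: { type: 'keyword' },
        value: { type: 'integer' },
      },
    },
  };
}

// Generate n small, uniform documents to index individually and in bulk.
function* uniformDocs(n) {
  for (let i = 0; i < n; i++) yield { id: `doc-${i}`, value: i };
}
```

Send the first body with `PUT <test-index>`, then compare per-document and bulk indexing times for the generated docs against the real data.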

Thank you for the information, Christian! We will try running with some simpler data and see if this is the culprit. We ran the benchmark tests that @sandeepkanabar asked for with our current data last night. We did a batch of 100 docs and a batch of 1000 docs (each doc as a single request, no bulk API). Most documents, ~65%, took between 250 and 600ms each; all other docs took longer, with the next most common duration being about 8 seconds. The fastest we indexed a doc was 169ms. The longest took 8.457 seconds. The mean is 2.63 seconds per doc. The numbers did not vary much between running 100 docs vs 1000.

That is incredibly slow. Can you share the full mappings for one of the indices (assuming they are similar)?

Sorry it took me a bit to respond, Christian. Here is the template with the mapping we use as default. Here is a full mapping for an actual index (/_mappings).

Thanks for that. It looks like the documents have quite a lot of fields, but there is only a single nested field (apparatus) and no fields that seem dynamically generated. That, as far as I can tell, eliminates the data-related aspects I mentioned as a potential cause.

I am not sure what is going on, but it would be interesting to see if indexing small uniform documents performs as expected. It would also be useful to eliminate the ReadOnlyRest plugin and see if that makes any difference.

Do you mean each doc took 250-600ms or was that the time for the whole bulk request? If that's per doc, do you mean that a 1000-doc bulk took 5-10 minutes to complete?

Likewise, do you mean the whole request took about the same time whether the bulk size was 100 or 1000?

Did you manage to start capturing the took value in the response from Elasticsearch?

Hey David,

Do you mean each doc took 250-600ms or was that the time for the whole bulk request? If that's per doc, do you mean that a 1000-doc bulk took 5-10 minutes to complete?

Each doc took that long, and just to be explicit again, this is not using the bulk API. This is a for loop that sent 100/1000 individual requests (executed serially via await) to measure request throughput without any other operations occurring in the system. The total time was about 5-10 minutes.
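For reference, the measurement loop described above can be sketched like this. `indexOne` is a stand-in: swap in your real client call, e.g. `client.index({ index, body })` with the official JS client.

```javascript
// Sketch of the benchmark: index n documents one at a time (serially, via
// await), record per-document latency, and summarise. `indexOne` is a
// placeholder for the actual single-document index request.

async function measureSerialIndexing(indexOne, n) {
  const latencies = [];
  for (let i = 0; i < n; i++) {
    const start = Date.now();
    await indexOne(i);                 // one request at a time, no bulk API
    latencies.push(Date.now() - start);
  }
  const sorted = [...latencies].sort((a, b) => a - b);
  return {
    count: n,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    mean: latencies.reduce((s, x) => s + x, 0) / n,
    median: sorted[Math.floor(n / 2)],
  };
}
```

Capturing the `took` field from each response alongside the wall-clock latency would also separate server time from network/client time.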

Likewise, do you mean the whole request took about the same time whether the bulk size was 100 or 1000?

Not whole-request time, just time per doc. Sadly I forgot to capture the took value during this test, but I do have this data (below) which may provide similar insights. This first output is from a single re-indexing job I ran as a test on our cluster. The Took value at the bottom of the JSON object is the timestamp difference from job start to finish on the client side. I reindexed a single department's data for the year 2019 (currently binned by month) into a single "year" index. This was only 284 docs spread across 12 indices (very bad...I know, I know). There is almost a 13-second difference between the took time from the server and my calculation. That discrepancy aside, it still took ~77 seconds to finish according to ES. This is very peculiar because this is an internal operation. Also, I passed an array for source to the reindex API call, so it is as efficient a request as I can make from the client.
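For reference, the shape of a reindex request with an array of source indices can be sketched like this (index names shortened to illustrative placeholders):

```javascript
// Sketch of a _reindex request body that merges several monthly indices
// into a single yearly index. Index names here are placeholders.
function reindexBody(sourceIndices, destIndex) {
  return {
    source: { index: sourceIndices },  // an array of indices is accepted
    dest: { index: destIndex },
  };
}

// e.g. sent as POST _reindex?wait_for_completion=false with this body:
const body = reindexBody(
  ['dept-fire-incident-2019-01', 'dept-fire-incident-2019-02'],
  'dept-fire-incident-2019'
);
```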

taskID: 27ARYU__TDixIQpDon76-Q:2928902
{
  completed: true,
  task: {
    node: '27ARYU__TDixIQpDon76-Q',
    id: 2928902,
    type: 'transport',
    action: 'indices:data/write/reindex',
    status: {
      total: 284,
      updated: 0,
      created: 284,
      deleted: 0,
      batches: 1,
      version_conflicts: 0,
      noops: 0,
      retries: [Object],
      throttled_millis: 0,
      requests_per_second: -1,
      throttled_until_millis: 0
    },
    description: 'reindex from [98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-02-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-04-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-11-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-09-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-08-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-05-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-03-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-01-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-07-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-06-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-12-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-10-es6fireincident] to [98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-es6fireincident]',
    start_time_in_millis: 1608104270657,
    running_time_in_nanos: 76968546958,
    cancellable: true,
    headers: {}
  },
  response: {
    took: 76965,
    timed_out: false,
    total: 284,
    updated: 0,
    created: 284,
    deleted: 0,
    batches: 1,
    version_conflicts: 0,
    noops: 0,
    retries: { bulk: 0, search: 0 },
    throttled_millis: 0,
    requests_per_second: -1,
    throttled_until_millis: 0,
    failures: []
  }
}
Reindex completed: 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-es6fireincident
Took 90439ms

I ran the same test on an even smaller set of data (same department for 2018); there were only a handful of months we had data for in 2018 (34 docs spread over 3 indices). Still terrible performance.

{
  completed: true,
  task: {
    node: '27ARYU__TDixIQpDon76-Q',
    id: 3315814,
    type: 'transport',
    action: 'indices:data/write/reindex',
    status: {
      total: 34,
      updated: 0,
      created: 34,
      deleted: 0,
      batches: 1,
      version_conflicts: 0,
      noops: 0,
      retries: [Object],
      throttled_millis: 0,
      requests_per_second: -1,
      throttled_until_millis: 0
    },
    description: 'reindex from [98751-in-wayne_township_volunteer_fire_department-fire-incident-2018-12-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2018-10-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2018-11-es6fireincident-m1] to [98751-in-wayne_township_volunteer_fire_department-fire-incident-2018-es6fireincident]',
    start_time_in_millis: 1608147013960,
    running_time_in_nanos: 41048340461,
    cancellable: true,
    headers: {}
  },
  response: {
    took: 41046,
    timed_out: false,
    total: 34,
    updated: 0,
    created: 34,
    deleted: 0,
    batches: 1,
    version_conflicts: 0,
    noops: 0,
    retries: { bulk: 0, search: 0 },
    throttled_millis: 0,
    requests_per_second: -1,
    throttled_until_millis: 0,
    failures: []
  }
}
Took 60320ms

The metrics are a bit disheartening because we know we need to reindex to reduce shards and condense data into fewer indices, but it seems like this is gonna take a long time, haha. Anyway, hopefully this is helpful information. @Christian_Dahlqvist I will look into dropping the RoR plugin to see if it helps. Thanks again for all the help so far, everyone.

Hmm, that's definitely weird; what on earth was happening in those 13 seconds? That should only be travel time to/from the server, plus time to find a spare thread and parse the request; there's no way it should be more than a few milliseconds in such a quiet cluster.

Reindex isn't very internal; it's really just a scroll search plugged into a bulk indexer. It runs on a node in the cluster for sure, but it's only using client-facing APIs.

@beckerr Is the Elasticsearch container (you mentioned k8s) or process running in a multi-tenanted environment? Is there any chance that other processes on the host are consuming resources to the extent that Elasticsearch is waiting in line for them?

I also noticed that there are quite a few references to issues with maximum file descriptors in a k8s context. This thread might be a good starting point for that line of investigation; there are a few references on it to Elasticsearch issues caused by low default ulimit settings. This is a bit of a shot in the dark without enough info to draw any conclusions, but it could represent a new line of investigation.

Thank you for the info/thread @egalpin,

I will check with dev-ops on multitenancy. I know we use fluentd to capture metrics, but I believe I had already disabled this thinking it might be the issue. I can revisit, though, and look for other processes, etc.

@DavidTurner

Hmm, that's definitely weird, what on earth was happening in those 13 seconds?

This is my bad: I forgot that I poll for task completion at 30s intervals. The task may actually complete well before the polling function returns, and that idle wait was getting added to my timestamp...this is eliminated from the equation now.
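To keep the client-side measurement honest, the poll interval needs to be short relative to the task duration. A sketch, with a hypothetical `getTask` function standing in for a `GET _tasks/<taskId>` call:

```javascript
// Poll a task-status endpoint at a short interval so the client-side
// elapsed time tracks actual completion closely. `getTask` is a stand-in
// for fetching GET _tasks/<taskId> with your client of choice.

async function waitForTask(getTask, intervalMs = 1000) {
  const start = Date.now();
  for (;;) {
    const task = await getTask();
    if (task.completed) return { task, elapsedMs: Date.now() - start };
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}
```

With a 1s interval, the worst-case overcount is about one second instead of the 30 seconds seen above.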

Reindex isn't very internal; it's really just a scroll search plugged into a bulk indexer. It runs on a node in the cluster for sure, but it's only using client-facing APIs.

Are you saying that this is a function that gets broken down into multiple client calls with several responses under the hood, or just that the nodes use the same client API to reindex between one another? What I was originally trying to convey with "internal" is that the index data is only copied from node to node within the cluster, not from the local client.

The latter: you are correct that it does its work without involving the "external" client that originally sent the POST _reindex call. But it's not doing anything especially magical within the cluster; it just saves the round trips on the network, so its performance should not be much different to bulk indexing from outside.

If you are using VMs or containers, make sure Elasticsearch has access to the resources you believe you have given it and that memory is not overprovisioned and mapped to disk behind the scenes.

Hey @beckerr, can you repeat the re-indexing test with the following settings applied on the index:

  1. Scenario 1: "refresh_interval" : "30s"
  2. Scenario 2: "refresh_interval" : "-1"

Just curious to see if this speeds up re-indexing.

Also, disable replicas by setting "number_of_replicas": 0 on the index in both scenarios.
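As a sketch, the two settings payloads (to send via `PUT <index>/_settings` before reindexing, and revert afterwards) would look like:

```javascript
// Scenario 1: refresh every 30s, no replicas during the load.
const scenario1 = {
  index: {
    refresh_interval: '30s',
    number_of_replicas: 0,
  },
};

// Scenario 2: disable refresh entirely during the load, no replicas.
const scenario2 = {
  index: {
    refresh_interval: '-1',
    number_of_replicas: 0,
  },
};
```

Remember to restore the original refresh interval and replica count once the reindex finishes, or the new index will not be searchable or redundant.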

Thanks for the replies everyone,

@DavidTurner thanks for the clarification!
@Christian_Dahlqvist when you said -

...that memory is not overprovisioned and mapped to disk behind the scenes

I assume this means that the JVM is not accidentally using virtual memory? I'm pretty sure I already checked this and confirmed it was not before I started the discussion, but I will double-check.

@sandeepkanabar
I set refresh: false when I make the reindex call. Does that have the same effect as "refresh_interval": "-1"?
I will try disabling replicas too.

General update - we are just grinding away on re-indexing to reduce shard count and get the data organized in a sane manner once again. I will keep you all posted on how it goes and how much improvement we see (it will probably take a few days). I really want to thank you all for the support and suggestions thus far; you all are warriors for slogging through this with me. The responsiveness and passion of this community in general have left me impressed!

We finally finished re-indexing everything we could. Below is the final shard count, which has been reduced from over 24k.

{
    "cluster_name": "elasticsearch",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 52,
    "number_of_data_nodes": 43,
    "active_primary_shards": 1777,
    "active_shards": 3554,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}

Unfortunately our indexing performance is still not good. Here is the output from one of the last re-indexing tasks I performed:

{
  completed: true,
  task: {
    node: 'r39uxV_XSRy7zYDQY7BTTg',
    id: 28904733,
    type: 'transport',
    action: 'indices:data/write/reindex',
    status: {
      total: 287796,
      updated: 261406,
      created: 26390,
      deleted: 0,
      batches: 288,
      version_conflicts: 0,
      noops: 0,
      retries: [Object],
      throttled_millis: 0,
      requests_per_second: -1,
      throttled_until_millis: 0
    },
    description: 'reindex from [88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-11-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-12-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-02-es6apparatusfireincident-m3, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-08-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-06-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-10-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-07-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-03-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-05-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-09-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-04-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-01-es6apparatusfireincident-m3] to [88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-es6apparatusfireincident]',
    start_time_in_millis: 1609888634911,
    running_time_in_nanos: 1189569144096,
    cancellable: true,
    headers: {}
  },
  response: {
    took: 1189564,
    timed_out: false,
    total: 287796,
    updated: 261406,
    created: 26390,
    deleted: 0,
    batches: 288,
    version_conflicts: 0,
    noops: 0,
    retries: { bulk: 0, search: 0 },
    throttled_millis: 0,
    requests_per_second: -1,
    throttled_until_millis: 0,
    failures: []
  }
}

You can see it took almost 20 minutes to re-index approximately 287k docs. We are looking even deeper into our cluster configuration. I will continue to post updates.

We finally solved our issue! I highly recommend this article, as it had all the answers for us. Essentially, the JVM tries to guess at some garbage-collection settings, but since we are running on Kubernetes the JVM has no accurate way to infer the machine's properties, so it guesses badly. Thanks everyone for the help. I really hope this helps others in the future as well.
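For anyone hitting the same thing in containers, a common mitigation (an assumption here, not necessarily the article's exact fix) is to stop relying on JVM ergonomics and pin the heap explicitly, e.g. via the ES_JAVA_OPTS environment variable in the pod spec:

```shell
# Pin the heap explicitly instead of letting JVM ergonomics guess from
# (possibly wrong) detected machine properties. Sizes are illustrative:
# set -Xms and -Xmx equal, and comfortably below the container memory limit.
# A collector can also be forced explicitly, e.g. -XX:+UseG1GC.
export ES_JAVA_OPTS="-Xms4g -Xmx4g"
```

The same flags can instead go in `jvm.options`; the point is that the values are stated explicitly rather than derived from whatever the JVM detects inside the container.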

The article is indeed an excellent read. Thanks for sharing.
