Debugging extremely slow indexing

Fair point indeed. Which is why I asked the OP if they could fire a bunch of simple curl indexing requests in a loop to see how it fares. If the curl requests complete quickly, then chances are the issue is on the client side.

Thank you for all the info, everyone. As @DavidTurner mentioned, the focus is on indexing speed. All of the suggestions are informative about what to do in the future, though, and I am grateful for those. There are a ton of shards out there, but I have (theoretically) enough nodes/resources for all of them according to spec/recommendations. It makes me think I am missing something fundamental that could recur if we set up a new cluster too, especially if we do not fully understand it.

In response to:

Also, the r4 instances are not optimised for I/O operations but rather for memory. Do you run heavy queries in this cluster?

We do not run a lot of heavy or complex querying in this application; we are much more write-heavy.

I will work on getting those request numbers for you @sandeepkanabar and try to get more info in general for everyone shortly.

I have listed all the reasons I have been able to come up with for such poor performance based on what I have seen previously. Most of them are related to the data and cluster state rather than the cluster itself, e.g. use of dynamic mappings causing frequent slow cluster state updates, use of large nested structures in documents, etc.

As most nodes, as far as I understand, appear almost idle, it might be interesting to try loading some simple, uniform data (no dynamic mappings) into a single test index with just a few primary shards to see whether this shows the same behaviour or is significantly faster. It is probably worth indexing documents individually as well as in bulk. This might help pinpoint whether the problem is with the data, the cluster or the client.
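To make that experiment concrete, here is a minimal sketch (in JavaScript, with made-up index and field names) of the request bodies such a test index might use. Depending on your Elasticsearch version you may need to nest the mapping under a type name.

```javascript
// Sketch of the suggested experiment: a small index with just a couple of
// primary shards, strict (non-dynamic) mappings, and simple uniform docs.
// Field names and values are illustrative placeholders.

function testIndexBody() {
  return {
    settings: {
      number_of_shards: 2,    // just a few primaries, per the suggestion
      number_of_replicas: 0,  // keep the test minimal
    },
    mappings: {
      dynamic: 'strict',      // rules out dynamic-mapping cluster updates
      properties: {
        id: { type: 'keyword' },
        value: { type: 'integer' },
      },
    },
  };
}

// Generate n small, uniform documents to index individually and in bulk.
function* uniformDocs(n) {
  for (let i = 0; i < n; i++) yield { id: `doc-${i}`, value: i };
}
```

Send the first body with `PUT <test-index>`, then compare per-document and bulk indexing times for the generated docs against the real data.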

Thank you for the information, Christian! We will try running with some simpler data and see if this is the culprit. We ran the benchmark tests that @sandeepkanabar asked for with our current data last night. We did a batch of 100 docs and a batch of 1000 docs (each doc as a single request, no bulk API). Most documents, ~65%, took between 250 and 600ms each; all other docs took longer, with the next most common duration being about 8 seconds. The fastest we indexed a doc was 169ms. The longest took 8.457 seconds. The mean is 2.63 seconds per doc. The numbers did not vary much between running 100 docs vs 1000.

That is incredibly slow. Can you share the full mappings for one of the indices (assuming they are similar)?

Sorry it took me a bit to respond, Christian. Here is the template with the mapping we use as default. Here is a full mapping for an actual index (/_mappings).

Thanks for that. It looks like the documents have quite a lot of fields, but there is only a single nested field (apparatus) and no fields that seem dynamically generated. That, as far as I can tell, eliminates the data-related aspects I mentioned as a potential cause.

I am not sure what is going on, but it would be interesting to see if indexing small uniform documents performs as expected. It would also be useful to eliminate the ReadOnlyRest plugin and see if that makes any difference.

Do you mean each doc took 250-600ms or was that the time for the whole bulk request? If that's per doc, do you mean that a 1000-doc bulk took 5-10 minutes to complete?

Likewise, do you mean the whole request took about the same time whether the bulk size was 100 or 1000?

Did you manage to start capturing the took value in the response from Elasticsearch?

Hey David,

Do you mean each doc took 250-600ms or was that the time for the whole bulk request? If that's per doc, do you mean that a 1000-doc bulk took 5-10 minutes to complete?

Each doc took that long, and just to be explicit again, this is not using the bulk API. This is a for loop that sent 100/1000 individual requests (executed serially via await) to measure request throughput without any other operations occurring in the system. The total time was about 5-10 minutes.
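For reference, the measurement loop described above can be sketched like this. `indexOne` is a stand-in: swap in your real client call, e.g. `client.index({ index, body })` with the official JS client.

```javascript
// Sketch of the benchmark: index n documents one at a time (serially, via
// await), record per-document latency, and summarise. `indexOne` is a
// placeholder for the actual single-document index request.

async function measureSerialIndexing(indexOne, n) {
  const latencies = [];
  for (let i = 0; i < n; i++) {
    const start = Date.now();
    await indexOne(i);                 // one request at a time, no bulk API
    latencies.push(Date.now() - start);
  }
  const sorted = [...latencies].sort((a, b) => a - b);
  return {
    count: n,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    mean: latencies.reduce((s, x) => s + x, 0) / n,
    median: sorted[Math.floor(n / 2)],
  };
}
```

Capturing the `took` field from each response alongside the wall-clock latency would also separate server time from network/client time.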

Likewise, do you mean the whole request took about the same time whether the bulk size was 100 or 1000?

Not whole-request time, just time per doc. Sadly I forgot to capture the took value during this test, but I do have this data (below) which may provide similar insights. This first output is from a single re-indexing job I ran as a test on our cluster. The Took value at the bottom of the JSON object is the timestamp difference from job start to finish on the client side. I reindexed a single department's data for the year 2019 (currently binned by month) into a single "year" index. This was only 284 docs spread across 12 indices (very bad...I know, I know). There is almost a 13-second difference between the took time from the server and my calculation. That discrepancy aside, it still took ~77 seconds to finish according to ES. This is very peculiar because this is an internal operation. Also, I passed an array for source to the reindex API call, so it is as efficient a request as I can make from the client.
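For reference, the shape of a reindex request with an array of source indices can be sketched like this (index names shortened to illustrative placeholders):

```javascript
// Sketch of a _reindex request body that merges several monthly indices
// into a single yearly index. Index names here are placeholders.
function reindexBody(sourceIndices, destIndex) {
  return {
    source: { index: sourceIndices },  // an array of indices is accepted
    dest: { index: destIndex },
  };
}

// e.g. sent as POST _reindex?wait_for_completion=false with this body:
const body = reindexBody(
  ['dept-fire-incident-2019-01', 'dept-fire-incident-2019-02'],
  'dept-fire-incident-2019'
);
```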

taskID: 27ARYU__TDixIQpDon76-Q:2928902
{
  completed: true,
  task: {
    node: '27ARYU__TDixIQpDon76-Q',
    id: 2928902,
    type: 'transport',
    action: 'indices:data/write/reindex',
    status: {
      total: 284,
      updated: 0,
      created: 284,
      deleted: 0,
      batches: 1,
      version_conflicts: 0,
      noops: 0,
      retries: [Object],
      throttled_millis: 0,
      requests_per_second: -1,
      throttled_until_millis: 0
    },
    description: 'reindex from [98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-02-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-04-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-11-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-09-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-08-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-05-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-03-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-01-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-07-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-06-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-12-es6fireincident, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-10-es6fireincident] to [98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-es6fireincident]',
    start_time_in_millis: 1608104270657,
    running_time_in_nanos: 76968546958,
    cancellable: true,
    headers: {}
  },
  response: {
    took: 76965,
    timed_out: false,
    total: 284,
    updated: 0,
    created: 284,
    deleted: 0,
    batches: 1,
    version_conflicts: 0,
    noops: 0,
    retries: { bulk: 0, search: 0 },
    throttled_millis: 0,
    requests_per_second: -1,
    throttled_until_millis: 0,
    failures: []
  }
}
Reindex completed: 98751-in-wayne_township_volunteer_fire_department-fire-incident-2019-es6fireincident
Took 90439ms

I ran the same test on an even smaller set of data (same department for 2018); there were only a handful of months we had data for in 2018 (34 docs spread over 3 indices). Still terrible performance.

{
  completed: true,
  task: {
    node: '27ARYU__TDixIQpDon76-Q',
    id: 3315814,
    type: 'transport',
    action: 'indices:data/write/reindex',
    status: {
      total: 34,
      updated: 0,
      created: 34,
      deleted: 0,
      batches: 1,
      version_conflicts: 0,
      noops: 0,
      retries: [Object],
      throttled_millis: 0,
      requests_per_second: -1,
      throttled_until_millis: 0
    },
    description: 'reindex from [98751-in-wayne_township_volunteer_fire_department-fire-incident-2018-12-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2018-10-es6fireincident-m1, 98751-in-wayne_township_volunteer_fire_department-fire-incident-2018-11-es6fireincident-m1] to [98751-in-wayne_township_volunteer_fire_department-fire-incident-2018-es6fireincident]',
    start_time_in_millis: 1608147013960,
    running_time_in_nanos: 41048340461,
    cancellable: true,
    headers: {}
  },
  response: {
    took: 41046,
    timed_out: false,
    total: 34,
    updated: 0,
    created: 34,
    deleted: 0,
    batches: 1,
    version_conflicts: 0,
    noops: 0,
    retries: { bulk: 0, search: 0 },
    throttled_millis: 0,
    requests_per_second: -1,
    throttled_until_millis: 0,
    failures: []
  }
}
Took 60320ms

The metrics are a bit disheartening because we know we need to reindex to reduce shards and condense data into fewer indices, but it seems like this is gonna take a long time, haha. Anyway, hopefully this is helpful information. @Christian_Dahlqvist I will look into dropping the RoR plugin to see if it helps. Thanks again for all the help so far, everyone.

Hmm, that's definitely weird; what on earth was happening in those 13 seconds? That should only be travel time to/from the server, plus time to find a spare thread and parse the request; there's no way it should be more than a few milliseconds in such a quiet cluster.

Reindex isn't very internal; it's really just a scroll search plugged into a bulk indexer. It runs on a node in the cluster for sure, but it's only using client-facing APIs.

@beckerr Is the Elasticsearch container (you mentioned k8s) or process running in a multi-tenanted environment? Is there any chance that other processes on the host are consuming resources to the extent that Elasticsearch is waiting in line for them?

I also noticed that there are quite a few references to issues with maximum file descriptors in a k8s context. This thread might be a good starting point for that line of investigation; there are a few references on it to Elasticsearch issues caused by low default ulimit settings. This is a bit of a shot in the dark without enough info to draw any conclusions, but it could represent a new line of investigation.

Thank you for the info/thread @egalpin,

I will check with dev-ops on multitenancy. I know we use fluentd to capture metrics, but I believe I had already disabled this thinking it might be the issue. I can revisit, though, and look for other processes, etc.

@DavidTurner

Hmm, that's definitely weird, what on earth was happening in those 13 seconds?

This is my bad: I forgot that I poll for task completion at 30s intervals. The task may actually complete well before the polling function returns, and that idle wait was getting added to my timestamp...this is eliminated from the equation now.
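To keep the client-side measurement honest, the poll interval needs to be short relative to the task duration. A sketch, with a hypothetical `getTask` function standing in for a `GET _tasks/<taskId>` call:

```javascript
// Poll a task-status endpoint at a short interval so the client-side
// elapsed time tracks actual completion closely. `getTask` is a stand-in
// for fetching GET _tasks/<taskId> with your client of choice.

async function waitForTask(getTask, intervalMs = 1000) {
  const start = Date.now();
  for (;;) {
    const task = await getTask();
    if (task.completed) return { task, elapsedMs: Date.now() - start };
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}
```

With a 1s interval, the worst-case overcount is about one second instead of the 30 seconds seen above.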

Reindex isn't very internal; it's really just a scroll search plugged into a bulk indexer. It runs on a node in the cluster for sure, but it's only using client-facing APIs.

Are you saying that this is a function that gets broken down into multiple client calls with several responses under the hood, or just that the nodes use the same client API to reindex between one another? What I was originally trying to convey with "internal" is that the index data is only copied from node to node within the cluster, not from the local client.

The latter: you are correct that it does its work without involving the "external" client that originally sent the POST _reindex call. But it's not doing anything especially magical within the cluster; it just saves the round trips on the network, so its performance should not be much different to bulk indexing from outside.

If you are using VMs or containers, make sure Elasticsearch has access to the resources you believe you have given it and that memory is not overprovisioned and mapped to disk behind the scenes.

Hey @beckerr, can you repeat the re-indexing test with the following settings applied on the index:

  1. Scenario 1: "refresh_interval" : "30s"
  2. Scenario 2: "refresh_interval" : "-1"

Just curious to see if this speeds up re-indexing.

Also, disable replicas by setting "number_of_replicas": 0 on the index in both scenarios.
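As a sketch, the two settings payloads (to send via `PUT <index>/_settings` before reindexing, and revert afterwards) would look like:

```javascript
// Scenario 1: refresh every 30s, no replicas during the load.
const scenario1 = {
  index: {
    refresh_interval: '30s',
    number_of_replicas: 0,
  },
};

// Scenario 2: disable refresh entirely during the load, no replicas.
const scenario2 = {
  index: {
    refresh_interval: '-1',
    number_of_replicas: 0,
  },
};
```

Remember to restore the original refresh interval and replica count once the reindex finishes, or the new index will not be searchable or redundant.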

Thanks for the replies everyone,

@DavidTurner thanks for the clarification!
@Christian_Dahlqvist when you said -

...that memory is not overprovisioned and mapped to disk behind the scenes

I assume this means that the JVM is not accidentally using virtual memory? I'm pretty sure I already checked this and confirmed it was not before I started the discussion, but I will double-check.

@sandeepkanabar
I set refresh: false when I make the reindex call. Does that have the same effect as "refresh_interval": "-1"?
I will try disabling replicas too.

General update - we are just grinding away on re-indexing to reduce shard count and get the data organized in a sane manner once again. I will keep you all posted on how it goes and how much improvement we see (it will probably take a few days). I really want to thank you all for the support and suggestions thus far; you all are warriors for slogging through this with me. The responsiveness and passion of this community in general have left me impressed!

We finally finished re-indexing everything we could. Below is the final shard count, which has been reduced from over 24k.

{
    "cluster_name": "elasticsearch",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 52,
    "number_of_data_nodes": 43,
    "active_primary_shards": 1777,
    "active_shards": 3554,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}

Unfortunately our indexing performance is still not good. Here is the output from one of the last re-indexing tasks I performed:

{
  completed: true,
  task: {
    node: 'r39uxV_XSRy7zYDQY7BTTg',
    id: 28904733,
    type: 'transport',
    action: 'indices:data/write/reindex',
    status: {
      total: 287796,
      updated: 261406,
      created: 26390,
      deleted: 0,
      batches: 288,
      version_conflicts: 0,
      noops: 0,
      retries: [Object],
      throttled_millis: 0,
      requests_per_second: -1,
      throttled_until_millis: 0
    },
    description: 'reindex from [88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-11-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-12-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-02-es6apparatusfireincident-m3, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-08-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-06-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-10-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-07-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-03-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-05-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-09-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-04-es6apparatusfireincident, 88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-01-es6apparatusfireincident-m3] to [88403-tn-memphis_division_of_fire_services-apparatus-fire-incident-2020-es6apparatusfireincident]',
    start_time_in_millis: 1609888634911,
    running_time_in_nanos: 1189569144096,
    cancellable: true,
    headers: {}
  },
  response: {
    took: 1189564,
    timed_out: false,
    total: 287796,
    updated: 261406,
    created: 26390,
    deleted: 0,
    batches: 288,
    version_conflicts: 0,
    noops: 0,
    retries: { bulk: 0, search: 0 },
    throttled_millis: 0,
    requests_per_second: -1,
    throttled_until_millis: 0,
    failures: []
  }
}

You can see it took almost 20 minutes to re-index approximately 287k docs. We are looking even deeper into our cluster configuration. I will continue to post updates.

We finally solved our issue! I highly recommend this article, as it had all the answers for us. Essentially, the JVM tries to guess at some garbage-collection settings, but since we are running on Kubernetes the JVM has no accurate way to infer the machine's properties, so it guesses badly. Thanks everyone for the help. I really hope this helps others in the future as well.
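For anyone hitting the same thing in containers, a common mitigation (an assumption here, not necessarily the article's exact fix) is to stop relying on JVM ergonomics and pin the heap explicitly, e.g. via the ES_JAVA_OPTS environment variable in the pod spec:

```shell
# Pin the heap explicitly instead of letting JVM ergonomics guess from
# (possibly wrong) detected machine properties. Sizes are illustrative:
# set -Xms and -Xmx equal, and comfortably below the container memory limit.
# A collector can also be forced explicitly, e.g. -XX:+UseG1GC.
export ES_JAVA_OPTS="-Xms4g -Xmx4g"
```

The same flags can instead go in `jvm.options`; the point is that the values are stated explicitly rather than derived from whatever the JVM detects inside the container.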

The article is indeed an excellent read. Thanks for sharing.
