Single node ES 2.4.5 "cluster" locks up on reindex reads

Hello All,

I've been flailing for some time on upgrading a two-node ES 1.x cluster to 5.x. I managed to upgrade the indices in place from 0.20 (some of them, anyway) to 1.7 and then take the cluster as far as 2.4.5, but then hit a wall: I've been unable to use the migration plugin to convert the indices any further for the upgrade to 5.x.

I finally scrapped that plan and spun up a new 5.x cluster, which is working beautifully, and am now flailing on importing the old indices into the new cluster. I've cloned one node of the existing old cluster (still in prod), deleted all indices, restored from a snapshot of the prod cluster, and then upgraded to 2.4.5. I set all indices to a single copy (no replicas). The short version is that I now have a single node running a "holding" copy of the previous indices; status is green and all is well.
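For reference, dropping the replicas was just a settings update across all indices, roughly like this (a sketch; the host is illustrative):

```python
# Sketch: drop replicas on every index of the single-node holding cluster.
# "localhost:9200" is illustrative; point it at the 2.4.5 staging node.
import requests

HOLDING = "http://localhost:9200"

resp = requests.put(
    f"{HOLDING}/_all/_settings",
    json={"index": {"number_of_replicas": 0}},
)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```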

Note that this ELK cluster primarily holds logs and isn't heavily queried, so we're now running 1 shard per index, but the older indices have shard counts ranging from 5 down to 1. I seem to be able to pull indices with any shard count, though, as long as the source is responding at all (see below).

Now I'm trying to use the reindex API to pull from the holding node into the new cluster. I can pull a few indices successfully, but then the source system bogs down and eventually ES stops responding to any call, including _cat/indices, _cluster/health, and _tasks. If I run those calls too often, they seem to push it over too. Eventually I have to restart ES, then wait for all unassigned shards to be assigned: about 3200 shards and roughly 45 minutes.

This is running on a 4 core VM with 8GB RAM, BTW.

I admit I'm pretty new to ELK in general. Am I missing something obvious?

Hope to hear from you,

Randy in Seattle

Update: I've spun up a new VM with a clean install of openSUSE and a clean install of ES 2.4.5, and restored a snapshot from the clunky SLES build, so I now have a single-node cluster with 4 cores and 16GB RAM and a clean, stable copy of my ~1500 indices; all is green. I've scripted something in Python to confirm that the indices on the new machine all have document counts matching the originals.
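The count check is roughly this (a sketch; host names are illustrative):

```python
# Sketch of the doc-count check: compare docs.count per index between the
# old and new staging nodes. Host names are illustrative.
import requests

OLD = "http://old-staging:9200"
NEW = "http://new-staging:9200"

def doc_counts(host):
    resp = requests.get(
        f"{host}/_cat/indices",
        params={"format": "json", "h": "index,docs.count"},
    )
    resp.raise_for_status()
    return {row["index"]: int(row["docs.count"] or 0) for row in resp.json()}

old_counts, new_counts = doc_counts(OLD), doc_counts(NEW)
for index, count in sorted(old_counts.items()):
    if new_counts.get(index) != count:
        print(f"MISMATCH {index}: {count} vs {new_counts.get(index)}")
```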

Same problem. When I run a reindex from remote to the new cluster, I can only pull so much before the source cluster locks up. I can't even gracefully restart ES; I have to kill -9 the process and restart. Once the shards are all assigned and the cluster status is green again, I can start reindexing again. FWIW, the new system is notably quicker to get itself straightened out, and I can complete more indices before it goes south.

The reindex calls set 2h timeouts for the socket and connect, with a batch size of 100, and I'm pausing five minutes between one reindex call and the next.
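Each call looks roughly like this (a sketch; hosts and index names are illustrative, and the 5.x side already has the source whitelisted via reindex.remote.whitelist):

```python
# Sketch of one reindex-from-remote call issued against the 5.x target.
# Hosts and index names are illustrative; reindex.remote.whitelist in the
# target's elasticsearch.yml must already allow the source host.
import time
import requests

TARGET = "http://new-cluster:9200"   # 5.x cluster that runs the reindex
SOURCE = "http://old-staging:9200"   # 2.4.5 holding node

def reindex(index):
    body = {
        "source": {
            "remote": {
                "host": SOURCE,
                "socket_timeout": "2h",
                "connect_timeout": "2h",
            },
            "index": index,
            "size": 100,              # scroll batch size
        },
        "dest": {"index": index},
    }
    resp = requests.post(f"{TARGET}/_reindex", json=body, timeout=7200)
    resp.raise_for_status()
    return resp.json()

for index in ["logstash-2016.01.01", "logstash-2016.01.02"]:
    print(index, reindex(index).get("total"))
    time.sleep(300)                   # pause five minutes between indices
```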

Is this a problem with the way I'm calling the remote reindex or with the configuration of the one-node source cluster?

Hope to hear from you...

Randy

If you have 1500 indices on a single node, I would recommend you read this blog post about shards and sharding. Given the number of indices and shards you have, I would not be surprised if you were suffering from heap pressure.

As we're running mostly a log archive with little query traffic, we're using a single shard per index, except for the older indices created before we knew better. Should I consider adding nodes to the source cluster, even if only for as long as we need it?

Late for a 4 hour meeting but will read your link ASAP!

I would recommend installing X-Pack monitoring so you get a better view of what goes on in the cluster. 1500 indices, even with just 1 primary shard each, sounds like a lot on a node that size, so I would recommend changing how you organise your data and potentially merging indices using the reindex API, e.g. into weekly or monthly indices or by having different types of data share indices. Depending on the state of your cluster you may, however, need to scale up or out before you can make these changes.
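As a rough sketch, a month of daily indices can be folded into one monthly index with a single reindex call (host and index names are illustrative):

```python
# Rough sketch: fold a month of daily log indices into a single monthly index
# with the reindex API. Host and index names are illustrative, and this
# assumes all 31 daily indices exist.
import requests

ES = "http://new-cluster:9200"

daily = [f"logstash-2016.01.{day:02d}" for day in range(1, 32)]

body = {
    "source": {"index": daily},            # reindex accepts a list of source indices
    "dest": {"index": "logstash-2016.01"}, # one consolidated monthly index
}
resp = requests.post(f"{ES}/_reindex", json=body, timeout=7200)
resp.raise_for_status()
print(resp.json().get("total"), "docs copied into logstash-2016.01")
```

Once the document counts check out, the daily indices can be deleted, which brings the shard count down accordingly.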

X-Pack looks great from the docs, but the problems I'm having are on the 2.4.5 cluster/machine that I'm using as a staging point for reindexing. Any guidance on troubleshooting in 2.4.5?

You can get some statistics from the cluster stats API. That will give us some idea about the state of the cluster even without monitoring installed.
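Something along these lines will pull a few of the telling numbers (the host is illustrative):

```python
# Sketch: pull a few telling numbers from the cluster stats API on the
# 2.4.5 staging node (host is illustrative).
import requests

ES = "http://old-staging:9200"

stats = requests.get(f"{ES}/_cluster/stats").json()
print("indices: ", stats["indices"]["count"])
print("shards:  ", stats["indices"]["shards"]["total"])
print("segments:", stats["indices"]["segments"]["count"])
jvm = stats["nodes"]["jvm"]["mem"]
print("heap used:", jvm["heap_used_in_bytes"], "of", jvm["heap_max_in_bytes"])
```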

Adding a second node to my staging cluster seems to have solved my problem. Indices are now importing just fine on the target cluster, both the tiny ones in a second or two and the largest so far (about 1M docs) in about 800 seconds.

Indices seem to be much larger on the target: a 3-shard index with 980K documents that took up 400 MB on the source (both store.size and pri.store.size) has the same doc count on the target, but the target's 1-shard index takes up 2.7 GB of store.size and 1.1 GB of pri.store.size. What am I not understanding about these measures?

Elasticsearch 5.x uses doc_values to a greater extent than earlier versions, which can lead to an increase in size on disk. It is also important to check that your indices have comparable mappings, as mappings can have a significant impact on size on disk. In addition, the number of segments can affect size as well, so you may want to run a force merge on a few indices (with max_num_segments set to 1) to see if that makes any difference. Last but not least, you may also want to look at your data to see if you have a lot of sparse fields.
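A force merge on one of the new indices looks roughly like this (host and index name are illustrative); the call blocks until the merge finishes, so compare store.size before and after:

```python
# Sketch: force merge one index on the 5.x target down to a single segment
# to see how much of the size difference is segment overhead.
# Host and index name are illustrative; the request blocks until done.
import requests

ES = "http://new-cluster:9200"

resp = requests.post(
    f"{ES}/logstash-2016.01.01/_forcemerge",
    params={"max_num_segments": 1},
)
resp.raise_for_status()
print(resp.json())
```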
