Deleting and creating indexes over time kills index performance

We are using ES 2.1.0 on quad-core virtual machines with 12 GB RAM and SAN disks.

We have a process where we create a new index and, when it is done, point an alias at it.
Later, we build a replacement index and, again when it is done, repoint the alias to it and delete the old version.

We do this over and over probably a couple hundred times a day for different indices.
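
The swap step itself is just the standard atomic alias update; roughly something like this (the index and alias names here are only examples):

# example index/alias names only
curl -XPOST localhost:9200/_aliases -d'{
  "actions": [
    { "remove": { "index": "products-20160112", "alias": "products" } },
    { "add":    { "index": "products-20160113", "alias": "products" } }
  ]
}'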

When the cluster has first started up, it will complete a large index of 1M+ records in about 20 minutes.
However, after the cluster has been running for about 5 days to a week, the same type of index on the same type of data will go to 1 hour 20 minutes, and even indexes that usually take a couple of minutes to generate will take 20+.

The fix for it is to do a rolling restart of the cluster. This involves only restarting the ES Service on each node, not a reboot (we run on Windows).

Does anyone know why this would occur? What could happen that makes ES start to index a whole new index more slowly?

Thanks!

How are you monitoring your cluster? Having Marvel installed will give you an idea of what is happening on your nodes that may be causing this.
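
Even without Marvel, the cat APIs give a quick point-in-time snapshot (heap and load per node, doc counts and sizes per index), for example:

curl 'localhost:9200/_cat/nodes?v'
curl 'localhost:9200/_cat/indices?v'

They won't show history the way Marvel does, but they are an easy sanity check while the slowdown is happening.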

I remember a bug where reusing index names would cause issues with memory
management making indexing very slow. It's fixed now but I don't believe
the fix is in your version. It's worth searching for.

This is correct. There is a bug of this nature affecting 2.1.0. The fix was released in Elasticsearch 2.1.1.


We do have Marvel installed, but I don't see anything obvious from it. Anything in particular to look for?

Interesting. We don't reuse index names (they're all unique with a time stamp) but we do reuse alias names. Would this produce the same issue?

No, I'm pretty sure it would not (if there's any chance that you can upgrade to 2.1.1 to rule it out for certain, that would be helpful).

It sounds like there might be a different issue occurring here but it's not immediately obvious to me what it is. The golden key will be a minimal reproducible example that exhibits the slowdown.

Until then, I have a few questions.

Do the guests have a common host?

To be clear, you mean that the guests have 12 GB allocated? Is RAM over provisioned to the guests on the host? What heap size do the Elasticsearch processes run with? Do you see anything in the logs during the slowdowns about GC pauses?

This is an interesting pattern; can you give more detail? How many actual indices are live indexing at any moment? Is it constant?

We run each VM on a different host. The hosts themselves have very low CPU utilization.

12 GB RAM per VM. We set 6 GB as the heap. We do not over-provision RAM on the hosts; these VMs have static memory assigned rather than dynamic.

I haven't noticed anything odd in the logs but I will look again this morning for GC info.

The pattern is one that we have used for a long time dating back to Solr days.
Because of the nature of our data we cannot tell which records are new, so we reindex the whole set. To keep production running we don't touch the live index; we build a new one, verify the data looks good, and if it does, point the production alias at the new index and delete the old one.

So we run the build, point, delete-old pattern over and over again, as often as every hour or two per index.

We only allow a single index to be built at a time, and our cluster currently has about 39 such indexes. Most indexes take minutes to build.

The process works well and is automated, so we don't 'watch' it happen. We eventually start to see in our logs that indexing for each index starts to drag until we restart the ES services.

Confirmed the issue happens on 2.1.1.

This is all reasonable.

I'm doubting that this is the issue, but if we can rule it out that'd be helpful.

This makes sense. Thanks for the explanation.

So the cluster wigged out again and started indexing very slowly.
I restarted each ES Service on all 3 boxes and it returned to normal.

I looked in the log and I see tons of this (the complete error is in the attached image; the full text was too long to include):

[2016-01-13 00:00:06,137][INFO ][rest.suppressed          ] /_mget Params: {}
org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: no documents to get;
	at org.elasticsearch.action.ValidateActions.addValidationError(ValidateActions.java:29)
	at org.elasticsearch.action.get.MultiGetRequest.validate(MultiGetRequest.java:284)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:62)
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:347)
	at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:52)
	at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.doExecute(BaseRestHandler.java:83)
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:347)
	at org.elasticsearch.client.support.AbstractClient.multiGet(AbstractClient.java:555)

Any thoughts?

You are probably better off putting it in a gist.

It looks as though something is issuing an mget request without any body.

I can reproduce that message with

curl -XPOST localhost:9200/_mget -d'{}'

So my guess is something is doing that on your side.
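
For comparison, a request with at least one entry in the docs array passes validation (the index, type and id below are just placeholders):

curl -XPOST localhost:9200/_mget -d'{"docs":[{"_index":"myindex","_type":"mytype","_id":"1"}]}'

So I'd look for a code path on your side that can end up building an empty document list.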

This sure sounds similar to the issue I've been tracking here:

Interestingly enough, our daily process is to re-index the same data to a new daily index and then swap aliases. We do so little indexing outside of the nightly job that I honestly couldn't say whether we're seeing indexing slow down, but we are definitely seeing query slowdowns because of all the CPU ES burns for no apparent reason until we restart all of the nodes (our ES cluster is 99.99% searching).
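
If you want to see where that CPU is going while it's happening, the hot threads API is worth grabbing before restarting the nodes; it dumps stack samples from the busiest threads on each node, e.g.:

curl localhost:9200/_nodes/hot_threads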

This is almost surely caused by something on your side, not Elasticsearch, but I'm unconvinced it's the cause of the performance issues.

Thanks... we'll do a code review to see what we find there.

Have you had any luck tracking down the logs and seeing if garbage collections are getting in the way?
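
In 2.x, slow collections are logged by the monitor.jvm logger (lines containing [gc]), so searching the node logs for that should surface them. If the logs are a pain to dig through, the node stats API also exposes heap usage and cumulative GC counts/times per node, e.g.:

curl localhost:9200/_nodes/stats/jvm?pretty

Old-generation collection counts and times that keep climbing during the slow periods would be telling.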

We believe we found the source of the random mget error in our code.
It didn't resolve the overall issue of the system becoming extremely slow at indexing after a few days of indexing (search performance doesn't seem to be affected)...

We still see a lot of the following error in the logs though:
RemoteTransportException[[es2][10.250.250.232:9300][cluster:monitor/nodes/info[n]]]; nested: NotSerializableExceptionWrapper;
Caused by: NotSerializableExceptionWrapper[null]
at java.util.ArrayList.sort(Unknown Source)
at java.util.Collections.sort(Unknown Source)
at org.elasticsearch.action.admin.cluster.node.info.PluginsInfo.getInfos(PluginsInfo.java:55)
at org.elasticsearch.action.admin.cluster.node.info.PluginsInfo.writeTo(PluginsInfo.java:86)
at org.elasticsearch.action.admin.cluster.node.info.NodeInfo.writeTo(NodeInfo.java:284)
at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:97)
at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:75)
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:211)
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:207)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Is there somewhere specific I should look for GC issues? None of the logs contain anything I would take as a GC issue...

This is a known issue caused by concurrent monitoring requests (are you using Marvel?). It is fixed by #15541 and will be in the next patch release of the 2.1.x line of Elasticsearch, and it was indirectly fixed by some refactoring in the 2.2.x line (not yet available). I highly doubt that it's the cause of the slowdowns, though.