Intermittent index creation failure on 6.0.0-rc1

I have an integration test that creates a half dozen indices and then tests a bunch of stuff. A few times now, one of the indices fails to get created and I see this in the logs:

[2017-10-26T13:07:34,690][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [myindex] creating index, cause [api], templates [], shards [1]/[0], mappings [document]
[2017-10-26T13:07:34,706][WARN ][o.e.a.b.TransportShardBulkAction] [my8ovVE] unexpected error during the primary phase for action [indices:data/write/bulk[s]], request [BulkShardRequest [[myindex][0]] containing [index {[myindex][document][31694573011], source[{"parent_doc_id": 31694573011, "moar_doc_ids": []}]}]]

I'm running these tests locally on a Mac. Maybe related: about 8 different indices are created at around the same time via multiple processes.

I haven't found anyone complaining about this particular issue, so I'm mentioning it here in case it rings a bell for someone.

Probably not related, but check your available disk space.
In 6.0, if you have only 15% free, indices get marked as read-only. I hit that yesterday: https://github.com/dadoonet/fscrawler/commit/e029b14d8f8bb80e750c941905c810eeace19f9f
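
If you do run into it, freeing up disk space and then clearing the write block gets things going again. A minimal sketch with the Python client (the host is a placeholder, and this assumes the block was applied by the disk watermark):

import elasticsearch

es = elasticsearch.Elasticsearch("localhost:9200")  # placeholder host
# Clear the read-only block that the disk watermark applies; only do this after freeing space.
es.indices.put_settings(index="_all", body={"index.blocks.read_only_allow_delete": None})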

Thanks for replying! I have 823GB free so that's not the issue. :sunglasses:

It's only happening about 1 in 20 test runs so it's hard to reproduce or see a pattern. If I get more info over time, I'll post it here.

Could you post the actual exception that shows up in the logs after these two lines?

org.elasticsearch.index.IndexNotFoundException: no such index
	at org.elasticsearch.cluster.routing.RoutingTable.shardRoutingTable(RoutingTable.java:137) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.primary(TransportReplicationAction.java:797) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:732) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onNewClusterState(TransportReplicationAction.java:886) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:303) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:191) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateListeners$7(ClusterApplierService.java:506) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3527) [?:1.8.0_112]
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743) [?:1.8.0_112]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) [?:1.8.0_112]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:503) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:486) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:158) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:210) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]

Do you explicitly create the index before running this bulk request, or is the index created dynamically by the bulk request?

I'm actually not using the bulk API here to create the index. I create each index explicitly before using it; there is no dynamic index creation anywhere.

I don't know why this:
[2017-10-26T13:07:34,690][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [myindex] creating index, cause [api], templates [], shards [1]/[0], mappings [document]
would lead to this:
[2017-10-26T13:07:34,706][WARN ][o.e.a.b.TransportShardBulkAction] [my8ovVE] unexpected error during the primary phase for action [indices:data/write/bulk[s]], request [BulkShardRequest [[myindex][0]] containing [index {[myindex][document][31694573011], source[{"parent_doc_id": 31694573011, "moar_doc_ids": []}]}]]

@loren thanks for the report. That sounds like a bug.

You're saying:

one of the indices fails to get created

but nothing in the logs indicates that the index creation failed. Do you mean that the bulk operation failed?

In your test, are you using the bulk API or the single-document index API? (Both are internally mapped to the bulk API in v6, so this could point to an issue in how we convert from one to the other.)

Also can you maybe share the integration test or bits of it?

I'm using the single-document index API. I didn't know that this got converted under the hood to use the bulk API.
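
For concreteness, the kind of call my tests make is the first of these; as I understand it, both end up as the same shard-level bulk action server-side. This is just an illustrative sketch: the client setup is a placeholder, and the index name and document are copied from the log line above rather than from my actual tests.

import elasticsearch

es = elasticsearch.Elasticsearch("localhost:9200")  # placeholder host

# Single-document index API (what my tests use).
es.index(index="myindex", doc_type="document", id=31694573011,
         body={"parent_doc_id": 31694573011, "moar_doc_ids": []})

# The same write expressed via the bulk API.
es.bulk(body=[
    {"index": {"_index": "myindex", "_type": "document", "_id": 31694573011}},
    {"parent_doc_id": 31694573011, "moar_doc_ids": []},
])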

I can't share the tests, unfortunately, but the setup is pretty straightforward to describe. There are 5 Python processes that get launched at the same time. Four of them create indices A, B, C, and D using the single-document index API via the 5.4.0 Python client. The fifth process creates index E and then index F. All 6 calls generally happen within the span of 300ms. In the stack trace I included above, it was index F that triggered the error. I think it's always index F that triggers it, but I don't recall for sure and can't reproduce it.

And yes, to be clear, after the test runs (and fails of course), that one index is not present in the cluster.

I'm confused by the terminology here. You say that

create indices A, B, C, and D using the single-document index API

and then you say that you're not using dynamic index creation.

Creating an index is different from indexing a document into Elasticsearch. From the description above, it is not clear to me which one you mean here.

When you say:

one index is not present in the cluster

Do you mean that the document is not present, or the index is not present?

If the index is not present, then there should be a failure in the logs indicating so. The logs you've shown don't contain such a failure; they do, however, indicate that a document could not be indexed. Can you provide the full logs?

Can you also repeat the experiment running Elasticsearch with assertions enabled, i.e. with the "-ea" flag added to the config/jvm.options file?

I create the indices like this via the Python client:

import elasticsearch

es = elasticsearch.Elasticsearch(host, **ELASTICSEARCH_DEFAULTS)
es.indices.create(index, body, ignore=400)  # ignore=400 keeps the client from raising if the index already exists (HTTP 400)

In the failure scenario I'm describing, the index does not get created and that stack trace gets printed. I do not see any log output indicating that a document could not be indexed.

The only other errors that show up are in my test code output. Here is all the surrounding context that I have from the ES logs:

[2017-10-26T13:07:33,169][INFO ][o.e.c.m.MetaDataDeleteIndexService] [my8ovVE] [index2/ewaQnzcYS9iOaHsYbcE5JA] deleting index
[2017-10-26T13:07:33,169][INFO ][o.e.c.m.MetaDataDeleteIndexService] [my8ovVE] [index3/7aP6Wx8yRXW_1i7Ox2736w] deleting index
[2017-10-26T13:07:33,169][INFO ][o.e.c.m.MetaDataDeleteIndexService] [my8ovVE] [myindex/Dahm9qx8Q3mJOsi1JxnneA] deleting index
[2017-10-26T13:07:33,170][INFO ][o.e.c.m.MetaDataDeleteIndexService] [my8ovVE] [index4/Pu8o8VXVTUWYI-PfNkeHew] deleting index
[2017-10-26T13:07:33,170][INFO ][o.e.c.m.MetaDataDeleteIndexService] [my8ovVE] [index5/2f1iHrIcSP2tYKIr24-uag] deleting index
[2017-10-26T13:07:33,170][INFO ][o.e.c.m.MetaDataDeleteIndexService] [my8ovVE] [index5/PxLJndP7SESctud1bC3k5w] deleting index
[2017-10-26T13:07:34,358][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [index5] creating index, cause [api], templates [], shards [1]/[0], mappings [docs]
[2017-10-26T13:07:34,378][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [index5] creating index, cause [api], templates [], shards [1]/[0], mappings [urls]
[2017-10-26T13:07:34,529][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [docs] creating index, cause [api], templates [], shards [1]/[0], mappings [docs]
[2017-10-26T13:07:34,541][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [index3] creating index, cause [api], templates [], shards [1]/[0], mappings [status]
[2017-10-26T13:07:34,565][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [errors] creating index, cause [api], templates [], shards [1]/[0], mappings [docs]
[2017-10-26T13:07:34,667][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [index4] creating index, cause [api], templates [], shards [1]/[0], mappings [document]
[2017-10-26T13:07:34,690][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [myindex] creating index, cause [api], templates [], shards [1]/[0], mappings [document]
[2017-10-26T13:07:34,706][WARN ][o.e.a.b.TransportShardBulkAction] [my8ovVE] unexpected error during the primary phase for action [indices:data/write/bulk[s]], request [BulkShardRequest [[myindex][0]] containing [index {[myindex][document][31694573011], source[{"parent_doc_id": 31694573011, "moar_doc_ids": []}]}]]
org.elasticsearch.index.IndexNotFoundException: no such index
	at org.elasticsearch.cluster.routing.RoutingTable.shardRoutingTable(RoutingTable.java:137) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.primary(TransportReplicationAction.java:797) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:732) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onNewClusterState(TransportReplicationAction.java:886) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:303) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:191) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateListeners$7(ClusterApplierService.java:506) ~[elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3527) [?:1.8.0_112]
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743) [?:1.8.0_112]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) [?:1.8.0_112]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:503) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:486) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:158) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:210) [elasticsearch-6.0.0-rc1.jar:6.0.0-rc1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
[2017-10-26T13:07:37,020][INFO ][o.e.c.m.MetaDataCreateIndexService] [my8ovVE] [index2] creating index, cause [api], templates [], shards [1]/[0], mappings [docs]

I will try running with -ea for future tests.

OK, I think I know what's happening. The issue you're seeing can be caused by indexing into an index while it is being deleted and an index with the same name is created shortly afterwards.

When a write/bulk request is handled by a node, we resolve the shard id that the request should go to. The request is then routed to the primary shard of that resolved index. Each index has a unique associated UUID, generated when the index is created. This means that the UUID of the deleted index is different from that of the index that was created afterwards with the same name.

When the request is resolved and rerouted to the node holding the primary, it carries the UUID of the old index. When it finally arrives at the primary, the index with that UUID no longer exists; there is only an index with the same name but a different UUID. Part of our code incorrectly assumes that the index name is enough to validate the incoming request on the primary. When we then look up the shard in the routing table, we cannot find it, because the UUIDs don't match, and that produces the error you're seeing.
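
A rough way to picture the race from the client side (purely illustrative: the host, index name, and document are made up, hitting the window depends on timing, and this is not the exact code path of your tests):

import threading
import elasticsearch

es = elasticsearch.Elasticsearch("localhost:9200")  # placeholder host

def recreate():
    # Delete the index and immediately recreate it under the same name;
    # the recreated index gets a brand-new UUID.
    es.indices.delete(index="myindex", ignore=404)
    es.indices.create(index="myindex", ignore=400)

def write():
    # Concurrently index a document. If the request was resolved against the
    # old UUID but reaches the primary after the recreate, the routing table
    # lookup fails with IndexNotFoundException: no such index.
    try:
        es.index(index="myindex", doc_type="document", id=1, body={"parent_doc_id": 1})
    except elasticsearch.ElasticsearchException as exc:
        print(exc)

threads = [threading.Thread(target=recreate), threading.Thread(target=write)]
for t in threads:
    t.start()
for t in threads:
    t.join()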

While we can (and should) certainly fix this issue, the test sounds a bit fragile to me. You're deleting and recreating an index concurrently with indexing into that same index. I'm not sure what behavior that is meant to test.

The integration test on my end doesn't quite do what you describe, but it has changed a lot in a short period of time.

I guess the thing to do on my end is just to run with -ea and see if it happens again.
