Unable to index new documents after cluster restart

Hi,

Everything was working fine until I tried to index a large document and ran
into a Java heap out-of-memory error, so I increased ES_MAX_MEM and
restarted the cluster. Since then I can query/search the data I indexed
before, but I am unable to index anything new. For example, here is the
error I get for a PUT:

curl -XPUT 'http://localhost:9200/keywords/kw/11917_97095_822656_21256827' -d
'{"KeywordID":"21256827","AdGroupID":"822656","NwAccountID":"11917","NwXRefCode":"757789966","NwXRefStatus":"Active","NetworkID":"1"}'
{"error":"UnavailableShardsException[[keywords][4] [4] shardIt, [1] active :
Timeout waiting for [1m], request: index
{[keywords][kw][11917_97095_822656_21256827],
source[{"KeywordID":"21256827","AdGroupID":"822656","NwAccountID":"11917","NwXRefCode":"757789966","NwXRefStatus":"Active","NetworkID":"1"}}]}]","status":500}

I get the same error for bulk operations as well.

What does this mean?
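For anyone hitting this: UnavailableShardsException means the write could not reach an active copy of the target shard within the timeout, which usually means the shard is still recovering or unassigned. The allocation state can be inspected with the cluster health API (default host/port assumed; the index name `keywords` is taken from the error above):

```shell
# Cluster-level view: red means unassigned primaries, yellow means replicas still recovering
curl 'http://localhost:9200/_cluster/health?pretty=true'

# Per-index detail, including counts of initializing and unassigned shards
curl 'http://localhost:9200/_cluster/health/keywords?level=indices&pretty=true'
```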

Hari

I see this exception in the logs of some nodes:

[2011-06-24 11:28:38,359][WARN ][index.engine.robin ] [Ozymandias] [keywords][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [keywords][0] CurrentState[CLOSED]
    at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:653)
    at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:203)
    at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
    at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
[2011-06-24 11:28:38,371][WARN ][index.engine.robin ] [Ozymandias] [keywords][3] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [keywords][3] CurrentState[CLOSED]
    at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:653)
    at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:203)
    at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
    at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

On another node, I saw this slightly different exception:

[2011-06-24 11:28:44,936][WARN ][index.engine.robin ] [Puishannt] [keywords][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.FlushNotAllowedEngineException: [keywords][0] Recovery is in progress, flush is not allowed
    at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
    at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:203)
    at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
    at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

How long does recovery take after a cluster restart? My total index size is
about 800 GB across 7 nodes, and I have only one index (settings: 7 shards,
3 replicas). I tried to force-open the index, but I still get the same error.
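Rather than guessing at recovery time, progress can be watched with the cluster health API; `wait_for_status` makes the call block until the cluster reaches the requested state or the timeout expires (host, port, and the 10-minute timeout below are illustrative assumptions):

```shell
# Poll once: reports status plus how many shards are still initializing/unassigned
curl 'http://localhost:9200/_cluster/health?pretty=true'

# Or block for up to 10 minutes until every shard (primaries and replicas) is active
curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty=true'
```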

Hari

On Fri, Jun 24, 2011 at 10:04 PM, Hari Shankar shaan.hari@gmail.com wrote:


This seems to be fixed now; it has been about 3 hours since I restarted the
cluster. The last time I tried (unsuccessfully) was about 30 minutes after
the restart. Is there a way to estimate approximately how long it will take
to recover an index after a cluster restart or index close? Is it possible
to specify an order in which indices are recovered after a cluster restart,
say so that some of the smaller indices recover first? Or is it a parallel
process?
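On the restart-time question, full-cluster-restart recovery is gated by the gateway settings, which control when the cluster begins recovering at all; without them, shards recovered onto the first few nodes get rebalanced as the remaining nodes join, which slows everything down. A sketch for a 7-node cluster like this one, in elasticsearch.yml (the values are assumptions, not recommendations):

```yaml
# Wait until at least 5 nodes have joined before considering recovery
gateway.recover_after_nodes: 5
# If all 7 expected nodes are present, start recovery immediately
gateway.expected_nodes: 7
# Otherwise, once recover_after_nodes is met, wait this long before starting
gateway.recover_after_time: 5m
```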

Hari

On Fri, Jun 24, 2011 at 10:16 PM, Hari Shankar shaan.hari@gmail.com wrote:

