Index corruption when uploading a large number of documents (4 billion+)

Hi,
We have a 98-node ES cluster, each node with 32 GB RAM, of which 16 GB is reserved
for ES via the config file. The index has 98 shards with 2 replicas.
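
(For reference, a sketch of the relevant settings; this assumes index-level
defaults in elasticsearch.yml and a heap set via ES_HEAP_SIZE, which may not
match exactly how our boxes are configured:

    # elasticsearch.yml (illustrative)
    index.number_of_shards: 98
    index.number_of_replicas: 2
    # The 16 GB heap is set outside this file, e.g. ES_HEAP_SIZE=16g.
)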

On this cluster we are loading a large number of documents (about 10 billion
when done). In this use case about 40 million documents are generated per
hour, and we are pre-loading several days' worth of documents to prototype how
ES will scale and what its query performance will be.

Right now we are facing problems getting the data loaded. Indexing is turned
off. We use the NEST client with a batch size of 10k. To speed up the load, we
distribute the hourly data across all 98 nodes and insert in parallel. This
worked fine for a few hours, until we reached about 4.5 billion documents in
the cluster. Roughly, each loader does the equivalent of the sketch below.
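
(Sketched in Python with the official client rather than our actual NEST/C#
code; the node address, type name, and document fields are placeholders:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["10.46.153.84:9200"])  # this loader's assigned node

    def actions(docs):
        # One bulk action per document; ES 1.x still requires a _type.
        for doc in docs:
            yield {"_index": "agora_v1", "_type": "event", "_source": doc}

    # One hour's worth of documents, inserted in 10k batches.
    docs = ({"hour": 0, "payload": "..."} for _ in range(40 * 1000 * 1000))
    helpers.bulk(es, actions(docs), chunk_size=10000)
)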

After that the cluster state went red. The pending-tasks CAT API shows errors
like the ones below. CPU, disk, and memory all look fine on the nodes.
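
(A sketch of pulling the cluster status and the task backlog; the Python
client is shown for illustration, but any HTTP client against /_cluster/health
and /_cat/pending_tasks gives the same output:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["10.46.153.84:9200"])
    print(es.cluster.health()["status"])   # currently "red"
    print(es.cat.pending_tasks(v=True))    # shows the shard-failed tasks below
)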

Why are we getting these errors? Any help is greatly appreciated, since this
blocks prototyping ES for our use case.

thanks
Darshat

Sample errors:

source : shard-failed ([agora_v1][24], node[00ihc1ToRiqMDJ1lou1Sig], [R], s[INITIALIZING]),
    reason [Failed to start shard, message [RecoveryFailedException[[agora_v1][24]: Recovery
    failed from [Shingen Harada][RDAwqX9yRgud9f7YtZAJPg][CH1SCH060051438][inet[/10.46.153.84:9300]]
    into [Elfqueen][00ihc1ToRiqMDJ1lou1Sig][CH1SCH050053435][inet[/10.46.182.106:9300]]];
    nested: RemoteTransportException[[Shingen Harada][inet[/10.46.153.84:9300]][internal:index/shard/recovery/start_recovery]];
    nested: RecoveryEngineException[[agora_v1][24] Phase[1] Execution failed];
    nested: RecoverFilesRecoveryException[[agora_v1][24] Failed to transfer [0] files with total size of [0b]];
    nested: NoSuchFileException[D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6r]; ]]

AND

source : shard-failed ([agora_v1][95], node[PUsHFCStRaecPA6MuvJV9g], [P], s[INITIALIZING]),
    reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[agora_v1][95]
    failed to fetch index version after copying it over];
    nested: CorruptIndexException[[agora_v1][95] Preexisting corrupted index
    [corrupted_1wegvS7BSKSbOYQkX9zJSw] caused by: CorruptIndexException[Read past EOF
    while reading segment infos]
    EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\95\index\segments_11j")]
org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
    at org.elasticsearch.index.store.Store.access$400(Store.java:80)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
---snip more stack trace-----


Why did you snip the stack trace? Can you provide all the information?

On Thu, Jan 8, 2015 at 10:37 PM, Darshat darshat@outlook.com wrote:

---snip quoted original message-----

Hi, the full stack trace is below (from the pending tasks API). We are using ES 1.4.1.

insert_order : 69862
priority : HIGH
source : shard-failed ([agora_v1][24], node[SEIBtFznTtGpLFPgCLgW4w], [R], s[INITIALIZING]),
    reason [Failed to start shard, message [CorruptIndexException[[agora_v1][24] Preexisting
    corrupted index [corrupted_LrKHKRF7Q2KuL15TT_hPvw] caused by:
    CorruptIndexException[Read past EOF while reading segment infos]
    EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6w")]
org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
    at org.elasticsearch.index.store.Store.access$400(Store.java:80)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:568)
    at org.elasticsearch.index.store.Store.getMetadata(Store.java:186)
    at org.elasticsearch.index.store.Store.getMetadataOrEmpty(Store.java:150)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:152)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:138)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:59)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:278)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:269)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6w")
    at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:81)
    at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:85)
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:124)
    ... 14 more
]]]
executing : True
time_in_queue_millis : 52865
time_in_queue : 52.8s

insert_order : 69863
priority : HIGH
source : shard-failed ([agora_v1][24], node[SEIBtFznTtGpLFPgCLgW4w], [R], s[INITIALIZING]),
    reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[agora_v1][24]
    Preexisting corrupted index [corrupted_LrKHKRF7Q2KuL15TT_hPvw] caused by:
    CorruptIndexException[Read past EOF while reading segment infos]
    EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6w")]
org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
    at org.elasticsearch.index.store.Store.access$400(Store.java:80)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:568)
    at org.elasticsearch.index.store.Store.getMetadata(Store.java:186)
    at org.elasticsearch.index.store.Store.getMetadataOrEmpty(Store.java:150)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:152)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:138)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:59)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:278)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:269)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6w")
    at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:81)
    at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:85)
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:124)
    ... 14 more
]]]
executing : False
time_in_queue_millis : 52862
time_in_queue : 52.8s

insert_order : 69865
priority : HIGH
source : shard-failed ([kibana-int][88], node[adjp-WHHSP6kWEiPd3HkeQ], [R], s[INITIALIZING]),
    reason [Failed to start shard, message [RecoveryFailedException[[kibana-int][88]: Recovery
    failed from [Quasimodo][spfLOfnjTeiGwrYPMIiRjg][CH1SCH060021734][inet[/10.46.208.169:9300]]
    into [Hyperion][adjp-WHHSP6kWEiPd3HkeQ][CH1SCH050051642][inet[/10.46.216.169:9300]]];
    nested: RemoteTransportException[[Quasimodo][inet[/10.46.208.169:9300]][internal:index/shard/recovery/start_recovery]];
    nested: RecoveryEngineException[[kibana-int][88] Phase[1] Execution failed];
    nested: RecoverFilesRecoveryException[[kibana-int][88] Failed to transfer [0] files with total size of [0b]];
    nested: NoSuchFileException[D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\kibana-int\88\index\segments_2]; ]]
executing : False
time_in_queue_millis : 52860
time_in_queue : 52.8s

Can you try with cluster.routing.allocation.disable_allocation: true? This
tells ES not to allocate shards automatically. Also, are the hard disks good?
Perhaps look for disk failures or bad tracks.
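
(A minimal sketch of applying that setting via the cluster settings API, using
the Python client; the node address is a placeholder, and on 1.4 the newer
equivalent setting is cluster.routing.allocation.enable: "none":

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["10.46.153.84:9200"])  # any node in the cluster

    # Transient setting: reverts after a full cluster restart.
    es.cluster.put_settings(body={
        "transient": {"cluster.routing.allocation.disable_allocation": True}
    })

With allocation disabled, the failed replicas should stop bouncing between
nodes, which makes it easier to inspect the shard files and disks.)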

