Corruption when indexing a large number of documents (4 billion+)

Hi,
We have a 98-node Elasticsearch cluster, each node with 32 GB RAM, of which
16 GB is reserved for ES via the config file. The index has 98 shards with 2 replicas.

On this cluster we are loading a large number of documents (about 10 billion
when done). About 40 million documents are generated per hour, and we are
pre-loading several days' worth of documents to prototype how ES will scale
and what its query performance looks like.

Right now we are facing problems getting the data pre-loaded. Index refresh is
turned off. We use the NEST client, with a batch size of 10k. To speed up the
load, we distribute the hourly data across all 98 nodes to insert in
parallel. This worked OK for a few hours, until we reached 4.5B documents in the
cluster.

After that the cluster state went red. The pending-tasks cat API
shows errors like the ones below. CPU, disk, and memory all look fine on the nodes.

Why are we getting these errors, and is there a best practice we should
follow? Any help is greatly appreciated, since this blocks prototyping ES for
our use case.

thanks
Darshat

Sample errors:

source : shard-failed ([agora_v1][24], node[00ihc1ToRiqMDJ1lou1Sig], [R], s[INITIALIZING]),
reason [Failed to start shard, message [RecoveryFailedException[[agora_v1][24]: Recovery
failed from [Shingen Harada][RDAwqX9yRgud9f7YtZAJPg][CH1SCH060051438][inet[/10.46.153.84:9300]]
into [Elfqueen][00ihc1ToRiqMDJ1lou1Sig][CH1SCH050053435][inet[/10.46.182.106:9300]]];
nested: RemoteTransportException[[Shingen Harada][inet[/10.46.153.84:9300]][internal:index/shard/recovery/start_recovery]];
nested: RecoveryEngineException[[agora_v1][24] Phase[1] Execution failed];
nested: RecoverFilesRecoveryException[[agora_v1][24] Failed to transfer [0] files with total size of [0b]];
nested: NoSuchFileException[D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6r]; ]]

AND

source : shard-failed ([agora_v1][95], node[PUsHFCStRaecPA6MuvJV9g], [P], s[INITIALIZING]),
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[agora_v1][95]
failed to fetch index version after copying it over];
nested: CorruptIndexException[[agora_v1][95] Preexisting corrupted index
[corrupted_1wegvS7BSKSbOYQkX9zJSw] caused by: CorruptIndexException[Read past EOF while reading segment infos]
EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\95\index\segments_11j")]
org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
    at org.elasticsearch.index.store.Store.access$400(Store.java:80)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
---snip more stack trace-----

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0f24b939-2cba-41a9-8de8-49565f77e567%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Honestly, with this sort of scale you should be thinking about support
(disclaimer: I work for Elasticsearch support).

However, let's see what we can do:
What version of ES, java?
What are you using to monitor your cluster?
How many GB is that index?
Is it in one massive index?
How many GB in your data in total?
Why do you have 2 replicas?
Are you searching while indexing, or just indexing the data? If it's the
latter, you might want to try disabling replicas and setting the index
refresh rate to -1 for the index, inserting your data, and then turning
refresh (and replicas) back on. That's best practice for large amounts of
indexing.

Also, consider dropping your bulk size down to 5K; that's generally
considered the upper limit for bulk API batches.
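The recipe above can be sketched in a few lines. This is illustrative only: the original poster uses the NEST (C#) client, and the client calls shown in comments assume the official elasticsearch-py library; the restore values after the load are assumptions, not from the thread.

```python
# Sketch of the bulk-load advice above: replicas and refresh off while
# loading, restored afterwards, with batches capped at 5K documents.

# Settings applied before the load.
BULK_LOAD_SETTINGS = {
    "index": {
        "number_of_replicas": 0,   # no replication while loading
        "refresh_interval": "-1",  # no periodic refresh while loading
    }
}

# Settings restored afterwards (values are assumptions).
RESTORE_SETTINGS = {
    "index": {
        "number_of_replicas": 1,
        "refresh_interval": "1s",
    }
}

BATCH_SIZE = 5000  # suggested upper limit for bulk API batches

def batches(docs, size=BATCH_SIZE):
    """Yield lists of at most `size` documents for the bulk API."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Against a live cluster (elasticsearch-py; index name from the thread)
# the flow would be roughly:
#   es.indices.put_settings(index="agora_v1", body=BULK_LOAD_SETTINGS)
#   for batch in batches(doc_stream):
#       helpers.bulk(es, ({"_index": "agora_v1", "_source": d} for d in batch))
#   es.indices.put_settings(index="agora_v1", body=RESTORE_SETTINGS)
```

The same two settings updates can equally be issued from NEST or plain curl; the important part is the order: settings off, load, settings back on.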


Hi Mark,
Thanks for the reply. I need to prototype and demonstrate at this scale first to ensure feasibility. Once we've proven ES works for this use case, it's quite possible we'd engage with support for production.

Regarding your questions:
What version of ES, java?
[Darshat] ES 1.4.1, JVM 1.8.0 (the latest I could find from Oracle for Win64)

What are you using to monitor your cluster?

Not much, really. I tried installing Marvel after we ran into issues, but mostly I've been looking at the cat and indices APIs.

How many GB is that index?

About 1 GB per million entries. We don't have string fields, but we have many numeric fields on which aggregations are needed. With 4.5 billion docs, the total index size was about 4.5 TB spread over the 98 nodes.

Is it in one massive index?

Yes. We need aggregations over this data across multiple fields.

How many GB in your data in total?

It can be on the order of 30 TB.

Why do you have 2 replicas?

Data loss is generally considered not OK. However, for our next run, as you suggest, we will start with 0 replicas and raise the count after the bulk load is done.

Are you searching while indexing, or just indexing the data? If it's the latter, you might want to try disabling replicas and setting the index refresh rate to -1 for the index, inserting your data, and then turning refresh back on. That's best practice for large amounts of indexing.

Just indexing, to get the historical data up there. We did set refresh to -1 before we began the upload.

Also, consider dropping your bulk size down to 5K; that's generally considered the upper limit for bulk API batches.

We are going to attempt another upload with these changes. I also set index_buffer_size to 40% in case it helps.
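For reference, the two knobs mentioned in this exchange live in different places in ES 1.x: `refresh_interval` is a per-index dynamic setting (changed via the update-settings API, as above), while the indexing buffer is a node-level setting. A sketch of the node-side piece, assuming it goes in each node's `elasticsearch.yml` (40% is the value tried in this thread; the default is 10%):

```yaml
# elasticsearch.yml (per node) -- ES 1.x
# Share of the heap reserved for indexing buffers across all shards
# on the node; raising it can help during heavy bulk loads.
indices.memory.index_buffer_size: 40%
```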


Hi Mark,
We restarted our data upload process (setting replicas to 0 and the batch
size to 5k). It proceeded for about 24 hours; somewhere after that we started
to see issues with no master being present in the cluster.

The log on the master node has lines like those below. It appears there may
have been a network issue where some nodes were not pingable and the count
fell below the quorum threshold for master election, so the master effectively
demoted itself. What's not clear is why a new master was not elected for
several hours after this. The cluster cannot be queried. What's the best way
to recover from this situation?

[2015-01-11 19:12:12,612][WARN ][discovery.zen ] [Earthquake]
not enough master nodes, current nodes:
{[Eliminator][8_j19XFyQcupAkRUYCjI6w][CH1SCH050103339][inet[/10.47.44.70:9300]],[Dragoness][5wL9nCAESDOs3jLMBzvunQ][CH1SCH060022834][inet[/10.47.50.155:9300]],[Ludi][7zEEioZIRMigz25356K7Yg][CH1SCH050051103][inet[/10.46.196.234:9300]],[The
Angel][BbFDPq87TMiVgMeNWeJIFg][CH1SCH060040431][inet[/10.47.33.247:9300]],[Helio][5UekTGE2Qq6YcyDb-daxmw][CH1SCH050103338][inet[/10.46.238.181:9300]],[Stacy
X][_sCSZZVtTGmoowmg-1ggVQ][CH1SCH050051544][inet[/10.46.178.74:9300]],[Thunderfist][wKHKYQ9qT9GPLdt6tCQ3Rg][CH1SCH060033137][inet[/10.47.58.209:9300]],[Dominic
Petros][E85WKXuLRDC4_WXzoyaTnQ][CH1SCH060021537][inet[/10.46.234.92:9300]],[Umbo][YFysv6iRSLWSJBax5ObMHg][CH1SCH060021735][inet[/10.46.243.90:9300]],[Earthquake][-8C9YSAbQMy4THZIgV7qUw][CH1SCH060051026][inet[/10.46.153.118:9300]],[Atum][0DWVWBrZR2OFIEQTq9QBBw][CH1SCH050051545][inet[/10.47.50.232:9300]],[Baron
Von
Blitzschlag][mADN2KUYS2uPiuSl4qMHIw][CH1SCH060040630][inet[/10.47.42.139:9300]],[Foolkiller][Iw9PyArIR2OCDaCm1gW-Vw][CH1SCH060040629][inet[/10.46.149.174:9300]],[The
Grip][38K3nW5wR7aUCkR6i8BCiQ][CH1SCH050051645][inet[/10.46.203.126:9300]],[Shamrock][aObajuGRShKIA9TXF5_yQg][CH1SCH050071331][inet[/10.46.194.122:9300]],[Orphan][dCC94_8VSHqP7_hFf3r3Qg][CH1SCH060051535][inet[/10.47.46.65:9300]],[Tombstone][dq1NNI8IREuPig2qaT27Xw][CH1SCH050102843][inet[/10.47.54.208:9300]],[Logan][C5bigQQ1QomGpExc9NQq-w][CH1SCH060040333][inet[/10.46.164.129:9300]],[Rama-Tut][gLnUtgR6Q12r4oJKUiKGZQ][CH1SCH050051642][inet[/10.46.216.169:9300]],[Grotesk][Dc61zzt9QRC6mYDeswKugw][CH1SCH050051640][inet[/10.47.9.152:9300]],[Machine
Teen][O39QiBsaQ6iq1kJvOw_QDA][CH1SCH060021439][inet[/10.46.242.110:9300]],[Edwin
Jarvis][31cJWsbnRDuvWNJ8X5ybJQ][CH1SCH050063532][inet[/10.46.190.247:9300]],[Harry
Osborn][wBaDgNdJSkmW5vCAfODQXA][CH1SCH060021440][inet[/10.47.20.183:9300]],[Brain
Drain][Oy4Mo_2HSO6Y_fX3C11YBw][CH1SCH060040433][inet[/10.47.0.136:9300]],[Romany
Wisdom][20Tc5x2ATdC3o6AdD7f67w][CH1SCH060031345][inet[/10.47.3.169:9300]],[Jekyll][lS_ugcBORj-485okkVScPg][CH1SCH060033044][inet[/10.46.41.179:9300]],[Ajaxis][q4Shk4fhSWuYiZVq2oCnWw][CH1SCH060021243][inet[/10.46.221.243:9300]]{master=false},[Cheetah][GH_6-7OQQ7CH1YrObU6wVA][CH1SCH060051532][inet[/10.46.150.153:9300]],[Marduk
Kurios][Aqfc5EUIQYmvacnY6sGEpg][CH1SCH050051546][inet[/10.46.162.240:9300]],[Left
Hand][VloLsymETGmy9Yu14TkygA][CH1SCH060051521][inet[/10.47.63.194:9300]],[Washout][tHz2w8mGRmKO9W7vUljXrA][CH1SCH050063531][inet[/10.47.54.30:9300]],[Fixer][HH3gxjFoSxeHbvF4SMnVsw][CH1SCH050051643][inet[/10.46.173.163:9300]],[Tyrannus][g1znl166QaO84xJjsABsfQ][CH1SCH060033139][inet[/10.46.42.62:9300]],[Master
Pandemonium][p7ALngKoTrW9NaTKIyP1XA][CH1SCH060051421][inet[/10.47.23.193:9300]],[Digitek][z5SzLFtlTWCL0R5cJQWqUw][CH1SCH050051641][inet[/10.46.154.21:9300]],[Jane
Kincaid][UtjfYoxjQ2mqa40vX_eGhQ][CH1SCH060041325][inet[/10.47.2.144:9300]],[Baron
Samedi][a-gBYy2vTFKgdjwG6jWeSA][CH1SCH060033138][inet[/10.47.63.100:9300]],[Fan
Boy][-iehboPZTHud5kttbYlQyQ][CH1SCH060051438][inet[/10.46.153.84:9300]],[Unseen][PB4SZuwUTwaJtp43iMIArg][CH1SCH060021739][inet[/10.46.217.50:9300]],[Elektro][ZsVkXHPHRJ-VCmvtbb_PHA][CH1SCH050051542][inet[/10.46.179.1:9300]],[Vashti][K2BO32gNTru8xUPqETOL-g][CH1SCH050063239][inet[/10.46.213.4:9300]],}

[2015-01-11 19:12:12,612][INFO ][cluster.service ] [Earthquake]
removed
{[Blackheart][yVhuAFprS3SwXZoKsfbEPg][CH1SCH060040527][inet[/10.46.150.20:9300]],},
reason:
zen-disco-node_failed([Blackheart][yVhuAFprS3SwXZoKsfbEPg][CH1SCH060040527][inet[/10.46.150.20:9300]]),
reason failed to ping, tried [3] times, each with maximum [30s] timeout

[2015-01-11 19:12:12,633][ERROR][cluster.action.shard ] [Earthquake]
unexpected failure during [shard-failed ([agora_v1][23],
node[g1znl166QaO84xJjsABsfQ], relocating [F9SdmzilSHiuib4jihKstw], [P],
s[RELOCATING]), reason [Failed to perform [indices:data/write/bulk[s]] on
replica, message
[SendRequestTransportException[[Klaw][inet[/10.46.158.143:9300]][indices:data/write/bulk[s][r]]];
nested: NodeNotConnectedException[[Klaw][inet[/10.46.158.143:9300]] Node
not connected]; ]]]
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: no
longer master. source: [shard-failed ([agora_v1][23],
node[g1znl166QaO84xJjsABsfQ], relocating [F9SdmzilSHiuib4jihKstw], [P],
s[RELOCATING]), reason [Failed to perform [indices:data/write/bulk[s]] on
replica, message
[SendRequestTransportException[[Klaw][inet[/10.46.158.143:9300]][indices:data/write/bulk[s][r]]];
nested: NodeNotConnectedException[[Klaw][inet[/10.46.158.143:9300]] Node
not connected]; ]]]
at
org.elasticsearch.cluster.ClusterStateUpdateTask.onNoLongerMaster(ClusterStateUpdateTask.java:53)
at
org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:324)
at
org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
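The "not enough master nodes" line above is zen discovery's split-brain guard kicking in. In ES 1.x the relevant setting is `discovery.zen.minimum_master_nodes`, which should be set to a quorum of the master-eligible nodes; the fault-detection timeouts in the log ("tried [3] times, each with maximum [30s] timeout") are also configurable. A sketch, assuming 7 master-eligible nodes (the values below are illustrative, not from the thread's actual config):

```yaml
# elasticsearch.yml -- ES 1.x zen discovery
# With N master-eligible nodes, a quorum of (N / 2) + 1 prevents a
# network partition from electing two masters. For N = 7:
discovery.zen.minimum_master_nodes: 4

# Fault detection: how hard to try before declaring a node gone
# (3 retries x 30s matches the behavior seen in the log above).
discovery.zen.fd.ping_retries: 3
discovery.zen.fd.ping_timeout: 30s
```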


So, an update on this. I set aside 7 dedicated master-eligible nodes and
ensured the data-upload connection pool only round-robins over the data
nodes. I updated the batch size to 5k. With this, an upload of about 6 billion
documents (5 TB of data) went through in about 12 hours.

Additionally, on the Windows platform there is an issue with setting the heap
size via ES_HEAP_SIZE in PowerShell. I noticed frequent OOMs on queries, and it
turns out the setting somehow hadn't taken effect during installation via my
PowerShell scripts. I went and fixed this on each node.

Now the cluster looks much more stable. Aggregation times are still a little
disappointing (20 s or thereabouts for our scenarios), so we don't yet meet
our internal target of under 5 s. I'm going to experiment with turning nodes
off to see how the node count affects aggregation times.
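The dedicated-master layout described here maps onto ES 1.x node role flags. A sketch (the 7-master / rest-data split mirrors this post; the quorum value is the usual (N / 2) + 1 rule, not something stated in the thread):

```yaml
# elasticsearch.yml on the 7 dedicated master-eligible nodes:
node.master: true
node.data: false

# On the data nodes (the ones the upload pool round-robins over),
# the flags are reversed:
#   node.master: false
#   node.data: true

# Quorum of the 7 master-eligible nodes: (7 / 2) + 1 = 4
discovery.zen.minimum_master_nodes: 4
```

Keeping the bulk-indexing clients pointed only at data nodes, as described above, spares the masters from coordinating bulk traffic while they manage cluster state.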

