We are building an ES index to archive billions of docs. The index is
distributed across 4 servers, each with 4 CPUs and 32 GB of memory, and
we split the index into 32 shards.
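For context, the index was created with something like the following (a
sketch: the "boards" index name comes from the reroute command later in
this thread, and the replica count is an assumption; only the 32-shard
split is stated above):

# create the archive index with 32 primary shards
# ("boards" and number_of_replicas are assumptions, not stated in the thread)
curl -XPUT 'localhost:9200/boards' -d '{
  "settings": {
    "index": {
      "number_of_shards": 32,
      "number_of_replicas": 1
    }
  }
}'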
When we began indexing two weeks ago, performance was around 6K docs per
second. Now, with 3 billion docs already indexed and total disk usage at
about 1.2 TB, the indexing speed has dropped to 300 docs/second.
Initially we set refresh_interval to 120s, but in the past two days the
ES server randomly dropped one shard and the cluster health status went
red. We had to decrease the refresh_interval to keep the cluster stable.
When the cluster dropped the shard, the log contained no error info at all.
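For reference, refresh_interval can be changed on a live index through
the index settings API; the value shown here is only an example of
lowering it, since the thread does not say what value was actually used:

# lower the refresh interval on the live index
# (the 30s value is illustrative)
curl -XPUT 'localhost:9200/boards/_settings' -d '{
  "index": { "refresh_interval": "30s" }
}'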
Do you use monitoring tools for heap/CPU/GC activity? What is the max heap
size? Is the heap exhausted? How large are the segments - is merging the issue?
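For example, heap and GC numbers are exposed through the nodes stats API
(the exact path varies a little by ES version; this form works on 1.x):

# JVM heap and GC stats for every node
curl -XGET 'localhost:9200/_nodes/stats/jvm?pretty'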
Yes, we use BigDesk to monitor the cluster. The allocated heap size is
16 GB per node, and it was not exhausted. The largest .fdt file in each
shard is about 2 GB.
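If it helps, per-shard segment sizes can also be checked directly with
the segments API rather than by looking at .fdt files on disk:

# list segments (with size_in_bytes) for every shard of the index
curl -XGET 'localhost:9200/boards/_segments?pretty'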
I just found that when a shard, say shard 17, gets dropped, I can put it
back with manual allocation using this command:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [{
    "allocate": {
      "index": "boards",
      "shard": 17,
      "node": "6EuzZatFRTK6q6F2boSOZw",
      "allow_primary": true
    }
  }]
}'