We are doing a POC to analyse and visualize 64 million records (175 columns) of data on a 3-node cluster. Below are the settings:
Each node capacity: CPU: 10 cores; RAM: 64 GB; Storage: 150 GB
We are using Logstash to ingest the data from Netezza into Elasticsearch. After starting Logstash, we set the index's replica count to 0, since this is only a test. We are still getting a lot of low-disk errors, even though the cluster has 450 GB of storage in total (150 GB per node). For reference, I checked the index size: 12.5 million records took around 50 GB on a single node, so at that rate the full 64 million records would need roughly 256 GB. The pipeline and the API calls we used are sketched below.
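The Logstash pipeline is along these lines (a minimal sketch; the driver path, hostnames, credentials, table, and index name below are placeholders, not our actual values):

```
input {
  jdbc {
    # Netezza JDBC driver; path and connection details below are placeholders
    jdbc_driver_library => "/opt/drivers/nzjdbc.jar"
    jdbc_driver_class => "org.netezza.Driver"
    jdbc_connection_string => "jdbc:netezza://netezza-host:5480/MYDB"
    jdbc_user => "user"
    jdbc_password => "password"
    # page through the table rather than pulling all 64M rows in one result set
    jdbc_paging_enabled => true
    jdbc_page_size => 50000
    statement => "SELECT * FROM MY_TABLE"
  }
}

output {
  elasticsearch {
    hosts => ["http://es-node1:9200", "http://es-node2:9200", "http://es-node3:9200"]
    index => "netezza_data"
  }
}
```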
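The replica change and the size checks were done with the standard settings and cat APIs, roughly as follows (again, `netezza_data` is a placeholder index name):

```
# drop replicas to 0 for the test index
curl -X PUT "http://localhost:9200/netezza_data/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'

# per-node disk usage and shard counts
curl -s "http://localhost:9200/_cat/allocation?v"

# per-index document count and store size
curl -s "http://localhost:9200/_cat/indices/netezza_data?v&h=index,docs.count,pri.store.size,store.size"
```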
I have the following questions:
Can you please let us know how storage is split across the cluster? For example, if disk space on one node runs low, will Elasticsearch store the data on the other nodes instead?
Do you have any recommendations for a better way to ingest the data into the cluster?
What are the next steps if a node goes into read-only mode?
We would appreciate your quick response on this.