Adding another instance as a means to expand writable space

fallenreaper · February 20, 2020, 4:55am

We all run into the issue of running out of space on our initial ES instance. I launched a second one and it spun up, but when I look at the summary for my instance it says:
1 primary and 1 replica.

If I recall, my first instance was a combined master/data node. My second one was just a data node.

I was hoping that by adding this second instance, which points to another 4 TB of space to use, it would fill that up slowly.

What I think I am observing is that the 2 instances aren't acting as combined storage. I think the second instance is a clone of the first; I noticed they had the same size.

What did I do wrong? I presumed it would try to balance the data, maybe shift some from A to B while continuing to index, which would free up space on A. I haven't noticed that happening though.

Is it because one is the master/data combo and the other is data only?
I am in Kibana at the moment, looking at the management page for the index and thinking about how I could edit it. Index management didn't show that I had 2 ES instances up and running, other than the Summary telling me 1 primary and 1 replica.

Also, I am not sure if this means anything, but for 1.4B records at the moment I have only 1 shard, which I don't think is right. I have a 270 GB shard (x 2 because of the replica above). According to the documentation, I should have shards that are under 40 GB or so. The issue is that the shard count is defined when the index is created, so if I wanted to change it, I would need to spin up a new index and migrate everything over. My problem is that I don't know how much space my ingestion will take up when complete, so I can't work backwards to say: X / 40 GB = number of shards I need.

As a follow-up, should I do a single shard, full ingestion to find the space it takes, and then recreate the index with the right shards and then do a full dump from one index to another?
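A minimal sketch of that "recreate and dump" plan, assuming the standard create-index and `_reindex` REST endpoints. It only builds the request method, path, and JSON body so nothing is sent to a cluster; the index names `logs-v1` and `logs-v2` and the shard count of 7 are placeholders, not values confirmed anywhere in this thread.

```python
import json


def create_index_request(index, shards, replicas=0):
    """Build (method, path, body) for creating an index with a fixed shard count."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,      # fixed once the index exists
                "number_of_replicas": replicas,  # can be changed later
            }
        }
    }
    return "PUT", f"/{index}", body


def reindex_request(source, dest):
    """Build (method, path, body) for copying all documents from source to dest."""
    body = {"source": {"index": source}, "dest": {"index": dest}}
    return "POST", "/_reindex", body


if __name__ == "__main__":
    # Hypothetical names: "logs-v1" is the old single-shard index.
    for method, path, body in (
        create_index_request("logs-v2", shards=7),
        reindex_request("logs-v1", "logs-v2"),
    ):
        print(method, path)
        print(json.dumps(body, indent=2))
```

Note that both indices exist side by side until the copy finishes, so the cluster needs room for both for a while.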

spinscale · February 20, 2020, 8:13am

Hey,

If you configure an index to have a replica, then a copy of your data is made somewhere in the cluster. However, with a single-node cluster it does not make sense to keep that copy on the same node, because it would provide no fallback if the node were removed from the cluster.

Adding a second node to your cluster in that case just ensures there is a copy of everything on that second node.

Either you add a third node to have data moved off the two other nodes, or you live with having no copies of your data available so that each of the two nodes stores a different set of data.

Also, if you only have one index with 1 shard, this single shard won't be split up further even if you add more nodes. A shard is basically the unit of scale for data within your cluster.
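To make the point about shards and replicas concrete, here is a deliberately simplified model. It is not Elasticsearch's real allocator: it just spreads shard copies round-robin and follows the one rule described above, that two copies of the same shard never share a node. The function name `spread_copies` is made up for this illustration.

```python
def spread_copies(primaries, replicas, nodes):
    """Toy model: return (shard copies per node, number of unassigned copies).

    Copies of one shard never share a node, so at most `nodes` copies
    of each shard can be placed; the rest stay unassigned.
    """
    placed_per_shard = min(1 + replicas, nodes)
    counts = [0] * nodes
    for shard in range(primaries):
        for copy in range(placed_per_shard):
            # Consecutive offsets land on distinct nodes for one shard.
            counts[(shard + copy) % nodes] += 1
    unassigned = primaries * (1 + replicas - placed_per_shard)
    return counts, unassigned


if __name__ == "__main__":
    print(spread_copies(1, 1, 1))   # one node: the replica has nowhere to go
    print(spread_copies(1, 1, 2))   # two nodes: each holds a full copy
    print(spread_copies(1, 0, 2))   # no replica: one node holds everything
    print(spread_copies(40, 0, 2))  # many shards: data spreads across nodes
```

In this model, adding a node to a 1-primary, 1-replica index gives each node one full copy, which matches the "same size" observation in the first post; only more shards (or fewer copies) let the nodes hold different data.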

Hope this helps.

fallenreaper · February 20, 2020, 2:20pm

I was under the impression from some of the videos on the site that I could turn off replication for the time being, so I could just use 2+ nodes as additional storage / processing.

It sounds like a shard is a manageable chunk of data. Since my index has only one shard and the count is fixed at creation, I should probably figure out how large the data is going to be and then plan backwards to approximate the number of shards I will need? I also noticed there may be a relationship between shard count and the RAM or heap size on the server.

Since this is my first FULL ingestion of data, I am trying to figure out the full size of what it will be, unless I should just approximate things. I figure that 2 TB of raw log info will likely produce 3 TB of data (an overestimate), so I should consider 3 TB / 40 GB => shard count.
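The back-of-the-envelope arithmetic above can be written out as a small helper. This is only a sketch: the 40 GB target is the figure quoted in this thread rather than a hard limit, sizes are taken in decimal GB (3 TB = 3000 GB), and `shard_count` is a name invented for the example.

```python
import math


def shard_count(estimated_gb, target_shard_gb=40):
    """Smallest number of primary shards that keeps each shard at or under the target size."""
    return max(1, math.ceil(estimated_gb / target_shard_gb))


if __name__ == "__main__":
    print(shard_count(3000))  # the 3 TB estimate from this post
    print(shard_count(270))   # the current 270 GB single shard
```

Rounding up means the shards come out slightly under the target instead of slightly over it.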

After I do this full ingestion, though (it's been taking days, ~2M records every 5 minutes or so), can I stand up a new ES instance and then essentially PIPE all the data into it?

I was not sure whether I want to back up all the data after I implement sharding, as I am not sure if a restore covers just the data or the data and the instance definitions.

spinscale · February 20, 2020, 2:25pm

Indeed, you can turn off replication by setting the number of replicas to 0.
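As a rough illustration of that setting change, the snippet below builds the update-settings request for an index. It does not contact a cluster; the index name `logs` is a placeholder and `set_replicas_request` is a helper invented for this example.

```python
import json


def set_replicas_request(index, replicas):
    """Build (method, path, body) for changing the replica count of an existing index."""
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"/{index}/_settings", body


if __name__ == "__main__":
    method, path, body = set_replicas_request("logs", 0)  # "logs" is hypothetical
    print(method, path)
    print(json.dumps(body, indent=2))
```

The same method, path, and body can be sent with curl or pasted into Kibana's Dev Tools console.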

You should definitely consider splitting up your index into more shards if one shard holds all four terabytes. You can take a look at the split index API.

fallenreaper · February 20, 2020, 6:14pm

Oh, so that is built in? I thought it could only be set on init. If I can shard an instance while it is running, that would be fantastic. Yeah, I think I am going to turn off replicas and then shard it.

Is this doable from within Kibana as well, in the index settings, or should I just run the curl request from my terminal?

Based on my information, I was thinking 40 shards. 40 x 40 GB = 1.6 TB approx. Doesn't sound too bad.

@spinscale When looking at the API though, it seems it doesn't do a relief in place. It requires me to switch the index to read-only mode before doing this, and it will create a new index in the process. Is there no way to say: "OK, shards set to 40; process all incoming data as such and reprocess all existing data in the background", to allow ingestion while updating the shard count?
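The two steps described here (block writes on the source, then split into a new index) might be sketched as follows. The code only builds the requests; `logs` and `logs-split` are placeholder index names and the helper functions are invented for the example. The target shard count has to be a multiple of the source shard count and is further constrained by the source's `index.number_of_routing_shards`, so it is worth checking the split index documentation before settling on a value such as the 40 discussed here.

```python
import json


def block_writes_request(index):
    """Build the settings update that makes the source index read-only for writes."""
    body = {"settings": {"index.blocks.write": True}}
    return "PUT", f"/{index}/_settings", body


def split_request(source, target, target_shards):
    """Build the split request that creates `target` with more primary shards."""
    body = {"settings": {"index.number_of_shards": target_shards}}
    return "POST", f"/{source}/_split/{target}", body


if __name__ == "__main__":
    # Hypothetical names; 40 is the shard count floated in this thread.
    steps = [
        block_writes_request("logs"),
        split_request("logs", "logs-split", 40),
    ]
    for method, path, body in steps:
        print(method, path)
        print(json.dumps(body, indent=2))
```

Because the source stays write-blocked during the split, ingestion has to pause or be pointed somewhere else for that period.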

I have been watching more videos about design, so I am trying to decide whether I should keep this as a single shard until ingestion is complete, as there are a few weeks of planned downtime. I have been trying to wrap my head around whether this would adversely affect ingestion... and if there are 2 nodes, does that mean the shard would consume both nodes until I do this resharding/reindex?

fallenreaper · February 20, 2020, 6:41pm

It seems like I may be able to set things to read-only, start the sharding process into the new index, and then, after the shards are allocated, kill all my instances of Logstash, update the ingestion output to point at this new index, and turn them all back on.

That way it would keep ingesting while slowly migrating over to the new index. The only issue I can foresee is that the old index isn't deleted, as this isn't a relief in place. I will have to wait until the reindex is complete before I can delete it, which means I'll have to wait a while to free up all of that space. This can be a problem if disk space, or more importantly SSD space, is limited.

Is there also a way to just repoint an alias for the index, so I don't have to touch any of the Logstash machines? They would submit to Foo, but ES would see that and just pipe it into the new index?

It looks as though I can create an alias in Elasticsearch, but I'm curious whether that would cause issues because the alias name is in fact a current index, just in read-only mode.
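On the name clash raised here: an alias cannot share a name with an existing index, so one possible approach is a single `_aliases` call that removes the old index and adds the alias in the same step. The sketch below only builds that request body. The names are hypothetical (`foo` stands in for the "Foo" above, since index names must be lowercase, and `foo-split` for the new index), and `remove_index` deletes the old index, so it only makes sense once the new index is complete.

```python
import json


def swap_index_for_alias_request(old_index, new_index):
    """Build an _aliases request that replaces `old_index` with an alias of the same name."""
    body = {
        "actions": [
            # Writers keep using the old name, which now resolves to new_index.
            {"add": {"index": new_index, "alias": old_index}},
            # Destructive: drops the old index so its name is free for the alias.
            {"remove_index": {"index": old_index}},
        ]
    }
    return "POST", "/_aliases", body


if __name__ == "__main__":
    method, path, body = swap_index_for_alias_request("foo", "foo-split")
    print(method, path)
    print(json.dumps(body, indent=2))
```

With this arrangement the Logstash outputs would not need to change, because they keep writing to the same name.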

fallenreaper · February 20, 2020, 6:54pm

I think that since I am going on vacation for the next 3 weeks, I'll let ingestion continue for now even though it is a single shard. After I return, phase 1 of my ingestion will likely be finished. This would allow me to begin a migration prior to phase 2, which is my next batch of files. My largest worry is that with 1 shard and only 8 GB of RAM allocated to each node, there could be computational issues related to a single shard.

system · March 19, 2020, 6:54pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
