One elastic node fills up prior to the others?

Am I missing a policy or something? I have an 8-node cluster with 3k Winlogbeat agents reporting in. I ran winlogbeat setup -e to get the template installed. I've reviewed the logs and nothing in them seems related to allocation. I can't even find a setting that would explain why one random node fills up faster than the others.

Any tips on where to begin? I'm self-taught in ELK and haven't been able to work out whatever config I have overlooked this time.

Do all nodes have the same amount of disk space allocated?
Do they have a similar number of shards allocated to them?

Yes. I've mounted a 200 GB disk on every Ubuntu host, and that's where the data folder resides. The only node that differs is the first one, which I created before I designed the cluster (I have wiped its data since then but kept the 1 TB disk assigned).

All the nodes have a similar number of shards allocated, differing by maybe +/- 1. Or at least until the watermark kicks in on the bad node.

Interesting...

What size are your shards on your biggest indices?

Can you run a

GET _cat/nodes/?v&h=name,du,dt,dup

and

GET _cat/indices/?bytes=b&s=store.size:desc&v=true

name           du       dt   dup
elastic06  10.7gb  195.8gb  5.51
elastic05  18.8gb  195.8gb  9.64
elastic08    11gb  195.8gb  5.63
elastic02 101.5gb  195.8gb 51.86
elastic01  51.7gb 1006.9gb  5.14
elastic07  19.4gb  195.8gb  9.93
elastic03 102.1gb  195.8gb 52.15
elastic04  11.1gb  195.8gb  5.67
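To put the numbers above in context, here is a rough Python sketch that computes how much more data each node can take before it crosses the low disk watermark. It assumes the Elasticsearch default of 85% for that watermark (this cluster may be configured differently) and uses the rounded du/dt values from the table, so the percentages can differ slightly from the dup column.

```python
# Assumption: default low disk watermark (85% of disk used).
LOW_WATERMARK_PCT = 85.0

# (node, used GB, total GB) copied from the _cat/nodes output above.
NODES = [
    ("elastic06", 10.7, 195.8),
    ("elastic05", 18.8, 195.8),
    ("elastic08", 11.0, 195.8),
    ("elastic02", 101.5, 195.8),
    ("elastic01", 51.7, 1006.9),
    ("elastic07", 19.4, 195.8),
    ("elastic03", 102.1, 195.8),
    ("elastic04", 11.1, 195.8),
]


def headroom_gb(used_gb, total_gb, watermark_pct=LOW_WATERMARK_PCT):
    """GB that can still be written before the node crosses the watermark."""
    return total_gb * watermark_pct / 100.0 - used_gb


def report(nodes):
    """Return (node, percent used, headroom GB), least headroom first."""
    rows = [
        (name, round(100.0 * used / total, 2), round(headroom_gb(used, total), 1))
        for name, used, total in nodes
    ]
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    for name, pct, room in report(NODES):
        print(f"{name}: {pct:5.2f}% used, {room:6.1f} GB before low watermark")
```

On these figures elastic02 and elastic03 have the least headroom, even though every 200 GB node has the same disk.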

The output of the other API request is lengthy: https://pastebin.com/3zpKbgHv. I linked it here because pasting it into Discuss was gross.

Could this be tied to a missing policy of some sort? I never set up rollover properly, but I don't think that should affect allocation.

Can you run the cat indices as well? I am pretty sure I know what is going on, but I want to see that first.

What you showed is the cat nodes.

Ohh, I see the pastebin... let me look.

Hi @Jcroy

What version of elasticsearch are you on?

Can you please run

GET _ilm/policy/winlogbeat-7.9.1 <----- this should govern the rolling over of indices...

and let's double-check the shards

GET _cat/shards/winlogbeat*?v

This appears to have nothing to do with routing / allocation and everything to do with shard size. So yes, it looks like you are missing an ILM policy, or it is not being used, or the settings are incorrect.

In short, your winlogbeat indices have 1 primary and 1 replica shard ... and each shard is ~100GB. (That is about double what we recommend, for starters.)

What you are seeing is 1 x 100GB shard on each of nodes 2 and 3 from index winlogbeat-7.9.1-2021.14. Each of those shards takes up about 50% of the storage on its node by itself. To be clear, a shard is atomic: it can live on only 1 node.

I am not sure what your index naming convention 2021.14 means (why you changed it, what you are trying to accomplish... weekly? I am sure you had a method in mind), or which settings you have and have not set or changed.

Using all the defaults, the shard size would have been 50GB per shard and the indices would have automatically rolled over at that size, and you would get a more even distribution.
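As a rough illustration of why size-based rollover evens things out, the Python sketch below compares one index per week with rollover at a size threshold. The daily volumes are hypothetical, not taken from this cluster, and the model is simplified: it assumes a single primary shard, and real ILM checks its conditions periodically, so an index can overshoot the threshold a little.

```python
def size_based_rollover(daily_gb, max_size_gb):
    """Close the current index as soon as it reaches max_size_gb."""
    closed, current = [], 0.0
    for volume in daily_gb:
        current += volume
        if current >= max_size_gb:
            closed.append(current)
            current = 0.0
    if current > 0:
        closed.append(current)  # the index still being written to
    return closed


def weekly_indices(daily_gb):
    """One index per 7 days of data, regardless of size."""
    return [sum(daily_gb[i:i + 7]) for i in range(0, len(daily_gb), 7)]


if __name__ == "__main__":
    # Hypothetical ingest: a quiet week followed by a busy week.
    daily = [1.0] * 7 + [25.0] * 7
    print("weekly    :", weekly_indices(daily))
    print("size-based:", size_based_rollover(daily, max_size_gb=50))
```

With these made-up volumes the weekly scheme yields one tiny index and one of 175 GB, while the size-based scheme keeps every closed index close to the threshold.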

So ... what to do?

You could go back to the defaults for the index names etc., which should use the default ILM policy. This will result in 50GB shards. This assumes that at some point you ran winlogbeat setup.

Second, as an observation, having 8 nodes with only 200GB of storage each (perhaps you intend to grow) is a bit odd for an observability cluster. When shards are 50GB (the default), each shard still fills up to 50GB on its node, so some nodes are going to fill up by that 50GB before the next shard gets started on another node.

You can create your own ILM policy and set the rollover size to, say, 20GB if you want a more even distribution.

You could go to some sort of Daily or Weekly indices scheme

If you create your own ILM policy or edit the existing one, you will need to update the template to use it. Then you need to be careful not to overwrite it.
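For anyone scripting this rather than using Kibana, here is a minimal Python sketch of what the body of a custom policy could look like. It only builds and prints the JSON that would be sent to PUT _ilm/policy/<name>. The function name and the 20gb default are illustrative assumptions, and the structure mirrors the hot-phase rollover action of the default Winlogbeat policy.

```python
import json


def build_policy(max_size="20gb", max_age="30d"):
    """Request body for PUT _ilm/policy/<name> with a hot-phase rollover.

    build_policy is a hypothetical helper, not part of any Elastic client.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "min_age": "0ms",
                    "actions": {
                        "rollover": {"max_size": max_size, "max_age": max_age}
                    },
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(), indent=2))
```

The index template then has to reference the policy through the index.lifecycle.name setting, otherwise new indices will not pick it up.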

See here:

setup.template.enabled

Set to false to disable template loading. If this is set to false, you must load the template manually.

You could even break down that big index if you want but that would take some time.

Think about what you want to do and perhaps we can help you get there.

How Many Shards and How to Size a Cluster are good reading on these topics.

So I was trying to change the index interval from daily to weekly when I first set it up, hence the 2021.14 etc. Logstash sends it to Elasticsearch with a weekly number in the index name.

Output of GET _ilm/policy/winlogbeat:

{
  "winlogbeat" : {
    "version" : 2,
    "modified_date" : "2021-03-31T16:08:21.398Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "50gb",
              "max_age" : "30d"
            }
          }
        },
        "cold" : {
          "min_age" : "90d",
          "actions" : {
            "set_priority" : {
              "priority" : 0
            }
          }
        }
      }
    }
  }
}

And output of GET _cat/shards/winlogbeat*?v:

index                    shard prirep state        docs   store ip    node
winlogbeat-7.9.1-2021.14 0     p      STARTED 361842642 169.4gb xxxxx elastic02
winlogbeat-7.9.1-2021.14 0     r      STARTED 361842642 169.9gb xxxxx elastic01
winlogbeat-7.9.1-2009.53 0     r      STARTED        73 108.7kb xxxxx elastic04
winlogbeat-7.9.1-2009.53 0     p      STARTED        73 108.7kb xxxxx elastic01
winlogbeat-7.9.1-2021.11 0     p      STARTED     11272   4.2mb xxxxx elastic03
winlogbeat-7.9.1-2021.11 0     r      STARTED     11272   4.1mb xxxxx elastic08
winlogbeat-7.9.1-2021.13 0     p      STARTED  16873905   7.8gb xxxxx elastic05
winlogbeat-7.9.1-2021.13 0     r      STARTED  16873905   7.9gb xxxxx elastic07
winlogbeat-7.9.1-2021.12 0     p      STARTED    118888  53.8mb xxxxx elastic03
winlogbeat-7.9.1-2021.12 0     r      STARTED    118888  53.7mb xxxxx elastic01
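The shard listing above can be summarised per node with a short Python sketch like the one below. It covers only the winlogbeat shards shown here (other indices on the cluster are not included) and assumes the _cat size suffixes are 1024-based. The helper names are illustrative.

```python
from collections import defaultdict

UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4, "b": 1}


def parse_size(text):
    """Convert a _cat size such as '169.4gb' into bytes."""
    for suffix in ("kb", "mb", "gb", "tb", "b"):  # plain 'b' is checked last
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognised size: {text!r}")


# (store, node) pairs copied from the _cat/shards output above.
SHARDS = [
    ("169.4gb", "elastic02"), ("169.9gb", "elastic01"),
    ("108.7kb", "elastic04"), ("108.7kb", "elastic01"),
    ("4.2mb", "elastic03"), ("4.1mb", "elastic08"),
    ("7.8gb", "elastic05"), ("7.9gb", "elastic07"),
    ("53.8mb", "elastic03"), ("53.7mb", "elastic01"),
]


def per_node(shards):
    """Return {node: (shard count, total GB)}."""
    counts, totals = defaultdict(int), defaultdict(float)
    for store, node in shards:
        counts[node] += 1
        totals[node] += parse_size(store)
    return {node: (counts[node], totals[node] / 1024**3) for node in counts}


if __name__ == "__main__":
    for node, (count, gb) in sorted(per_node(SHARDS).items()):
        print(f"{node}: {count} shard(s), {gb:8.2f} GB")
```

The shard counts per node are close (1 to 3), but the bytes are not: elastic02 and elastic04 each hold one shard, yet one of those shards is about 169 GB and the other is about 109 KB.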

I also plan on adding more storage down the road. I'm trying to clean up the config/get everything working as designed before I worry about the space requirements.

OK, good, thanks. That pretty much confirms what I was thinking.

What I would suggest (aka highly recommend) is to get out of the mindset of time-based rollover and just use the defaults that come with the default ILM policy. The default policy above is size based, not time based, although it has a 30-day backstop (which won't really matter for you).

We have put a lot of thought into the best practices / defaults. They are represented by the ILM policy above, which is 50GB per shard or 30 days per index. So whether that is 1 day of data, 1 week, or 1 month, let the ILM policy do that for you.

Each of these will be about 50GB per shard; in this case (my example) it took a couple of days to get to 50GB.

These index names are fully auto-generated with the date of the first doc, and the 000001 is the ILM sequence number.

winlogbeat-7.12.0-2021.04.08-000001
winlogbeat-7.12.0-2021.04.10-000002
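To make the naming pattern concrete, here is a small Python sketch that splits such a name into its parts. The regular expression is an assumption inferred from the two example names above, not an official specification, and parse_ilm_index is a hypothetical helper.

```python
import re
from datetime import date

# <beat>-<version>-<yyyy.MM.dd>-<6-digit sequence>, as in the names above.
ILM_NAME = re.compile(
    r"^(?P<beat>[a-z]+)-(?P<version>\d+\.\d+\.\d+)"
    r"-(?P<y>\d{4})\.(?P<m>\d{2})\.(?P<d>\d{2})-(?P<seq>\d{6})$"
)


def parse_ilm_index(name):
    """Return (beat, version, date, sequence), or None if the name differs."""
    match = ILM_NAME.match(name)
    if match is None:
        return None
    stamp = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return match["beat"], match["version"], stamp, int(match["seq"])


if __name__ == "__main__":
    print(parse_ilm_index("winlogbeat-7.12.0-2021.04.08-000001"))
    print(parse_ilm_index("winlogbeat-7.9.1-2021.14"))
```

A weekly name such as winlogbeat-7.9.1-2021.14 does not fit the pattern, which is a quick way to spot indices that were not created through rollover.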

When you use a weekly index name, the ILM policy is ignored. So when you switched to weekly, that week ended up with 170GB shards... that is not good.

Also, trying to go time based is hard: as you bring on more hosts, the indices / shards will get bigger... or if you have a busy week you will have bigger shards... or in a slow week they will be small.

You can even see that in your indices above... each week is a radically different size... not optimal, for sure.

So to use the defaults, take the index naming stuff out of your config. If you really want to adjust, go in through Kibana and change the default winlogbeat ILM policy to, say, 30GB (that is a little less efficient, but fine if you want smaller shards).

I had not asked what your architecture is. If it is the following, simply point Winlogbeat at Elasticsearch and take out all the other stuff. If you share your winlogbeat.yml we can take a look.

winlogbeat -> Elasticsearch

Or is there something in between? Like

winlogbeat -> Logstash -> Elasticsearch

Let me know... but that is certainly my suggestion: use size based.

We can even help you break down that massive one if needed, but I am not sure whether that is really important for you (or you could just get rid of it).

So the data ingestion begins with Beats agents shipping data to Logstash. Then obviously it's passed on to Elasticsearch. I'm not sure if there is any benefit to going directly to Elasticsearch from Beats, so I threw in Logstash, as I plan on adding syslog down the road and want to be flexible for the future.

I originally set up the conf.d file as elastic.co lists on this link. Then I changed it to weekly, thinking that would help. I'm going to change it back and set up a policy as you stated.

If I set the index line to something like index => "%{[@metadata][beat]}-%{[@metadata][version]}", would it fall under the policy? Did I understand you correctly? Ideally it'd roll over after a size limit and move on to an archiving phase after the parameters are met. Thanks for the insight, it's truly appreciated. It's been fun learning Elastic, but a constant uphill challenge.

Beats -> Logstash -> Elasticsearch
is a perfectly fine / excellent / flexible / commonly used architecture

For the above architecture to use all the defaults here is the default / simple pipeline.

Please keep in mind the gotcha: if you put all the pipelines in the conf.d directory, they all get concatenated, so if you put that syslog.conf in there they will get mixed. Once you add it, you should use pipelines.yml to keep them separate.

################################################
# beats->logstash->es default config.
################################################
input {
  beats {
    port => 5044
  }
}

output {
  if [@metadata][pipeline] {
    elasticsearch {
      hosts => "http://localhost:9200"
      manage_template => false
      index => "%{[@metadata][beat]}-%{[@metadata][version]}"
      pipeline => "%{[@metadata][pipeline]}" 
      user => "elastic"
      password => "secret"
    }
  } else {
    elasticsearch {
      hosts => "http://localhost:9200"
      manage_template => false
      index => "%{[@metadata][beat]}-%{[@metadata][version]}"
      user => "elastic"
      password => "secret"
    }
  }
}