Concept questions: index names for multiple customers, and archiving

Hi,
I'm very new to ELK and have read a lot about it.
I like it very much and will keep going deeper to understand it better.
I've now built a lab to test ELK for log management.

In our company we have a lot of dedicated customers.
The use case I want to test is saving the Windows security event logs in a central place for auditing and archiving.

OK, I know I used the wrong Beat in the following example, but I only need it for understanding.

At the moment the index names created by Filebeat are filebeat-2018-07-10, filebeat-2018-07-12, and so on. How do I get the index name to include the customer, like filebeat-customer1-2018-07-10 or filebeat-windows-customer1-2018-07-10?
Or is this the wrong concept for a later delete process that removes all data of a customer? If I'm wrong, it would be great if you could explain how to accomplish this scenario.
I tested tags and additional fields (with the customer name in them), but I read that Curator deletes by index name, not by tags or field values inside an index.
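To make it concrete, this is roughly the kind of thing I experimented with in filebeat.yml (customer1 and the log path are just placeholders, and I'm not sure this is the right approach):

```yaml
# filebeat.yml (test lab) -- "customer1" and the path are placeholders
filebeat.inputs:
  - type: log
    paths:
      - /var/log/*.log
    fields:
      customer: customer1
    fields_under_root: true

output.elasticsearch:
  hosts: ["localhost:9200"]
  # use the customer field in the index name, keeping the date pattern
  index: "filebeat-%{[customer]}-%{+yyyy-MM-dd}"

# needed when the default index name is overridden
setup.template.name: "filebeat"
setup.template.pattern: "filebeat-*"
```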

The other question is how to archive the logs.
My idea is to configure 90 days for research, and after that time the logs are archived (2 years retention) in case later research is needed.

Sorry for the newbie questions, but I'm stuck right now, and concept and understanding are everything before going to production :slight_smile:

I took the liberty of combing through and extracting a few relevant bits here:

Shard count per node matters

If you do, indeed, have lots of customers, and you're creating a new index per customer per day, you might wind up with more shards on each box than you can effectively maintain, especially if you want to keep 90 days worth of indices active. If you have a single shard plus 1 replica, one customer means 180 shards over a 90 day period. Multiply that out and you can see where that goes: with, say, 30 customers, that would be 5,400 shards. The "safe" shard count for a single node with a 31G heap is approximately 600 (which fluctuates depending on the amount of querying, and what kinds of queries are being performed). While I've seen single nodes with well over 2,000 shards, these all end up suffering terrible performance for all cluster and querying metrics. These symptoms start to manifest after the 600 shard range. If your heap is less than 31G, then that 600 number drops right along with it.

I present these for your consideration because if you were not aware of these things, you would have a healthy, fully functioning cluster for the first few weeks, and then performance would start to go downhill. Chances are good you would have no idea why, at that point.

Curator

You are correct that Curator acts on whole indices, and not the documents stored inside. You can also use Curator to perform snapshots of your data for archival purposes.
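As a rough sketch, an action file along these lines would snapshot and then delete indices older than 90 days (the repository name archive_repo and the filebeat-customer1- prefix are placeholders, and the snapshot repository has to be registered in Elasticsearch beforehand):

```yaml
# actionfile.yml -- repository name and index prefix are placeholders
actions:
  1:
    action: snapshot
    description: "Archive indices older than 90 days before deleting them"
    options:
      repository: archive_repo        # must already be registered in Elasticsearch
      name: archive-%Y%m%d
      wait_for_completion: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: filebeat-customer1-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y-%m-%d'
        unit: days
        unit_count: 90
  2:
    action: delete_indices
    description: "Delete the indices that were just snapshotted"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: filebeat-customer1-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y-%m-%d'
        unit: days
        unit_count: 90
```

The snapshots would then be what you keep for the 2-year retention period, and you can restore them later if a case needs to be researched again.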


Thank you, Aaron, for your help with that.
Can you give me a suggestion for how I could achieve the goal?

Would indices per week, like filebeat-2018-27, help?

Or is my goal not achievable with Elasticsearch?
Thank you very much for the details, I appreciate your time.

Even if you go with monthly indices you will have at least 8 shards per customer (4 monthly indices, each with 1 primary and 1 replica, to ensure 90 days are always available). This will scale better than your daily indices, but it will still not accommodate a very large number of customers.

How many customers do you expect to have? How much data will each of them generate per month?

We are searching for a central log management solution; the environment is the following:
800 servers with a 90/10 Windows/Linux ratio and about 40 appliances with syslog data, divided among 30 customers

Security event logs with auditing activated in Active Directory, maybe more event logs if possible (application, system, Sysmon)
Auditd logs and syslog data on Linux
Syslog from the appliances
Log data from shared services like antivirus

There's a lot of data :slight_smile:

How to best organise indices will depend on the expected number of customers and data volumes. Setting up a cluster to handle 10 customers with large data volumes may require a different strategy compared to supporting 10000 customers each with a small amount of data.

Are all customers expected to be of similar size or will you have a few large ones and a larger number of small ones?

Normally it's 2 DCs per customer, but some are bigger.
We have about 5-10 large customers with a lot more data.

If you have a limited number of customers with large amounts of data, having indices per customer makes sense. Make sure you do not create a lot of small indices, though, as that is inefficient. In order to get shards in or around an ideal size (you will need to determine what this is), you can use the rollover index API. This makes it possible to create new indices in the background as soon as the previous one reaches a target size, which makes it a lot easier to cater for customers with different volumes.

You could, for example, give each index 2 primary shards and roll over as soon as the shard size is estimated to reach around 20GB, or the index has been open longer than a week or two. For customers with lower volumes each index will likely cover the maximum time period, while indices will roll over more frequently based on size for large customers. This allows you to minimize the number of shards in the cluster.

And Curator also supports the Rollover API, so you could still use it to meet that need.
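As a rough sketch (the alias name and thresholds below are examples, loosely following the 2-shard / ~20GB-per-shard idea above), a Curator rollover action could look like this:

```yaml
# actionfile-rollover.yml -- alias name and thresholds are examples only
actions:
  1:
    action: rollover
    description: >-
      Roll the write alias over to a new index once the current one is
      about two weeks old or has grown past roughly 40GB (2 shards x 20GB)
    options:
      name: filebeat-customer1        # write alias, not a concrete index name
      conditions:
        max_age: 14d
        max_size: 40gb
      extra_settings:                 # optional: settings for the new index
        index.number_of_shards: 2
        index.number_of_replicas: 1
```

One thing to be aware of: the write alias has to exist before the first rollover, so you create the initial index (e.g. filebeat-customer1-000001) with the alias pointing at it, and have the shippers write to the alias rather than to dated index names.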

Thank you very much Aaron and Christian
