Handle 150,000 logs per sec

Hi all,
I'm here to ask for help with the sizing of an Elasticsearch cluster. I have to index about 5TB of logs a day (from 7am to about 7pm) with 7 days of retention. That is about 150,000 events per second, coming from roughly 100,000 machines with dozens of different log types.

Assumptions:

  1. About 50 concurrent users using Kibana;
  2. Computation of scripted fields is required (I think with regex too);
  3. Total space required: about 70TB;

Questions are:

  1. I think about 15 nodes are required, each with 16 cores, 64GB of RAM, and 5TB of disk, where every node handles about 10,000 logs per second. What do you think?
  2. With this scenario, what is the best Filebeat/Logstash architecture? Do I have to use a queue?

Thanks.
Kind regards.

Often you are better off building the field at index time, using an ingest processor or Logstash or something.
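
For illustration only (the pipeline name, field names, and grok pattern here are made-up placeholders), an ingest pipeline that extracts a field at index time could look roughly like this:

```
PUT _ingest/pipeline/parse-app-log
{
  "description": "Sketch: pull a 'component' field out of the raw message at index time",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{DATA:component}: %{GREEDYDATA:rest}"]
      }
    }
  ]
}
```

You then point indexing requests at it with the `pipeline` parameter (or set it in the Logstash/Filebeat Elasticsearch output), so the field already exists by the time Kibana queries it.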

Without any replicas.

Often 128GB or more is the sweet spot for RAM, at least that is what I've heard lately.
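
Part of why the bigger boxes work out is that the Elasticsearch heap is normally capped somewhere under ~32GB (to keep compressed object pointers), and everything above that is left to the OS filesystem cache. A minimal jvm.options sketch assuming that split:

```
# jvm.options -- heap kept just under 32GB so compressed oops stay enabled;
# the rest of the machine's RAM is deliberately left to the filesystem cache
-Xms30g
-Xmx30g
```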

SSDs index faster, and 5TB of SSD per node is pretty pricey. So you may want to look at a hot/warm architecture and the rollover and shrink APIs.
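
A hot/warm layout is mostly just a node attribute plus an allocation filter. Roughly, assuming a `box_type` attribute (the name is only a convention) and a made-up index name:

```
# elasticsearch.yml on the SSD-backed "hot" nodes; warm nodes get box_type: warm
node.attr.box_type: hot
```

```
PUT logs-apache-2017.06.01/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}
```

New indexes get created requiring `hot`, and once they are a few days old you flip the setting to `warm` so they migrate to the cheaper disks.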

What is your hardware procurement process like? If possible, it tends to be useful to experiment with one or a couple of nodes and see what rates you can get out of them, then scale up. Experimenting is always going to be important because we don't know what kind of analysis you want to do, what kind of documents you have, or what the scripted fields are like.

You are often better off making an index per log type. You shouldn't have to worry about the number of indexes being too large with only 7 days of retention.
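
If you go that way, a single index template can cover all the per-type indexes; a rough sketch with placeholder names and settings (on 5.x the key is `template` rather than `index_patterns`):

```
PUT _template/logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```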

Yeah, I know, but one of the requirements is to compute some new fields at query time, for new analysis. When there are thousands of possible log patterns, maybe with custom messages, and thousands of machines, it's not so easy to extract all the possible fields at index time. So I need to use scripted fields, for example to extract a substring from a field and so on.
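
For reference, a Painless scripted field doing that kind of substring extraction might look roughly like the sketch below (the `message.keyword` field name is an assumption); this is exactly the sort of per-document, per-query work that gets expensive with 50 concurrent Kibana users:

```
/* Sketch of a Kibana scripted field in Painless: everything before the first ':' in the message.
   Assumes a keyword sub-field named message.keyword exists on the documents being queried. */
def m = doc['message.keyword'].value;
int i = m.indexOf(':');
return i > 0 ? m.substring(0, i) : m;
```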

Why? 5TB * 7 = 35TB a week, and with a replica = 70TB.
Do you mean Elasticsearch uses only half of a disk?

Good point.

I don't have this info yet.

How many logs can a single index handle?
My idea is to use one index per log type per day.

Thank you for your help :slight_smile:

Yeah, you are right. I would make sure not to cut it close though - Elasticsearch gets upset if the disk is more than 85% full.
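
That 85% is the low disk watermark, and the thresholds are cluster settings you can adjust if you ever need to; the defaults are roughly:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}
```

At the low watermark Elasticsearch stops allocating new shards to the node, and at the high watermark it starts relocating shards away from it.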

The hard limit is somewhere around two billion documents per shard (Lucene's limit). But you are more likely to want smaller indexes just for managing recoveries. Smaller indexes recover faster because there are fewer bytes to throw around the network. Sure, the total number of bytes is the same, but with smaller indexes you get incremental progress.

Have a look at rollover/shrink. You have enough traffic that it is really worth looking at. In that case you'd have an index per type per time period, with rollover making a new index every once in a while.
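
A minimal rollover flow for one log type could look like this (index/alias names and conditions are made up for illustration):

```
PUT logs-apache-000001
{
  "aliases": { "logs-apache-write": {} }
}

POST logs-apache-write/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 1000000000
  }
}
```

You keep indexing through the alias and call _rollover periodically (from cron or whatever scheduler you like); when a condition trips it creates logs-apache-000002 and points the alias at it.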

You honestly might be better off with an even number of nodes so you can put one shard on each one to spread out the write load, then use rollover/shrink to turn all those shards into one when you are done with them. Though you probably also want one node's worth of slack so you can recover to it if another node goes down.
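
Once an index has rolled over and nothing is writing to it anymore, shrinking it down to a single shard is roughly a two-step affair (node and index names below are placeholders):

```
PUT logs-apache-000001/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "warm-node-1",
    "index.blocks.write": true
  }
}

POST logs-apache-000001/_shrink/logs-apache-000001-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null
  }
}
```

The first call makes the index read-only and pulls a copy of every shard onto one node, which is what the shrink API needs before it can combine them.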
