Can Elasticsearch aggregate logs in a "live" manner and meet our needs at scale?

My team is new to Elasticsearch and is considering it for log analysis. I need to be able to tell my manager whether the technology can scale to meet our needs. So that's my question: Can Elasticsearch meet our goals at scale?

Our requirements:

  1. Ingest logs from ~150 servers, each generating ~2GB of logs per day
  2. Events are interlaced in the logs and need to be aggregated (see the example below).
  3. We would like a live view of the events for an operations team to monitor. If only the start line of an event has come through, in Kibana, we want the event to show a duration field that is the difference between now and the start time. As other lines from the event come through, the event should be updated with more detail. Once the end line has come through, the duration should become the difference between the end time and the start time.

Example log format:

sameTaskId, start, username,size
sameTaskId,networkstats
anotherTaskId, start, username,size
sameTaskId, end, authAuditInfo
anotherTaskId,networkstats
anotherTaskId, end, authAuditInfo

Approach:
We have considered the following

  1. Logstash aggregate filter - I don't think this will work because it requires limiting to a single worker thread. This means we can't easily scale. Also, it would hold onto the event until the end came through, so we wouldn't get the "live" view we want.
  2. Logstash Elasticsearch filter - Logstash could look up the start event and update it when the end line comes through. But what if they are received out of order? I assume the lookup would fail. This approach is also very chatty.
  3. Elasticsearch continuous transforms - This is what I'm leaning toward. It sounds like it doesn't have the same inherent scaling concerns as the other options. The main concern is how "live" the data can be in Kibana.

Primary Question: Can Elasticsearch meet our goals?

Additional Questions:

  1. Can continuous transforms keep up with the data and provide the "live" view we want?
  2. Some events are only milliseconds long. Will transforms handle the case where the end line arrives before the start line?

Answering a scaling/sizing question is always hard. Because Elasticsearch is a distributed system, I think the answer is not *if* it can scale but *how many* resources are required for it. Even with the information given, I can't answer your question directly, as there are many parameters to consider.

However, we have users who run transform at scale with similar requirements.

Transform is "as fast as search and indexing can go", so again this is more a resource question. When it comes to a live view, it depends on what you consider live. Elasticsearch itself is a near-realtime search engine. With the out-of-the-box defaults, the index refreshes every 1s, which means a data point you push in becomes available after at most 1s. You can lower this setting, but that will put more load on your cluster, because internally more flushes are required.
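For illustration, the refresh interval is a per-index setting you can change with a single request (the index name `logs-tasks` here is just a placeholder):

```json
PUT /logs-tasks/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}
```

Setting it lower than 1s trades extra refresh/flush load for fresher search results; setting it higher (e.g. `30s`) reduces load if your latency budget allows it.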

On top of that, transform queries your source index and writes the results. In the worst case you therefore need 1s (source index refresh) + 1s (transform frequency) + 1s (destination index refresh) until you see a result in your live view. However, I think such a setting with 3s latency is quite expensive.

So my question: What does "live view" mean for you, and what SLA do you expect?

Out-of-order as well as interlaced lines aren't a problem for transform. For calculating a duration, we have an example in our docs.
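To make this concrete, a continuous transform along these lines could group the log lines by task id and derive a duration with a bucket script. This is only a sketch: the index names (`logs-tasks`, `task-events`), the field names (`taskId`, `@timestamp`), and the 1s frequency are assumptions, not your actual mapping.

```json
PUT _transform/task-duration
{
  "source": { "index": "logs-tasks" },
  "dest": { "index": "task-events" },
  "frequency": "1s",
  "sync": {
    "time": { "field": "@timestamp", "delay": "2s" }
  },
  "pivot": {
    "group_by": {
      "task_id": { "terms": { "field": "taskId" } }
    },
    "aggregations": {
      "start": { "min": { "field": "@timestamp" } },
      "end": { "max": { "field": "@timestamp" } },
      "duration_ms": {
        "bucket_script": {
          "buckets_path": { "start": "start.value", "end": "end.value" },
          "script": "params.end - params.start"
        }
      }
    }
  }
}
```

Because the group_by is on the task id and min/max are taken over whatever lines have arrived so far, arrival order within a task doesn't matter: each checkpoint simply recomputes the bucket from all lines seen for that id.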


Thank you. And I know that this is a nuanced question and that there's no quick and clear answer. Still, I appreciate the input you've given. At the end of the day, my goal is to determine whether this is the right technology stack for our requirements and to be able to defend my position to management before we commit resources.

Let's define "live" as "within 30 seconds." Obviously, something more like 5 seconds would be great, but I think there's some wiggle room here. Do you think this is reasonable?

My follow-up questions will be:

  1. How many transform nodes do you think this would take? Are we talking 1-2 nodes or like 8+ nodes?
  2. Are transform nodes available in Elastic Cloud plans? I don't see these listed as separate node types in the cloud pricing calculator.

Thanks for the update, this sounds more reasonable. 30s definitely sounds doable.

The main work happens in search and indexing; the transform task itself is just a coordinator that takes the search results and creates indexing requests. In other words, it is more important to focus on search and indexing performance.

For search, it is important that you set up your log ingest correctly. So take a step back and get sharding right for the data size and ingest rate you have, and ensure search can run at scale. Both will help transform run at scale; in a nutshell, transform is just a search client. Technically, a search executes concurrently on every shard. That means if you have an index with 5 shards, search will use 5 threads, potentially distributed if these shards are on different nodes. The per-shard results are finally reduced to one result set. Similarly, indexing executes in parallel for every shard.

We have a separate scaling guide for transform. It probably makes sense to create a destination index with >1 shard for your use case. In addition, spend some time on the query: e.g. if you know that a task never takes longer than 1 hour, it will help to specify a query that limits the time range, e.g. to now-1h/m.
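As a sketch of those last two suggestions (index and field names are again placeholders): create the destination index with more than one primary shard up front, and bound the transform's source query to the recent past. The `source` fragment below would go inside the transform definition.

```json
PUT /task-events
{
  "settings": {
    "number_of_shards": 2
  }
}
```

```json
"source": {
  "index": "logs-tasks",
  "query": {
    "range": {
      "@timestamp": { "gte": "now-1h/m" }
    }
  }
}
```

The bounded range query means each transform checkpoint only has to search the last hour of log data instead of the whole index, which keeps the per-checkpoint search cost roughly constant as the index grows.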

In summary, you don't need several transform nodes. Transform does not pull raw data and process it itself; it uses search and aggregations. Technically speaking again, the data is processed and reduced locally on the data node that stores it, and only the reduced results from the shards are combined on the search/coordinating node. Therefore, ensure that your data nodes are fast.

There is a transform node role you can attach to any Elasticsearch node. By default, the transform role is applied to all data nodes. You can customize this, e.g. to avoid running transforms on hot nodes. However, as said, the transform process itself is lightweight compared to the search and indexing work, which happens where the data is or will be stored.
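For example, node roles are set per node in elasticsearch.yml. The role lists below are illustrative, not a recommendation for your cluster:

```yaml
# elasticsearch.yml on a hot data node with the transform role removed
node.roles: [ data_hot, data_content, ingest ]

# elasticsearch.yml on a node explicitly allowed to host transform tasks
node.roles: [ data_warm, transform ]
```

Note that once you set node.roles explicitly, a node has only the roles listed, so the transform role has to be included somewhere in the cluster for transforms to run.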

I think transform is able to meet your requirements. But before starting your endeavor with transform, take the first step: set up a logs cluster with Elasticsearch to ingest, store, and search your log data.

Thanks for the detailed reply! Really, I feel like I'm in good hands.

This is encouraging and it gives me confidence to proceed. I will start with getting our data into Elasticsearch and will start playing around with your suggested optimizations. I'll reach out with more questions if I have them, but I think this satisfies my concerns for now :slight_smile:
