ELK stack questions


(Nikola Kolev) #1

After a month of testing the ELK stack I'm planning to move it to production. Now I'm facing a few dilemmas.

The plan is to index around 30 GB of data per day (mainly log files), and my questions are:

  1. Let's say I want to use 3 nodes for my setup. Can 2 of the nodes hold 4 primary shards and the third hold only replicas, acting as a hot backup?
  2. I plan to compress the _source field and, if possible, some of the other fields. Will that compression slow down searches?
  3. 30 GB is obviously too much. If the compression works well, I should be able to shrink the data to 15 GB/day. However, that still means 450 GB per month. How well will a month's worth of data compress in a snapshot?
  4. Can Kibana search in compressed snapshot data at all?

(Magnus Bäck) #2

This thread is rather off-topic for the Development group where it was posted. Anyway...

  1. Elasticsearch doesn't have a native concept of hot backup nodes. It's possible that you can configure it to place only replica shards on the third node and avoid querying those shards, but it's probably too much work. What problem are you trying to solve?
  2. I actually thought it was compressed by default. The source document isn't normally used for searches (as long as the fields are stored) so it shouldn't slow down too much.
  3. Why would 30 GB "obviously" be too much? If the size of the raw log files is 30 GB it's totally possible that this'll require more than 30 GB of ES storage even with compression, depending on how you analyze the logs. Not sure what you mean by your last question about snapshot for a month back.
  4. Are you talking about ES snapshots, mainly used for backups? If so the answer is no. Data in on-disk snapshots is not available to ES, and consequently not to Kibana either. If you're talking about something else, the answer is that any data that's online in ES is also available to Kibana.
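On point 2, for reference: stored fields (which include _source) are LZ4 block-compressed by default in current Elasticsearch versions, so there's nothing extra to enable. If you're on 2.0 or later you can trade a little indexing speed for a better ratio with the best_compression codec. A sketch, with an example index name (the setting is static, so it can only be changed on a closed index):

```
POST /logs-2015.06.01/_close

PUT /logs-2015.06.01/_settings
{
  "index.codec": "best_compression"
}

POST /logs-2015.06.01/_open
```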

(Nikola Kolev) #3

Continuing the discussion from ELK stack questions:

First, I would like to say sorry for posting in the wrong category; I wasn't sure where to do it.

Let me answer your questions:

  1. I would like to use the two primary nodes for searching the logs from the last three months. The idea is for the third one to act as a backup if something goes wrong with the first two, plus serve searches on indexes that are older than 3 months.

  2. _source, yes, but I'm not sure about fields like _message. There will also be custom fields with longer text, and I'm wondering whether searches will be slower if those fields are compressed.

  3. It's obviously too much, because 30 GB x 30 days is nearly 1 TB per month. That would result in ~11 TB of raw logs per year, and keeping in mind that we must hold logs for AT LEAST one year back, it's obviously too much. At the moment I'm handling this by gzipping data older than 7 days, which does a good job. However, I'm not sure how I can achieve a similar result in Elasticsearch.

  4. The idea of ES snapshots came to mind when I was searching for a way to shrink data older than X months. Let's say that a user wants to search for data older than 3 months - I would like to route the query to the replica shard and somehow give him the ability to search through these snapshots without too much effort.

Imagine the following scenario: a user comes to search for data older than 3 months. He goes to Kibana, gets routed to the replica shard, and the replica shard in turn "mounts" the snapshot where this data is stored.


(Magnus Bäck) #4
  1. With just three nodes I suspect this is premature optimization that leaves the third node mostly idle while the first two are working hard.
  2. This is getting into territory that I'm not too familiar with.
  3. Well, if you want real-time searches of the data, this is the price you pay. 30 GB/day does indeed add up to about 11 TB per year.
  4. You'd still have to pay the disk penalty for the snapshot itself so wouldn't the gain be rather limited in terms of disk usage (which seems to be your primary concern)? But sure, not having the indexes online would certainly reduce the heap usage. Anyway, this is something you'd have to build yourself.
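To make the "build it yourself" part concrete: the usual pattern is to snapshot indexes older than N months into a shared-filesystem repository, delete them from the cluster, and restore them on demand when someone needs that period. A rough sketch with hypothetical names (the repository path and index names are assumptions, and the location must be whitelisted on every node). Note that snapshot compression applies only to metadata files, not to the data files themselves, which is part of why the disk gain is limited:

```
PUT /_snapshot/log_archive
{
  "type": "fs",
  "settings": { "location": "/mnt/backups/log_archive" }
}

PUT /_snapshot/log_archive/logs-2015-03?wait_for_completion=true
{
  "indices": "logs-2015.03.*"
}

DELETE /logs-2015.03.*

POST /_snapshot/log_archive/logs-2015-03/_restore
```

The restore step is the manual part you'd have to wire up (or script) whenever a user asks for an archived period; Kibana can only search what is restored and online.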
