Experiences in "how to manage large amounts of data" needed

Hi all,
here is my situation:
ELK: 6.3
Cluster: 8 Data nodes, each of them with ~24TB dedicated Elastic datastore
Daily data: ~250 GB primary data, ~520,000,000 documents
Indices are created daily; the index template is currently set to 8 primary shards (31.5 GB per shard) with 3 replicas for search performance.
Current problem: if my calculations are correct, I can retain the data for roughly 192 days, but it needs to be stored longer. Closing indices is not an option, as we may need to search over e.g. a whole year. Furthermore, these logs make up the majority of the data, but not all of it: other logs are written to a different index prefix (and are deleted after some time), so there should still be a little space left for more data.
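The 192-day figure above can be recomputed as a quick back-of-the-envelope sketch (assumptions: 1 TB = 1000 GB, and no allowance for the other index prefixes or disk watermarks):

```python
# Back-of-the-envelope retention estimate using the figures from this post.
nodes = 8
capacity_per_node_gb = 24 * 1000      # ~24 TB dedicated Elasticsearch storage per node
daily_primary_gb = 250                # primary data indexed per day
replicas = 3                          # replica copies per shard

total_capacity_gb = nodes * capacity_per_node_gb          # 192,000 GB
# Each document is stored once as a primary plus once per replica.
daily_total_gb = daily_primary_gb * (1 + replicas)        # 1,000 GB per day

retention_days = total_capacity_gb / daily_total_gb
print(retention_days)  # 192.0
```

In practice retention will be shorter, since Elasticsearch stops allocating shards to a node once its disk high watermark is reached.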

How would you (or how do you) handle the indices to increase retention while still being able to search over a larger time range? A full-text search without field filters over the last 24 h currently takes ~40 seconds.

What is the specification of your cluster, hardware and storage?

How many queries are you serving per second given you have set up so many replicas?

What type of queries are run? What is the use case?

How much data do you currently have in the cluster?

What is the specification of your cluster, hardware and storage?
8 nodes, each with: 24 cores (HT), 64 GB RAM, 24 TB of HDDs dedicated to Elasticsearch; 4 of them also run Logstash as an indexer.

How many queries are you serving per second given you have set up so many replicas?

Not many; we have dashboards and occasionally run ad-hoc queries for special searches.
We use it for central network logs. I chose so many replicas because of the slides at https://de.slideshare.net/swallez/black-friday-logs-scaling-elasticsearch (slide 57). I understood it as "the more data you are searching through, the more replicas you should have for search speed". So when I search billions of documents, I need the replicas for performance(?). The number of shards is based on index size: one shard should not be bigger than 50 GB, and one index is 250 GB.

What type of queries are run? What is the use case?
Dashboards and queries in the Discover tab; searching network logs.

How much data do you currently have in the cluster?
We only just started, so there is not much data yet.
Total Shards: 298
Documents: 1,562,516,987
Data: 2.8 TB ( = 700 GB primary)

You typically scale out the number of replicas in order to handle more concurrent queries. If you have few concurrent queries I would probably recommend having only 1 replica. This will cut the amount of data in the cluster in half compared to the current configuration.
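For reference, reducing the replica count on existing daily indices is a single settings call. A minimal sketch (the `logstash-*` index pattern is an assumption; substitute your own prefix):

```
PUT /logstash-*/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
```

The index template should be updated with the same setting so that newly created daily indices pick it up as well. With 1 replica instead of 3, the daily footprint drops from ~1 TB to ~500 GB, roughly doubling the possible retention.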

Does this have negative effects on queries that search over a bigger time range, e.g. a few days or a whole year?

If you have a limited number of concurrent queries and the queried shards are well distributed, it should not have much impact.

OK, thanks for the tip; I reduced the replicas to 1. Any more tips? Does it make sense to reindex the (last month's) daily indices into one big monthly index and delete the dailies?
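Consolidating dailies into a monthly index can be done with the reindex API; a sketch, assuming the daily indices follow the default `logstash-YYYY.MM.dd` naming (adjust to your own prefix):

```
POST /_reindex
{
  "source": { "index": "logstash-2018.07.*" },
  "dest":   { "index": "logstash-2018.07" }
}
```

After verifying that the document counts match, the daily indices can be deleted. Note that the monthly index's primary shard count is fixed at creation, so size it so that shards stay in a reasonable range (e.g. below ~50 GB each).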

If you want to put that much data on the nodes you will need to manage heap usage carefully, e.g. use large shards, force-merge to minimize the number of segments, optimize mappings to reduce heap usage, and possibly also use dedicated coordinating nodes for querying to minimize load on the data nodes.
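As an example of the force-merge step: once a daily index is no longer being written to, it can be merged down to a single segment (the index name here is an assumption):

```
POST /logstash-2018.08.01/_forcemerge?max_num_segments=1
```

Force-merging is I/O-intensive and should only be run on indices that no longer receive writes.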
