500GB of text data per day - how to design an ELK solution

I have a device that sends 500GB of text data (logs) per day to my central server. I want to design a system with which a user can:

  1. Apply exact-match filters and page through the data
  2. Export PDF/CSV reports for the same query as above

Data needs to be stored for a maximum of 6 months. It's an on-premise solution. Some delay on queries is acceptable. If we can compress the data, that would be great. I have a 512GB RAM, 80-core system with TBs of storage (these are upgradable).

What I have tried/found out:

Tech stack I am planning to use: the MEAN stack for application development, and the ELK stack for the core data part. A single Elasticsearch index has an ideal size recommendation of under 40-50GB, so my plan is to create 100 indices per day, each of 5GB, for each device (100 * 5 = 500GB).
At query time I can sort these indices by name (e.g. 12_dec_2012_part_1, ...) and search each index linearly, continuing until I cover the range the user has asked for. (I think this will work for ad-hoc requests, but for reports, writing to a CSV file by going through the indices sequentially one by one will take a long time.)
For reports, I think the best thing I can do is create a PDF/CSV per index (5GB each), because most file openers cannot handle very large CSV/PDF files.

I am new to big data problems, and I am not sure which approach is right for this: ELK, the Hadoop ecosystem, or a combination of both. (I would like to go with ELK.)

I am planning to use a 1-node cluster - it's an on-prem solution and we may not get multiple machines for a multi-node setup. ELK will be deployed with Docker.

Questions:

  1. Is this a good design approach for problems like this? Is it even feasible to do it this way? Please suggest.
  2. Can 1 node have 10,000+ indices? I will be querying 1 index (5GB) at a time, for a single user at a time.
  3. What is the best way to generate reports (CSV/PDF)?

Please suggest how to proceed. Thanks!

A few things:

my plan is to create 100 indices per day, each of 5GB, for each device (100 * 5 = 500GB)

This is probably the wrong approach to this problem. It will generate a significant number of small shards, which you generally don't want. There are a few better ways of doing this:

  1. If the data you're working with is time-based, look into using data streams with index lifecycle management (ILM). You can have a single "index" that data is written to, and have it automatically roll over when the current backing index gets near the desired size (see the sketch after this list). Depending on the number of nodes you have, you can tune the shard and replica counts of the backing indices to allow for higher write/read throughput.
  2. If your data isn't time-based, you should still stick to one "index", but adjust the number of backing shards to get shard sizes as close to 50GB as possible.
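
As a rough illustration, here is how the ILM policy and data stream template might be created from a Node/TypeScript backend (which fits your MEAN stack). This is a minimal sketch assuming the @elastic/elasticsearch v8 client; the policy/template names, the `logs-device-*` pattern, and the thresholds are placeholders to adjust for your setup:

```ts
import { Client } from '@elastic/elasticsearch'

const client = new Client({ node: 'https://localhost:9200' })

async function setupDataStream() {
  // ILM policy: roll the write index over at ~50GB per primary shard (or daily),
  // and delete backing indices once they pass the 6-month retention window.
  await client.ilm.putLifecycle({
    name: 'device-logs-policy',
    policy: {
      phases: {
        hot: {
          actions: {
            rollover: { max_primary_shard_size: '50gb', max_age: '1d' },
          },
        },
        delete: {
          min_age: '180d',
          actions: { delete: {} },
        },
      },
    },
  })

  // Index template: anything matching 'logs-device-*' becomes a data stream whose
  // backing indices use the policy above and best_compression to reduce storage.
  await client.indices.putIndexTemplate({
    name: 'device-logs-template',
    index_patterns: ['logs-device-*'],
    data_stream: {},
    template: {
      settings: {
        'index.lifecycle.name': 'device-logs-policy',
        'index.codec': 'best_compression',
        number_of_shards: 1,
        number_of_replicas: 0, // single machine today; raise this once you have more nodes
      },
    },
  })
}

setupDataStream().catch(console.error)
```

After that, your ingest pipeline just writes to a stream name like `logs-device-1`, and Elasticsearch handles rollover, retention, and deletion for you.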

At query time I can sort these indices by name (e.g. 12_dec_2012_part_1, ...) and search each index linearly, continuing until I cover the range the user has asked for.

You shouldn't need to do this. Elasticsearch already has a lot of this complex logic/optimization built in, and trying to reimplement it outside of Elasticsearch is probably less efficient.
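
For example, a single search against the data stream name only fans out to the backing indices that can contain documents in the requested time range; you never have to enumerate them yourself. A minimal sketch, again assuming the v8 Node client and hypothetical stream/field names:

```ts
import { Client } from '@elastic/elasticsearch'

const client = new Client({ node: 'https://localhost:9200' })

async function searchRange() {
  // One query against the data stream; Elasticsearch picks the backing indices
  // whose data overlaps this time range and skips the rest.
  const result = await client.search({
    index: 'logs-device-1',
    size: 100,
    sort: [{ '@timestamp': 'desc' }],
    query: {
      bool: {
        filter: [
          { term: { 'log.level': 'ERROR' } }, // exact-match filter from the UI
          { range: { '@timestamp': { gte: '2023-12-01', lte: '2023-12-12' } } },
        ],
      },
    },
  })
  console.log(`returned ${result.hits.hits.length} documents`)
}

searchRange().catch(console.error)
```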

Can 1 node have 10,000+ indices? I will be querying 1 index (5GB) at a time, for a single user at a time.

A few points here:

  1. 1 node is not recommended for production use; there is no high availability. For production deployments, it is generally recommended to have a minimum of 3 master-eligible Elasticsearch nodes.
  2. I have a 512GB RAM, 80-core system with TBs of storage (these are upgradable)

    • This would be a lot of RAM and CPU for a single Elasticsearch node. You can probably run multiple Elasticsearch instances on this one machine to make better use of the resources.
    • With Elasticsearch, you can only allocate ~30GB of RAM to the heap per node; the rest is used by the OS and the filesystem cache. With this amount of resources, you could probably run ~3-6 Elasticsearch nodes on the machine.

I will be querying 1 index (5GB) at a time, for a single user at a time.

Again, I wouldn't try to implement this logic yourself; let Elasticsearch handle the search logic.
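
For the paginated UI specifically, `search_after` tends to scale better than `from`/`size` for deep pages. A sketch under the same assumptions (v8 Node client, made-up index and field names; `log.sequence` stands in for whatever unique tie-breaker field you have):

```ts
import { Client } from '@elastic/elasticsearch'

const client = new Client({ node: 'https://localhost:9200' })

// Fetch one page of 50 hits. Pass the previous page's last sort values as
// `searchAfter` to get the next page; the extra sort key keeps the order stable.
async function fetchPage(searchAfter?: (string | number)[]) {
  const page = await client.search({
    index: 'logs-device-1',
    size: 50,
    sort: [{ '@timestamp': 'desc' }, { 'log.sequence': 'desc' }],
    query: { bool: { filter: [{ term: { 'device.id': 'device-1' } }] } },
    search_after: searchAfter,
  })
  const hits = page.hits.hits
  return { hits, next: hits.at(-1)?.sort as (string | number)[] | undefined }
}

async function demo() {
  const first = await fetchPage()
  const second = await fetchPage(first.next) // the page after it
  console.log(first.hits.length, second.hits.length)
}

demo().catch(console.error)
```

If you need a consistent view while the user pages through live data, you can additionally open a point in time and pass its id with each request.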

What is the best way to generate reports (CSV/PDF)?

There are a few different ways here:

  1. Kibana supports a nice out-of-the-box experience for end-user data exports via its reporting feature.
  2. For really large data exports, elasticsearch-dump works well for CSV.
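
If you would rather build the CSV export into your own MEAN backend, the Node client's scroll helper lets you stream every matching document to disk without holding the whole result set in memory. A rough sketch, with the index name, field names, and output path assumed for illustration:

```ts
import { createWriteStream } from 'node:fs'
import { Client } from '@elastic/elasticsearch'

const client = new Client({ node: 'https://localhost:9200' })

async function exportCsv(path: string) {
  const out = createWriteStream(path)
  out.write('timestamp,device,level,message\n')

  // scrollDocuments is an async iterator over each hit's _source,
  // scrolling through the full result set under the hood.
  const docs = client.helpers.scrollDocuments<Record<string, any>>({
    index: 'logs-device-1',
    query: { bool: { filter: [{ term: { 'log.level': 'ERROR' } }] } },
  })

  for await (const doc of docs) {
    const row = [doc['@timestamp'], doc.device?.id, doc.log?.level, doc.message]
      .map((v) => `"${String(v ?? '').replace(/"/g, '""')}"`) // naive CSV escaping
      .join(',')
    out.write(row + '\n')
  }
  out.end()
}

exportCsv('./error-report.csv').catch(console.error)
```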
