I have a device that sends 500 GB of text data (logs) per day to my central server. I want to design a system that lets a user:
Apply exact-match filters and go through data using pagination
Export PDF/CSV reports for the same query as above
Data can be stored for a maximum of 6 months. It's an on-premise solution. Some delay on queries is acceptable. If we can compress the data, that would be great. I have a system with 512 GB RAM, 80 cores and TBs of storage (these are upgradable).
What I have tried/found out:
Tech stack I am planning to use: MEAN stack for application development. For the core data part I am planning to use the ELK stack. The recommended ideal size for a single Elasticsearch index is under 40-50 GB. So my plan is to create 100 indexes per day, each of 5 GB, for each device (100 * 5 = 500 GB).
At query time I can sort these indices by name (e.g. 12_dec_2012_part_1 ...), search each index in turn, and continue until the range the user asked for is covered. (I think this will work for ad hoc requests, but for reports, going through the indices sequentially and writing to a CSV file will take a long time.)
For reports, I think the best thing I can do is create a PDF/CSV for each index (5 GB each), because most file viewers cannot open very large CSV/PDF files.
I am new to big data problems. I am not sure which approach is right for this: ELK, the Hadoop ecosystem, or a combination of both. (I would like to go with ELK.)
I am planning to use a 1-node cluster. It's an on-prem solution and we may not get multiple machines for a multi-node cluster. ELK will be deployed in Docker.
Questions:
Is this a good design for a problem like this? Please suggest. Is it even feasible to do it this way?
Can 1 node have 10,000+ indexes? I will be querying 1 index (5 GB) at a time, by a single user at a time.
"My plan is to create 100 indexes per day, each of 5 GB, for each device (100 * 5 = 500 GB)."
This is probably the wrong approach to solving this issue. This will generate a significant number of small shards which you generally don't want. There are a few better ways of doing this:
If the data you're working with is time-based, look into using data streams with index lifecycle management (ILM). You can have a single "index" that data is written to, and have it automatically roll over when the current backing index gets near the desired size. Depending on the number of nodes you have, you can tune the number of shards and replicas for the backing indices to allow for higher write/read throughput.
If your data isn't time-based, then you should still stick to one "index", but adjust the number of backing shards to get shard sizes as close to 50 GB as possible.
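As a rough illustration of the data stream plus ILM suggestion, here is a minimal sketch of the two request bodies involved, written as Python dictionaries and printed as JSON. The names `device-logs-policy`, `device-logs-template` and the pattern `device-logs-*` are made up for this example. The 50 GB rollover threshold comes from the advice above, 180 days stands in for the 6-month retention, and `best_compression` is included because compression was asked about. The `max_primary_shard_size` condition exists in recent Elasticsearch versions; older versions may need `max_size` instead.

```python
import json

# Hypothetical names, chosen only for this example.
POLICY_NAME = "device-logs-policy"
TEMPLATE_NAME = "device-logs-template"
DATA_STREAM_PATTERN = "device-logs-*"

# Body for: PUT _ilm/policy/device-logs-policy
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over to a new backing index once a primary shard nears 50 GB.
                    "rollover": {"max_primary_shard_size": "50gb"}
                }
            },
            "delete": {
                # Assumption: "6 months" is approximated as 180 days after rollover.
                "min_age": "180d",
                "actions": {"delete": {}},
            },
        }
    }
}

# Body for: PUT _index_template/device-logs-template
index_template = {
    "index_patterns": [DATA_STREAM_PATTERN],
    "data_stream": {},  # matching names are created as data streams
    "template": {
        "settings": {
            "index.lifecycle.name": POLICY_NAME,
            "index.number_of_shards": 1,
            # Assumption: 0 replicas only suits a single-node setup; raise it with more nodes.
            "index.number_of_replicas": 0,
            # Trades some CPU for smaller stored fields on disk.
            "index.codec": "best_compression",
        }
    },
}

print(json.dumps(ilm_policy, indent=2))
print(json.dumps(index_template, indent=2))
```

With something like this in place, the application writes to one data stream name and never has to create or name daily part indexes itself.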
"At query time I can sort these indices by name (e.g. 12_dec_2012_part_1 ...), search each index in turn, and continue until the range the user asked for is covered."
You shouldn't need to do this. Elasticsearch has a lot of this complex logic/optimization built in already, and trying to do it outside of Elasticsearch is probably less efficient.
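To make that concrete, here is a minimal sketch of a single search request body that applies exact-match filters and a time range, and pages with `search_after`, instead of walking indices by hand. The field names `device_id` and `event_id` are hypothetical: the sketch assumes the filter fields are mapped as `keyword` and that `event_id` is a unique field usable as a sort tiebreaker. The body would be sent to the `_search` endpoint of the whole data stream or index pattern.

```python
import json
from typing import Any, Dict, List, Optional


def build_page_query(
    filters: Dict[str, str],
    start: str,
    end: str,
    page_size: int = 100,
    search_after: Optional[List[Any]] = None,
) -> Dict[str, Any]:
    """Build a _search body: exact-match filters, a time range, one page of hits."""
    # "term" matches the exact stored value; filter context skips scoring.
    clauses: List[Dict[str, Any]] = [
        {"term": {field: value}} for field, value in filters.items()
    ]
    clauses.append({"range": {"@timestamp": {"gte": start, "lt": end}}})

    body: Dict[str, Any] = {
        "size": page_size,
        "query": {"bool": {"filter": clauses}},
        # A stable sort is required for search_after; event_id is a hypothetical tiebreaker.
        "sort": [{"@timestamp": "desc"}, {"event_id": "desc"}],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = search_after
    return body


first_page = build_page_query({"device_id": "dev-42"}, "2024-01-01", "2024-01-02")
print(json.dumps(first_page, indent=2))
```

For the next page, pass the `sort` values of the last hit from the previous response as `search_after`. For a consistent view across pages, the Elasticsearch documentation suggests combining `search_after` with a point in time (PIT).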
"Can 1 node have 10,000+ indexes? I will be querying 1 index (5 GB) at a time, by a single user at a time."
A few points here:
1 node is not recommended for production use, because it gives you no high availability. For production deployments, it is generally recommended to have a minimum of 3 master-eligible Elasticsearch nodes.
"I have a system with 512 GB RAM, 80 cores and TBs of storage (these are upgradable)."
This would be a lot of RAM and CPU for a single Elasticsearch node. You can probably run multiple Elasticsearch instances on this single machine to make better use of the resources.
With Elasticsearch, you should only allocate ~30 GB of RAM to the heap of each node; the rest is used by the OS and the file system cache. With this amount of resources, you could probably run ~3-6 Elasticsearch nodes on the machine.
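A quick back-of-the-envelope calculation shows how the memory splits for the node counts mentioned above. The 512 GB total and the ~30 GB heap figure come from this thread; the function name and the choice of node counts are just for illustration.

```python
TOTAL_RAM_GB = 512
HEAP_PER_NODE_GB = 30  # the ~30 GB per-node heap figure from the answer


def memory_plan(node_count: int) -> dict:
    """Split the machine's RAM into total JVM heap and what is left for the OS."""
    heap_total = node_count * HEAP_PER_NODE_GB
    return {
        "nodes": node_count,
        "heap_total_gb": heap_total,
        # Everything not given to heaps stays available to the OS and file system cache.
        "left_for_os_cache_gb": TOTAL_RAM_GB - heap_total,
    }


for n in (1, 3, 6):
    print(memory_plan(n))
```

Even with 6 nodes, most of the RAM remains outside the heaps, where the file system cache can use it.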
"I will be querying 1 index (5 GB) at a time, by a single user at a time."
Again, I wouldn't try to implement this logic yourself, let Elasticsearch handle the search logic.
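For a sense of scale on the index-count question, the arithmetic below compares how many indexes would be retained under the 5 GB plan with how many ~50 GB units a size-based rollover would keep. It assumes 180 days of retention, one primary shard per index, no replicas, and that the on-disk size equals the raw 500 GB per day (compression would lower it).

```python
import math

DAILY_GB = 500        # from the question
RETENTION_DAYS = 180  # assumption: 6 months taken as 180 days


def retained_units(unit_size_gb: int) -> int:
    """Number of indexes kept at once if each one holds unit_size_gb of data."""
    per_day = math.ceil(DAILY_GB / unit_size_gb)
    return per_day * RETENTION_DAYS


print(retained_units(5))   # 100 per day * 180 days = 18000
print(retained_units(50))  # 10 per day * 180 days = 1800
```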
"What is the best way for reports (CSV/PDF)?"
There are a few different ways here:
Kibana supports a nice out-of-the-box experience for end-user data exports via its reporting feature.
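Separately from Kibana, the concern in the question about files too large to open can be handled in application code by splitting an export into parts. The sketch below is one possible approach, not something named in the answer above: it takes any iterable of row dictionaries (for example, hits collected page by page) and writes them to numbered CSV files. The function name, the file naming scheme and the default of 100,000 rows per file are all assumptions for illustration.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List


def write_csv_parts(
    rows: Iterable[Dict[str, str]],
    columns: List[str],
    out_dir: Path,
    rows_per_file: int = 100_000,
) -> List[Path]:
    """Stream rows into numbered CSV files, starting a new file every rows_per_file rows."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    writer = None
    count_in_file = 0
    try:
        for row in rows:
            if handle is None or count_in_file >= rows_per_file:
                if handle is not None:
                    handle.close()
                path = out_dir / f"report_part_{len(paths) + 1:04d}.csv"
                handle = path.open("w", newline="", encoding="utf-8")
                # Fields not listed in columns are dropped; missing ones are left empty.
                writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
                writer.writeheader()
                paths.append(path)
                count_in_file = 0
            writer.writerow(row)
            count_in_file += 1
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Because rows are consumed one at a time, memory use stays flat regardless of how large the full export is.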