We have a small, single-node Elastic Stack installation running. Once we reach about 2 billion documents, the installation starts to have trouble keeping up, so I've configured Curator to delete indices from Elasticsearch that are older than two weeks. Things are tidy and all is good.
I've now been tasked with keeping a subset of that data without loss of granularity. Does anyone out there currently run a solution for this, or have thoughts on the best way to do it? The raw data is JSON in .txt files that are dumped into a folder Logstash ingests via the file input.
Brainstormed Ideas
1. Copy the raw .txt files to an archival location. When the need arises to look at some timeframe of data points, we drop the raw files back into the ingest folder so they are re-ingested into the stack. The downside is the extra load this puts on the stack; in addition, duplicate data exists across multiple files (document IDs prevent event duplication in Elasticsearch).
2. Use the file output plugin to write the data to a file, and when we need to analyze historical data, use a purpose-built pipeline that essentially passes the data straight to Elasticsearch. The question I'm left with is: can the file output be configured to start a new file over a given time frame or file size? Ideally I'd like it to write a new file every day, with the end result gzipped, but what kind of buffer/processing time would that require? (A rough sketch of what I'm picturing follows this list.)
3. Export the Elasticsearch index, which is recreated daily. How this is done, or whether it is even possible, I don't know; I may need to pose that question in the Elasticsearch forums.
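For idea 2, roughly what I have in mind is something like the following (untested sketch; the conditional, path, and index names are placeholders for whatever the pipeline actually uses). As far as I can tell, a date pattern in the file output's path starts a new file each day, and the plugin has a gzip option to compress as it writes:

```
output {
  if [type] == "cloudflare" {                 # hypothetical condition; match however your events are tagged
    file {
      # The date pattern in the path means a new file is created each day
      path  => "D:/archive/cloudflare-%{+YYYY-MM-dd}.json.gz"
      codec => "json_lines"                   # one JSON document per line, easy to re-ingest later
      gzip  => true                           # compress events as they are written
    }
  }
}
```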
Single-node cluster. Server 2012 R2 OS, 48 GB of RAM, 6 CPUs running at 2 GHz. Logstash and Elasticsearch each have 17 GB assigned to their heaps. This is actually a new spec in terms of RAM, as of yesterday. Prior to increasing the heaps, when we experienced the issues, Logstash only had 5 GB and Elasticsearch had 6 GB.
Note - we haven't answered your original question because there seems to be a problem with the way things are set up. 2 billion docs / 1.1 TB of data on a single node is not much, so maybe we can resolve that.
That's an awful lot of heap for Logstash; why is it so large?
Also, I think your shard size is too small. If you are using daily indices then you should look at weekly or monthly ones, aiming for shards of <50GB.
As I said, it's a new heap size as of yesterday. I was expressing our issues with compute constraints the other day and our infrastructure guy wanted to throw RAM at it to see what that would do. I just gave both applications a bigger chunk of RAM to play with. This morning I looked at it and saw a minimal uptick in usage by Logstash so it will probably be backed down to 8GB.
Our Winlogbeat indices are right around 35-50GB with an average of about 38GB. The subset of data I need to retain is CloudFlare logs which currently come out to about 3GB/day, though we expect that to increase to roughly 40GB once we fully migrate all our sites onto CloudFlare. That said, how can I go about setting index boundaries to weekly or monthly?
Also, while I definitely appreciate the performance tuning, I still need to figure out a good archival method for the data. The thought is that we retain it for years, and I doubt the current setup can maintain six months of data, let alone years' worth.
@warkolm, to create indices on a weekly basis, would I change the Elasticsearch output index to index => "index-%{+YYYY.w}", and then for monthlies use M instead of w?
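In other words, something along these lines (untested; the hosts and index prefix are placeholders, and I'm assuming the ISO week pattern xxxx.ww is the safer choice for weekly indices, since mixing the calendar year YYYY with a week number can behave oddly around year boundaries):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]          # placeholder
    # Weekly: xxxx = ISO week-based year, ww = ISO week number
    index => "cloudflare-%{+xxxx.ww}"
    # Monthly alternative:
    # index => "cloudflare-%{+YYYY.MM}"
  }
}
```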
Anybody have any suggestions on this? I'm trying the file output, but it creates massive raw text files that are unwieldy unless compressed, and I don't know of a mechanism to compress the files automatically.
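Assuming the file output's gzip option covers the compression side, the replay half of idea 2 would look roughly like this (untested sketch; it assumes a file input plugin recent enough to support read mode, and the paths and the document id field are placeholders for whatever the real pipeline uses):

```
input {
  file {
    path => "D:/archive/restore/*.json.gz"       # placeholder: drop archives to replay here
    mode => "read"                               # read whole files (including gzipped ones) rather than tail them
    sincedb_path => "NUL"                        # don't persist read state for a one-off replay (Windows)
    file_completed_action => "log"               # don't delete the source archive when finished
    file_completed_log_path => "D:/archive/restore/completed.log"
    codec => "json"                              # each line is one JSON event
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]            # placeholder
    index       => "cloudflare-restore-%{+xxxx.ww}"
    document_id => "%{[RayID]}"                  # hypothetical field; reuse whatever id scheme prevents duplicates
  }
}
```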