We have a small, single-node Elastic Stack installation running. Once we reach about 2 billion documents, it starts to have trouble keeping up, so I've configured Curator to delete Elasticsearch indices older than two weeks. Things are tidy and all is good.
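For context, the Curator side is a plain action file, roughly along these lines (the index prefix and timestring are illustrative, not my exact pattern):

```yaml
actions:
  1:
    action: delete_indices
    description: Delete indices older than two weeks
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-        # placeholder prefix
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 14
```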
I've now been tasked with keeping a subset of that data without losing granularity. Does anyone out there currently run a solution for this, or have thoughts on the best way to do it? The raw data is .txt files of JSON that are dumped into a folder, which Logstash ingests via the file input.
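For reference, the ingest side is just the stock file input, something like this (paths are placeholders):

```
input {
  file {
    # Watch the drop folder for new raw files; each line is a JSON event.
    path => "/data/ingest/*.txt"
    codec => "json"
  }
}
```

The options I've come up with so far: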
Option 1: Copy the raw .txt files to an archive location. When the need arises to look at some time frame again, we drop the raw files back into the ingest folder and let them be re-ingested into the stack. The downsides are the extra load this puts on the stack, and the duplicate data that exists across multiple files (document id'ing prevents event duplication in Elasticsearch; see the sketch below).
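The id'ing I mean is just a fingerprint filter feeding document_id, so a re-ingested event overwrites itself instead of duplicating. Roughly (hosts and field names are placeholders):

```
filter {
  # Hash the raw message into a stable id stored in metadata.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # Same content => same id => overwrite rather than duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```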
Option 2: Use the file output plugin to write the data to a file, then, when we need to analyze historical data, feed it through a purpose-built pipeline that passes it straight back to Elasticsearch. The question I'm left with: can the file output be configured to roll to a new file after a given time frame or file size? Ideally it would write a new file every day, gzipped at the end, but what kind of buffering/processing time would that require?
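If I'm reading the file output docs right, path interpolates each event's @timestamp and there is a gzip flag, so daily gzipped files might be as simple as this (path is a placeholder):

```
output {
  file {
    # The date from each event's @timestamp is baked into the path,
    # so events naturally roll into one file per day.
    path => "/archive/raw-%{+YYYY-MM-dd}.json.gz"
    gzip => true
    codec => json_lines
  }
}
```

As far as buffering goes, the plugin's flush_interval setting (default 2 seconds) appears to control how often it flushes to disk, but I'd appreciate confirmation from anyone running this in production.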
Option 3: Export the Elasticsearch index itself, which is recreated daily. I don't know how that would be done, or whether it's even possible; I may need to pose the question in the Elasticsearch forums.
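The closest thing I've turned up so far is the snapshot/restore API, which can apparently target individual indices; if that's viable it might look roughly like this (repository name, paths, and index names are placeholders):

```
# One-time: register a filesystem repository (the location must be
# listed under path.repo in elasticsearch.yml).
PUT _snapshot/archive_repo
{
  "type": "fs",
  "settings": { "location": "/mnt/archive/es-snapshots" }
}

# Daily: snapshot just that day's index.
PUT _snapshot/archive_repo/snap-2019.01.01?wait_for_completion=true
{
  "indices": "logstash-2019.01.01"
}
```

But I'd want someone on the Elasticsearch side to confirm that approach before I rely on it.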