Why is housekeeping not built into elasticsearch?

ThomasB · April 28, 2017, 12:55am

Hi,
I have a (hopefully) quick question regarding index management and deleting time-based indices: why is there no built-in support for deleting all indices older than X days?

I know there is Curator, which is built for exactly this use case, but it seems a bit odd to me to rely on an "external" tool to do the housekeeping in a product that is built with time-series data in mind. I think there is something implemented in X-Pack Monitoring, but I am not sure.
I can't think of any way to do this natively in Elasticsearch without requiring some sort of trigger from the outside world.
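
For readers who want to see what such an externally triggered job boils down to, here is a minimal sketch of the selection step in Python. It assumes daily indices named like `logstash-2017.04.28`; the function name `expired_indices` and its parameters are made up for illustration and are not part of Elasticsearch or Curator.

```python
from datetime import date, datetime, timedelta


def expired_indices(names, today, max_age_days,
                    prefix="logstash-", date_format="%Y.%m.%d"):
    """Return the index names whose date suffix is older than max_age_days.

    The prefix and date format are assumptions about the naming scheme;
    adjust them to match your own indices.
    """
    cutoff = today - timedelta(days=max_age_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # not one of our time-based indices
        try:
            day = datetime.strptime(name[len(prefix):], date_format).date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


names = ["logstash-2017.04.27", "logstash-2017.03.01",
         ".kibana", "logstash-2017.02.14"]
old = expired_indices(names, today=date(2017, 4, 28), max_age_days=30)
print(old)
```

Each returned name would then be removed with a delete index request (`DELETE /<index>`), issued by whatever triggers the job, for example cron.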

I had a discussion about running Elasticsearch inside a container, and a colleague of mine is so opposed to having a second process running inside that container that he argued for writing an Elasticsearch plugin to regularly purge old indices.

Is it that curator was just good enough until now and nobody really needed it?
Or is it that you think it is not Elasticsearch's job to do this (since cron is a well-established tool and a possible implementation would more or less look like cron)?

BR

a.w · April 28, 2017, 1:24am

Hey,

I don't know if I am right, so correct me if I'm wrong, but in my opinion Elasticsearch is not, and never will be, a database for storing data.
For that you should use CouchDB, MariaDB, or any other SQL or NoSQL database you can find.
Elasticsearch is an index, which can only index and search documents/indices, but it does that extremely well!

If you have software with a common architecture where, e.g., CouchDB is your database and Elasticsearch is your search cluster, you need some kind of software running between these two components anyway to share and maintain the data between the database and Elasticsearch.

This way you already have to manage and control the data yourself. As a software developer, I always write jobs like this myself to keep control, and I would not use a tool built into Elasticsearch for deleting old documents. Query them and delete them. Done.
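
As a rough illustration of the "query them and delete them" approach, the sketch below builds a request for the `_delete_by_query` API with a range query and date math. The index name `logs`, the field `@timestamp`, and the helper `delete_by_query_request` are assumptions for the example; the helper only builds the path and body and does not send anything.

```python
import json


def delete_by_query_request(index, field, older_than):
    """Build the path and JSON body for deleting documents older than a given age.

    older_than uses Elasticsearch date-math units, e.g. "30d".
    The index and field names are whatever your own data uses.
    """
    path = "/{}/_delete_by_query".format(index)
    body = {"query": {"range": {field: {"lt": "now-{}".format(older_than)}}}}
    return path, json.dumps(body, sort_keys=True)


path, body = delete_by_query_request("logs", "@timestamp", "30d")
print("POST", path)
print(body)
```

The resulting body would be sent as a `POST` to the printed path by your own job, using whichever HTTP client you prefer.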

I guess that's the reason why there is no such thing: normally you would never "just" use Elasticsearch, feed it with data, and remove the oldest. It's nearly always much more complex than that.

ThomasB · April 28, 2017, 9:26am

Hey,
I think it depends on the use case. We are just collecting some log files (a few GB per day) and we are using "just" elasticsearch to store the data. In a simple use case like this, there is nothing wrong with using elasticsearch as data store.
Maybe a feature like this would provide too little benefit for the complexity it requires, while not being able to live up to the expectations people might have.

theuntergeek · April 28, 2017, 7:57pm

Are you referring to Curator being built with time-series data in mind, or Elasticsearch? Elasticsearch was certainly not built with time-series data in mind. It started off as a good old-fashioned search engine. Logs, analytics, and other stuff came later, and with it, the idea (or requirement) for data to be segmented into time-series.

Not yet, but coming soon.

There isn't. And even in the future, you'll still have to design and/or write said trigger yourself. Though, with multiple iterations and an eventual UI, it will become easier over time.

Curator does not need to run in the same container as Elasticsearch. It can run from anywhere, so long as it has TCP access to the port Elasticsearch is listening on.

Curator is open source, and has been "good enough" for several years. It may not have been everything to everybody, but it does a lot. There have been many requests for a Curator UI in Kibana, or a dedicated plugin in Elasticsearch, or X-Pack. They've been considered. Some of these features are coming in a future version of X-Pack.

In not so many words, yes. Index management is a very situational decision, and it's one that has changed dramatically since Elasticsearch started. Consider the new _rollover API, which allows you to keep "time-series" data on a more or less arbitrary schedule. Moreover, you can roll over when a certain document count has been reached, keeping index and shard counts far lower than daily or hourly indices would. Elasticsearch has been providing the tools to manage your indices the way you'd like to, rather than providing a one-size-fits-all solution that would end up leading to other complaints.
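
To make the rollover idea concrete, here is a minimal sketch that builds a `_rollover` request with age and document-count conditions. The alias name `logs_write`, the threshold values, and the helper `rollover_request` are illustrative assumptions; the helper only constructs the path and body and sends nothing.

```python
import json


def rollover_request(alias, max_age=None, max_docs=None):
    """Build the path and JSON body for a rollover request on a write alias.

    Only the conditions that are given are included in the body.
    """
    conditions = {}
    if max_age is not None:
        conditions["max_age"] = max_age    # e.g. "7d"
    if max_docs is not None:
        conditions["max_docs"] = max_docs  # e.g. 1000000
    path = "/{}/_rollover".format(alias)
    return path, json.dumps({"conditions": conditions}, sort_keys=True)


path, body = rollover_request("logs_write", max_age="7d", max_docs=1000000)
print("POST", path)
print(body)
```

Something still has to send this request periodically, which is the external trigger discussed earlier in the thread.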

You have essentially asked, "Why doesn't Elasticsearch do this on its own?" which could be considered a mild complaint. If Elasticsearch provided a built-in but narrowly scoped solution for index management, we would get complaints about the narrowness of the scope: "Why isn't this feature here?" "Why can't I do x, y, or z?" Instead we provide the tools to make it happen the way you want, and even provide Curator, since it handles most of the common use cases. This is why @a.w states quite clearly:

Still, we understand that there are an increasing number of users who want turnkey solutions to this. Curator tries hard to strike a balance between a high amount of configurability, and a high degree of usability. It's not an easy balance to maintain. We've had multiple discussions over the past year about how to add Curator-like functionality to X-Pack, and a clean way forward hasn't materialized yet. You hit the nail on the head, @ThomasB:

Curator has provided the right amount of benefit at the cost of only a single developer working on it part-time.

system · May 26, 2017, 8:10pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.