Hi there,
I would like to periodically generate an index whose data will come from the outcome of a SQL query.
I'm aware this is not the traditional use case, as the data are neither logs nor cumulative, but Logstash gives me a good starting point: it makes it very easy to run the query via JDBC and map the data to the index.
Ideally the flow would be something like:
Index-A-Yesterday already exists and ES alias for Index-A points to that index
Call LogStash to create Index-A-Today
Warm-up Index-A-Today (optional step)
Call ES API to switch Index-A alias to point towards Index-A-Today
Delete Index-A-Yesterday
Is there a way to do this via Logstash? I can't find one.
I've seen this question asked here before without a satisfactory answer, and since that topic is closed I'm trying again. Worst case, I will get an official "not possible".
Not possible in Logstash, currently or perhaps ever.
While it is possible to create an index based on a field value, the place where you are likely to hit a wall is aliasing. Logstash has no mechanism at all to rotate aliases.
The common misunderstanding is that Logstash creates indices at all. It doesn't. This is an important distinction. Once you understand the actual flow, the reason why the flow you described cannot work will become clear.
Logstash sends a bulk request to Elasticsearch requesting that a given document (or log line, or DB row, or whatever the event is) be indexed in the index named "logstash-YYYY.MM.dd" (or whatever you have specified for index =>). Elasticsearch interprets these requests, and if the index named in the bulk request exists, it puts the document there. If it doesn't exist, Elasticsearch creates the index and puts the document there. At no point does Logstash ever call the create index API itself. This is the reason why Logstash cannot do alias rotation. It never performs any timed behaviors at all.
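To make the point above concrete, here is a minimal sketch of what a bulk request body looks like. It is not Logstash's actual code; the function name and the sample event are made up for illustration, and the date is taken from the clock here for simplicity (Logstash derives it from the event's timestamp). Older Elasticsearch versions also expect a `_type` in the action line or the URL, which is omitted here.

```python
import json
from datetime import datetime, timezone


def bulk_body(events, index_pattern="logstash-%Y.%m.%d", now=None):
    """Build an NDJSON _bulk body in which every action names its target index."""
    now = now or datetime.now(timezone.utc)
    index_name = now.strftime(index_pattern)
    lines = []
    for event in events:
        # Action line: only says "index this document into <index_name>".
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(event))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_body([{"message": "row 1"}, {"message": "row 2"}]))
```

Nothing in this body asks for an index to be created or an alias to be moved. The sender only names a target, and Elasticsearch decides whether that index has to be created first.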
It may be of benefit for you to reconsider your data storage model to allow for other indexing and management possibilities. One such is the _rollover API, which can roll over indices and simultaneously do the alias switching that you've described. The downside of this approach (by itself) is that there's a fractional chance of getting some of the next day's data in "today's" index, or vice versa. But it might be just as useful to not keep daily indices, and let your application do the date filtering for you, rather than rely on date-range indices.
Now, with that explanation out of the way, there are potential work-arounds that might make what you want to do possible. One such is to use Elasticsearch Curator:
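As a rough sketch of the _rollover option, the request is a POST against the alias with a set of conditions. The alias name `index-a` and the helper name are assumptions for illustration. Rollover also expects the alias to point at a single index whose name ends in a number (for example `index-a-000001`), unless a new index name is supplied explicitly.

```python
import json


def rollover_request(alias, max_age="1d"):
    """Return (method, path, body) for a _rollover call on the given alias."""
    path = "/{}/_rollover".format(alias)
    # The conditions are evaluated only at the moment this request is made.
    body = {"conditions": {"max_age": max_age}}
    return "POST", path, json.dumps(body)


if __name__ == "__main__":
    print(rollover_request("index-a"))
```

Because the conditions are checked only when the call is made, documents that arrive around the switch can end up on either side of it, which is the mixing risk mentioned above.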
Disable automatic index creation
This approach would simply make Logstash spool up traffic to the new index until some other process created it. This would be safest with Logstash 5.4, which has the official release of persistent queues, so the data would safely spool to the local disk on Logstash until "connectivity" was restored to the new index.
Use Curator to create the new index
This would have to be done via cron to run at exactly 00:00 UTC time, as that's when index rollover happens.
Use Curator to change the alias
This could easily be done as a subsequent step/action in the same configuration file.
The downside to this approach is that if something doesn't run right (misconfiguration, for example), you could be spooling data to Logstash's persistent queues for a while. No data would be lost, but it would take a while to empty those queues out if they build up to a large size.
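For reference, the "create the new index" step comes down to one PUT request against the dated index name. The sketch below only builds that request; the prefix `index-a` and the function names are hypothetical, and this is not how Curator itself is implemented. Note that Elasticsearch index names must be lowercase, so a name like Index-A-Today would have to be lowercased.

```python
from datetime import datetime, timezone


def daily_index_name(prefix, day):
    """Name of the daily index for the given day, e.g. index-a-2017.05.10."""
    if prefix != prefix.lower():
        raise ValueError("Elasticsearch index names must be lowercase")
    return "{}-{}".format(prefix, day.strftime("%Y.%m.%d"))


def create_index_request(prefix, now=None):
    """Return (method, path) for creating today's index, using UTC dates."""
    now = now or datetime.now(timezone.utc)
    return "PUT", "/" + daily_index_name(prefix, now)


if __name__ == "__main__":
    print(create_index_request("index-a"))
```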
Thanks @theuntergeek for such detailed and clear information!
I think I will go for a custom application/script as in our context:
It is not acceptable to mix data from different batches. This rules out the _rollover approach.
I don't know how often we will refresh the data (plus I guess we will change the periodicity in the future).
We want to be able to manually force an update of the index easily (just in case). These last two points rule out the Curator approach (as it's bound to a specific time).
Well, if you code or script in Python, feel free to use the Curator API. This would allow you to re-use the parts that make sense, and code around what doesn't—without having to reinvent the wheel, as it were.
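A custom script for the flow discussed in this thread could look roughly like the sketch below. It uses plain REST calls rather than the Curator API, and everything named here is an assumption for illustration: `send(method, path, body)` stands in for whatever HTTP client you use, `load(index_name)` stands in for the step that fills the new index (for example a Logstash JDBC run pointed at that index), and the index names are made up.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases; both actions are applied atomically."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def refresh_alias(send, load, alias, old_index, new_index):
    """Run one refresh: load the new index, swap the alias, delete the old index."""
    # 1. Fill the new index completely before anything points at it.
    load(new_index)
    # 2. Move the alias in a single request so readers never see a gap.
    send("POST", "/_aliases", json.dumps(alias_swap_body(alias, old_index, new_index)))
    # 3. Only then remove the previous batch.
    send("DELETE", "/" + old_index, None)


if __name__ == "__main__":
    # Dry run: print the requests instead of sending them.
    refresh_alias(
        send=lambda method, path, body: print(method, path, body),
        load=lambda index_name: print("loading", index_name),
        alias="index-a",
        old_index="index-a-batch1",
        new_index="index-a-batch2",
    )
```

Since nothing here depends on the clock, the same function can be called from cron at any interval or run by hand to force an update, and each batch lives in its own index so batches are never mixed.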