I have a Logstash configuration that parses a set of files (which are updated daily) and overwrites documents with the same "iccid" in a monthly index. From Kibana's perspective, the expected behaviour is to see the latest output every time the data source is updated, with each month's data frozen as it stood at the end of the last day of that month. The problem is that the first day's updates in a new month are still written to the previous month's index, something-%{+YYYY.MM}.
I understand this happens because @timestamp in Logstash is in UTC by default. @magnusbaeck said in this post that this is not configurable and not a problem, but in my case it is a problem.
Expected result: the last group of data should be dated 29 Feb in index something-2020.02. Instead, it is dated 1 March in the same index, something-2020.02. When the first batch of data comes in on 1 March, the actual UTC @timestamp is still 29 Feb, so the data is written to index something-2020.02 instead of creating a new index something-2020.03. My timezone is GMT+8.
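To make the date arithmetic concrete, here is a minimal Python sketch of the behaviour described above. It is not Logstash code: the function names, the `something` prefix default and the 00:30 update time are assumptions for illustration only. It just shows that an update arriving shortly after midnight on 1 March in GMT+8 still falls on 29 Feb in UTC, so a month pattern resolved from the UTC timestamp yields the February index.

```python
from datetime import datetime, timedelta, timezone

# GMT+8, the local timezone mentioned in the post.
GMT8 = timezone(timedelta(hours=8))


def monthly_index(ts, prefix="something"):
    """Index name derived from the UTC calendar month of the timestamp."""
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m}"


def local_monthly_index(ts, tz=GMT8, prefix="something"):
    """Index name derived from the calendar month in the local timezone."""
    return f"{prefix}-{ts.astimezone(tz):%Y.%m}"


# Hypothetical first update of the month: 1 March 2020, 00:30 local time.
first_update = datetime(2020, 3, 1, 0, 30, tzinfo=GMT8)

print(monthly_index(first_update))        # something-2020.02
print(local_monthly_index(first_update))  # something-2020.03
```

Any update between 00:00 and 08:00 local time on the first day of a month behaves the same way, because its UTC equivalent is still in the previous month.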
Setting the visualization tool's timezone to UTC does solve the problem, but I would rather stay in the browser's local timezone. Is there any way to adjust my configuration to achieve the expected result?
We are using ES to provide a data usage monitoring service for clients. The idea is that the client always sees the most up-to-date data usage for the current month, which explains why we overwrite the same ID in the index. When the next month starts, a new monthly index is created, and last month's index should stop being updated and stay as it was at the last update on the last day of the month, so the client can still check last month's data usage.
In this case, however, last month's index is still being updated in the current month. That means there is no data for last month, and because last month's data is combined with the current month's data, we have duplicated data for the current month.
If you are trying to filter data by timestamp using the index name, then you have an architectural problem that fundamentally conflicts with how Elasticsearch is meant to be used. Elasticsearch lets you query data by timestamp across any set of indexes. You can ask for all of last month's records; it doesn't matter which indexes they come from. Generally it ignores records much faster than it includes them, so ignoring large numbers of records is not a huge problem.
As you are updating documents, it may make sense not to use time-based indices. This means that you need to delete data using delete-by-query rather than simply deleting indices, but if the data volumes are not huge this may be fine even though it is less efficient. To store the latest version for each calendar month, I would recommend generating the document_id as a concatenation of the iccid, year and month rather than relying on separating the data by index.
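As a rough illustration of that suggestion, here is a minimal Python sketch of the ID scheme. It is not Logstash configuration: the function name, the `-` separator, the sample iccid and the choice to take the month in GMT+8 rather than UTC are all assumptions. In Logstash the resulting string would presumably be passed to the elasticsearch output's `document_id` option.

```python
from datetime import datetime, timedelta, timezone

# GMT+8, the local timezone mentioned in the thread.
GMT8 = timezone(timedelta(hours=8))


def monthly_document_id(iccid, ts, tz=GMT8):
    """Concatenate the iccid with the year and month of ts in the given timezone.

    Updates within the same local calendar month map to the same ID and
    overwrite each other; the first update of a new month gets a new ID.
    """
    local = ts.astimezone(tz)
    return f"{iccid}-{local:%Y.%m}"


iccid = "8900000000000000001"  # hypothetical value, for illustration only

# Last update of February and first update of March, both in local time.
last_feb = datetime(2020, 2, 29, 23, 30, tzinfo=GMT8)
first_mar = datetime(2020, 3, 1, 0, 30, tzinfo=GMT8)

print(monthly_document_id(iccid, last_feb))   # 8900000000000000001-2020.02
print(monthly_document_id(iccid, first_mar))  # 8900000000000000001-2020.03
```

With IDs like these, all months can live in a single index: each iccid keeps one document per calendar month, and a time-range query picks out the month of interest.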
In fact, I realised our visualization tool (Grafana) only uses the timestamp to query the data, based on the dynamic time frame the user defines. The thing is, the monthly index helps put a stop at the last day of the month, so when the time frame covers last month, there is data to query. Now the data for the last day of the month has been pushed one day further, to the first day of the next month, due to the timezone difference between Logstash (UTC) and Grafana (GMT+8).
In this example, there is no data from 1 Feb to 29 Feb, because it was overwritten on 1 Mar.
generating the document_id as a concatenation of the iccid, year and month rather than relying on separating the data by index.
Would you please elaborate in terms of Logstash configuration?
I found out our tool is not actually using the monthly index for queries; the index only serves to stop updates from overwriting an iccid once a new month has started.
I ended up fixing the issue by changing my data source update time (using a cron job). Since I'm in the GMT+8 timezone and was updating data between 00:00 and 08:00, Logstash's UTC timestamp put those entries on the previous day, which messed up the data for the last day of the month.
Starting the data update at 09:00 solves the problem.