We are considering storing historical data in ES. Is there an established pattern or best-practice guideline for this?
Our plan is a scheduled job that pulls data at a defined interval and stores it in an index. The same document will therefore exist multiple times within a mapping. We will add a timestamp to record when each copy was loaded into ES.
In such a scenario, what should the _id be?
Also, any suggestions regarding storing historical data in ES would be helpful.
Thanks for your response.
We want to store the data in a single index and run aggregations with a date histogram on the timestamp. The main objective is to generate analytical data using aggregations, and we don't want to run aggregations over multiple indices because we are not sure how well that will perform.
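For reference, a date histogram request of the kind described above could be sketched roughly as follows. This is only an illustration: the field name `loaded_at`, the bucket name `per_interval`, and the helper `build_histogram_query` are assumptions, not anything from the actual setup. Note that older ES releases name the interval parameter `interval`, while 7.2 and later use `calendar_interval` or `fixed_interval`, so the key is left configurable.

```python
import json

def build_histogram_query(field="loaded_at", interval="month",
                          interval_key="calendar_interval"):
    """Build a search body that buckets documents by time.

    All names here are hypothetical; pass interval_key="interval"
    for ES versions older than 7.2.
    """
    return {
        "size": 0,  # return only aggregation buckets, no document hits
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": field, interval_key: interval}
            }
        },
    }

body = build_histogram_query()
print(json.dumps(body, indent=2))
```

The same body can be sent to a single index or to several indices at once; the request itself does not change.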
Since we plan to use a single index, and the same document will exist multiple times with different timestamps, we are worried about the _id.
Honestly, you won't notice any difference between having a single index and multiple ones that contain the same data.
Plus, using multiple indices makes retention management massively easier and should remove your concern about the _id.
However, to make that even less of a problem, just let ES pick the ID and then put your message ID in its own field.
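A minimal sketch of that suggestion, assuming the bulk API's newline-delimited JSON format: each action line omits `_id`, so ES generates one, and the original identifier travels in a regular field. The field names `source_id` and `loaded_at`, the index name, and the helper `bulk_lines` are made up for illustration. On older versions that still use mapping types, the action line (or the request URL) would also need a `_type`.

```python
import json

def bulk_lines(records, index, loaded_at):
    """Build a bulk request body with one index action per record.

    No "_id" is set in the action line, so ES assigns its own.
    The record's original ID is stored in the hypothetical field
    "source_id", next to the load timestamp "loaded_at".
    """
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        doc = dict(rec["data"], source_id=rec["id"], loaded_at=loaded_at)
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

records = [{"id": "msg-1", "data": {"status": "open"}}]
january = bulk_lines(records, "history", "2016-01-31 00:00:00")
february = bulk_lines(records, "history", "2016-02-29 00:00:00")
print(january + february)
```

Because neither action carries an `_id`, the February load is stored as a new document alongside the January one even though the source data did not change.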
Creating an index for every time interval may lead to index maintenance overhead in our case. For example, if we capture data at the end of every month, then by the end of the year we will have 12 indices, and every time we run our aggregation queries we will have to add the new indices.
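On the worry about adding each new index to the query: ES accepts wildcard index names in search requests, so a naming scheme with a common prefix lets one pattern cover every monthly index. The sketch below assumes a hypothetical `history-YYYY.MM` naming scheme and uses Python's `fnmatch` only to approximate how such a pattern matches names; it does not talk to a cluster.

```python
from datetime import date
from fnmatch import fnmatch

def index_for(day, prefix="history"):
    """Return the monthly index name for a date, e.g. history-2016.01."""
    return f"{prefix}-{day:%Y.%m}"

# One index per month for a year of loads.
names = [index_for(date(2016, month, 1)) for month in range(1, 13)]

# A single pattern used as the search target covers all of them,
# including indices created after the query was written.
pattern = "history-*"
matched = [name for name in names if fnmatch(name, pattern)]
print(len(matched), matched[0], matched[-1])
```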
If I let ES pick the _id within the same index and the same mapping, won't ES replace my earlier document, since it may happen that no data has changed since the earlier timestamp?
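For what it's worth, an auto-generated _id is new for every indexing request, so an unchanged document is stored again rather than replaced. If explicit IDs are preferred, one option is to combine the source ID with the load timestamp, so that each snapshot gets its own document while re-running the same load overwrites rather than duplicates. A rough sketch, with the helper name `snapshot_id` and the ID layout being assumptions, and a plain dict standing in for the index:

```python
def snapshot_id(source_id, loaded_at):
    """Compose a per-snapshot document ID (hypothetical layout)."""
    return f"{source_id}:{loaded_at}"

# A dict keyed by _id mimics how an explicit ID behaves on re-index.
index = {}
doc = {"status": "open"}

index[snapshot_id("msg-1", "2016-01-31")] = doc  # January snapshot
index[snapshot_id("msg-1", "2016-02-29")] = doc  # February, data unchanged
index[snapshot_id("msg-1", "2016-02-29")] = doc  # February job re-run

print(sorted(index))
```

Both monthly snapshots survive, and the repeated February run does not create a third copy.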
Since we are considering keeping historical data in a single index, we are not sure we need data removal at all.
Every mapping inside my index corresponds to some data in my data source. Since we load data from the data source at predefined times, there will be multiple copies of the same data inside my mapping.
I know, but since Elasticsearch no longer supports working together with MongoDB, in the end we are forced to use Elasticsearch as a database.
Storing historical data seems to work differently in MongoDB and Elasticsearch. For example, if we store the string "2016-01-03 00:00:00" in MongoDB, we can process that string as a date directly. I am afraid it won't work that way in Elasticsearch. Maybe Elasticsearch will prefer a timestamp or a date object.
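As far as the date string goes, an ES `date` field can declare a custom `format`, so a value like "2016-01-03 00:00:00" can be indexed as-is. The sketch below shows such a mapping fragment as a Python dict, with the field name `loaded_at` being an assumption, plus the equivalent epoch-milliseconds value for anyone who prefers to send a numeric timestamp. It assumes the string is in UTC, which is how ES interprets a date without a time zone.

```python
from datetime import datetime, timezone

# Mapping fragment: the field name "loaded_at" is hypothetical.
mapping = {
    "properties": {
        "loaded_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"}
    }
}

def to_epoch_millis(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return epoch milliseconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)

print(to_epoch_millis("2016-01-03 00:00:00"))
```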
My specific need is to store a history of how many Twitter followers I have on each date.
But that article didn't give a sample of the stored index they process. I want to follow a sample schema and understand it before applying any history storage in my database.
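Not the schema from that article, but one possible shape for a follower history, as a starting point: one document per account per day. Everything below is an assumption for illustration, including the field names, the `keyword` type (which needs ES 5.0 or later; older versions would use a not_analyzed string), and the made-up follower counts.

```python
# Hypothetical mapping: one document per account per day.
mapping = {
    "properties": {
        "account": {"type": "keyword"},
        "date": {"type": "date", "format": "yyyy-MM-dd"},
        "followers": {"type": "integer"},
    }
}

# Invented sample documents, in date order.
docs = [
    {"account": "example_user", "date": "2016-01-01", "followers": 1200},
    {"account": "example_user", "date": "2016-01-02", "followers": 1215},
    {"account": "example_user", "date": "2016-01-03", "followers": 1209},
]

# Day-over-day change, the kind of figure this history makes possible.
changes = [later["followers"] - earlier["followers"]
           for earlier, later in zip(docs, docs[1:])]
print(changes)
```

With the documents stored this way, a date histogram on `date` with a max or average over `followers` would give the same series per week or month.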
BTW, you should prepare Elasticsearch to be a database as well as a search engine. A lot of people are already forcing Elasticsearch to be a database, including HipChat.