Elastic Search for storing Historical data

code_blue · September 14, 2015, 1:58am

We are considering storing Historical data in ES. Is there any pattern or best practice guideline defined for this?

The way we want to do this is via a scheduled job that will pull data at a defined interval and store it in an index. So one document will exist multiple times inside a mapping. We will add a timestamp to identify when the document was loaded into ES.

In such a scenario what should the _id be defined as?

Also any suggestions regarding storing historical data in ES will be helpful.

warkolm · September 14, 2015, 2:32am

This is pretty much what ELK is built for

So;

Use time based indices, daily/weekly/monthly
Specify your mappings in advance
Look into hot+warm architecture - https://www.elastic.co/blog/hot-warm-architecture
Use Elasticsearch Curator https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html
I am sure others have recommendations

But why are you worried about what _id would be?
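
The first two suggestions above (time-based indices, mappings defined in advance) could be sketched roughly as follows. This is only an illustration: the `history` prefix, the `index_name_for` helper, and the field names `source_id` and `loaded_at` are assumptions, not anything from the thread.

```python
from datetime import datetime, timezone

# Hypothetical naming scheme: one index per month, e.g. "history-2015.09".
INDEX_PREFIX = "history"


def index_name_for(loaded_at: datetime, granularity: str = "monthly") -> str:
    """Return the time-based index name for a load timestamp."""
    formats = {
        "daily": "%Y.%m.%d",
        "monthly": "%Y.%m",
    }
    if granularity not in formats:
        raise ValueError(f"unsupported granularity: {granularity}")
    return f"{INDEX_PREFIX}-{loaded_at.strftime(formats[granularity])}"


# Mapping declared up front instead of relying on dynamic mapping.
# Field names are illustrative; "keyword" assumes Elasticsearch 5.x or later.
HISTORY_MAPPING = {
    "properties": {
        "source_id": {"type": "keyword"},
        "loaded_at": {"type": "date"},
    }
}

loaded_at = datetime(2015, 9, 14, 1, 58, tzinfo=timezone.utc)
print(index_name_for(loaded_at))           # history-2015.09
print(index_name_for(loaded_at, "daily"))  # history-2015.09.14
```

The scheduled job would compute the index name from the load time, so a new index starts automatically when the month (or day) rolls over.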

code_blue · September 14, 2015, 7:33am

Hey Mark

Thanks for your response.
We want to store data in a single index and run aggregations with a date histogram on the timestamp. The main objective is to generate analytical data using aggregations, and we don't want to run aggregations over multiple indices as we are not sure how performant that will be.

Since we want to use a single index, and the same document will exist multiple times with different timestamps, we are worried about the _id.

warkolm · September 14, 2015, 7:47am

Honestly you won't notice any difference between having a single index and multiple ones that contain the same data.
Plus it makes retention management massively easier and should remove your concern about the _id.

However, to make that even less of a problem, just let ES pick the ID and then put your message ID in its own field.
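
A minimal sketch of that idea, using the newline-delimited format of the bulk API. The helper name `bulk_lines` and the fields `source_id` and `loaded_at` are hypothetical; older Elasticsearch versions would also expect a `_type` in the action line.

```python
import json
from datetime import datetime, timezone


def bulk_lines(index: str, source_id: str, payload: dict, loaded_at: datetime) -> str:
    """Build the two NDJSON lines of one bulk "index" action.

    No "_id" is set in the action line, so Elasticsearch generates one and
    every load creates a new document rather than overwriting an old one.
    """
    action = {"index": {"_index": index}}
    document = dict(payload)
    document["source_id"] = source_id              # hypothetical field name
    document["loaded_at"] = loaded_at.isoformat()  # hypothetical field name
    return json.dumps(action) + "\n" + json.dumps(document) + "\n"


print(bulk_lines(
    "history-2015.09",
    "42",
    {"status": "open"},
    datetime(2015, 9, 14, 7, 47, tzinfo=timezone.utc),
))
```

The original identifier stays searchable through `source_id`, while the generated `_id` keeps each snapshot distinct.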

code_blue · September 14, 2015, 8:13am

Creating a separate index for every time interval may lead to index maintenance overhead in our case. For example, if we capture data at the end of every month, then by the end of the year we will have 12 indices, and every time we run our aggregation queries we will have to add the new indices.

If I let ES pick the _id in the same index and inside the same mapping, won't ES replace my earlier document, since it may happen that no data has changed since the earlier timestamp?
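
On the concern about adding each new index to every query: search requests accept wildcard index patterns, so a query can target something like `history-*` and pick up new indices without being edited. A rough sketch of such a request follows; the pattern, the helper name, and the `loaded_at` field are assumptions, and `calendar_interval` is the parameter name in recent Elasticsearch versions (older releases, including those current in 2015, used `interval`).

```python
import json
from typing import Tuple

# Hypothetical pattern matching every monthly index, e.g. "history-2015.09".
INDEX_PATTERN = "history-*"


def monthly_histogram_request(field: str = "loaded_at") -> Tuple[str, dict]:
    """Return the request path and body for a monthly date histogram."""
    path = f"/{INDEX_PATTERN}/_search"
    body = {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": field,
                    "calendar_interval": "month",
                }
            }
        },
    }
    return path, body


path, body = monthly_histogram_request()
print("GET", path)
print(json.dumps(body, indent=2))
```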

warkolm · September 14, 2015, 8:15am

So? How are you going to manage removal of old data from the index if you have a single one?

True. How often are they likely to have the same ID though?

code_blue · September 14, 2015, 10:00am

Since we are keeping historical data in a single index, we are not sure we need data removal at all.

Every mapping inside my index corresponds to some data in my data source. As we are loading data from the data source at predefined times, there will be multiple copies of the same data inside my mapping.

code_blue · September 15, 2015, 3:19am

Is there any convention for keeping the same document multiple times within an index, but with a different _id each time?

warkolm · September 15, 2015, 4:05am

Just assign your own ID.
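
One possible way to assign your own ID, sketched under the assumption that each snapshot is identified by its source ID plus the load time. The helper name `snapshot_id` and the ID layout are made up for illustration.

```python
from datetime import datetime, timezone


def snapshot_id(source_id: str, loaded_at: datetime) -> str:
    """Combine the source ID and the load time into one document ID.

    Expects a timezone-aware datetime; it is normalised to UTC first.
    """
    stamp = loaded_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{source_id}-{stamp}"


print(snapshot_id("42", datetime(2015, 9, 15, 4, 5, tzinfo=timezone.utc)))
# 42-20150915T040500Z
```

With this scheme, two loads of the same record at different times get different IDs, while re-running one load with the same timestamp overwrites its own document instead of creating a duplicate.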

vionemc · January 20, 2016, 4:00am

What he meant is an answer like in this article:
http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb

In other words: what is the recommended schema for storing historical data in ElasticSearch?

warkolm · January 20, 2016, 4:05am

ES isn't a database

It really depends what sort of historic data, is it time based, or just "old"

vionemc · January 20, 2016, 4:16am

I know, but since ElasticSearch no longer supports working together with MongoDB, in the end we are forced to use ElasticSearch as a database.

Storing historical data in MongoDB and ElasticSearch seems to be different. For example, if we store a string "2016-01-03 00:00:00" in MongoDB, we can process that string as a date directly. I am afraid it won't be that way in ElasticSearch. Maybe ElasticSearch will prefer timestamp, or a date object.
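
For the date question, a hedged sketch of two options: declare the string format explicitly on a `date` field in the mapping, or convert the string to ISO 8601 before indexing. The field name `date` and the helper `to_iso8601` are assumptions; whether a string in this format would be detected as a date automatically depends on the date detection settings, so declaring it avoids relying on that.

```python
from datetime import datetime, timezone

# Option 1: tell Elasticsearch how the date strings are formatted.
FOLLOWER_DATE_MAPPING = {
    "properties": {
        "date": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"}
    }
}


# Option 2: convert to ISO 8601 before indexing (values assumed to be UTC).
def to_iso8601(value: str) -> str:
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).isoformat()


print(to_iso8601("2016-01-03 00:00:00"))  # 2016-01-03T00:00:00+00:00
```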

My need specifically is to store a history of how many twitter followers I have for every date.

This article already answers the question of how to retrieve the history:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_looking_at_time.html

But that article didn't give a sample of the stored index which they process. I want to follow the sample schema and understand it before applying any history storage in my database.

vionemc · January 20, 2016, 4:17am

BTW, you should prepare ElasticSearch to be a database as well as a search engine. A lot of people are already forcing ElasticSearch to be a database, HipChat included.

vionemc · January 20, 2016, 4:18am

My current structure in MongoDB:

{
  "_id" : "15454221-2016", #string
  "follower_history" : {
    "2016-01-01 00:00:00" : { #date, first day of the month, UTC time
      "values" : {
        "2016-01-03 00:00:00" : 1505, #key:date, first day of the month, UTC time;value: integer
        "2016-01-07 00:00:00" : 1508,
        "2016-01-08 00:00:00" : 1508
      },
      "num_samples" : 3, #integer
      "total_follower" : 4521 #integer
    }
  }
}
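
Because the dates are used as keys in this structure, it does not translate directly into something a date histogram can work on. One possible approach, sketched here as an assumption rather than a recommended schema, is to flatten it into one small document per sample. The output field names and the reading of `_id` as `<account>-<year>` are guesses for illustration.

```python
mongo_doc = {
    "_id": "15454221-2016",
    "follower_history": {
        "2016-01-01 00:00:00": {
            "values": {
                "2016-01-03 00:00:00": 1505,
                "2016-01-07 00:00:00": 1508,
                "2016-01-08 00:00:00": 1508,
            },
            "num_samples": 3,
            "total_follower": 4521,
        }
    },
}


def flatten(doc: dict) -> list:
    """Turn the nested per-month structure into one flat row per sample."""
    account_id = doc["_id"].split("-")[0]  # assumes "<account>-<year>" IDs
    rows = []
    for month in doc["follower_history"].values():
        for date, followers in sorted(month["values"].items()):
            rows.append(
                {"account_id": account_id, "date": date, "followers": followers}
            )
    return rows


for row in flatten(mongo_doc):
    print(row)
```

Each row then carries its own date field, and derived values such as `num_samples` and `total_follower` can be computed at query time instead of being stored.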

warkolm · January 20, 2016, 4:20am

A time and a date are the same thing in ES, a timestamp, and ES will detect that.

There are two ways of doing what you want:

Have an index per day with a single counter that you update and can simply read
Have an index per day and record all "new follower" events, then run an agg on it.

A lot of people use redis as a persistent store, but that doesn't mean that is what it is for, or that it's right.
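
The two options above might look roughly like this as documents. All field names and values are invented for illustration, and the counting function is only a local stand-in for what a daily date histogram aggregation would compute on the event documents; it does not call Elasticsearch.

```python
from collections import Counter

# Option 1: one counter document per day that is updated and read directly.
daily_counter = {"date": "2016-01-03", "followers": 1505}

# Option 2: one document per "new follower" event.
events = [
    {"timestamp": "2016-01-03T09:15:00Z", "event": "new_follower"},
    {"timestamp": "2016-01-03T17:40:00Z", "event": "new_follower"},
    {"timestamp": "2016-01-07T08:05:00Z", "event": "new_follower"},
]


def events_per_day(docs: list) -> dict:
    """Count events per calendar day (the first 10 characters of the timestamp)."""
    counts = Counter(doc["timestamp"][:10] for doc in docs)
    return dict(sorted(counts.items()))


print(events_per_day(events))  # {'2016-01-03': 2, '2016-01-07': 1}
```

Option 1 keeps reads trivial but needs an update on every change; option 2 only ever appends and leaves the counting to the aggregation.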

vionemc · January 20, 2016, 5:12am

Yup, just suggesting. I think it will be like heaven if Elasticsearch is also a database.

vionemc · January 20, 2016, 5:14am

May I ask for a sample indexed json? That I can try to aggregate. Just a simple one will suffice.

warkolm · January 20, 2016, 5:15am

I don't have any, those were just some ideas.

vionemc · January 20, 2016, 5:19am

OK, thanks.

vionemc · January 20, 2016, 5:26am

Try this article on how to store the data:

It's more on the recommended data structure.

And this article on how to view the data:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_looking_at_time.html#CO196-2
Mainly using aggregations.
