Elasticsearch for storing historical data


#1

We are considering storing historical data in ES. Is there any pattern or best-practice guideline defined for this?

The way we want to do this is via a scheduled job that pulls data at a defined interval and stores it in an index. So the same document will exist multiple times inside a mapping. We will add a timestamp to identify when each copy was loaded into ES.

In such a scenario what should the _id be defined as?

Also, any suggestions regarding storing historical data in ES would be helpful.


(Mark Walkom) #2

This is pretty much what ELK is built for :slight_smile:

So;

But why are you worried about what _id would be?


#3

Hey Mark

Thanks for your response.
We want to store the data in a single index and run date histogram aggregations on the timestamp. The main objective is to generate analytical data using aggregations, and we don't want to run aggregations over multiple indices because we are not sure how well that will perform.

Since we plan to use a single index, and the same document will exist multiple times with different timestamps, we are worried about the _id.


(Mark Walkom) #4

Honestly you won't notice any difference between having a single index and multiple ones that contain the same data.
Plus it makes retention management massively easier and should remove your concern about the _id.

However, to make that even less of a problem, just let ES pick the ID and then put your message ID in its own field.
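
As a rough illustration of that suggestion, here is a minimal Python sketch that only builds the document body. The field names `source_id` and `loaded_at` and the helper `build_snapshot` are hypothetical, not anything ES defines. No `_id` is set, on the assumption that the document is indexed without one so that ES generates it.

```python
from datetime import datetime, timezone


def build_snapshot(source_id, payload, loaded_at=None):
    """Wrap a source record as a snapshot document body.

    The record's own ID goes into a regular field ("source_id", a
    hypothetical name) and the load time into "loaded_at". No _id is
    included, so the caller can index it and let ES assign one.
    """
    loaded_at = loaded_at or datetime.now(timezone.utc)
    doc = dict(payload)  # copy so the caller's record is not mutated
    doc["source_id"] = source_id
    doc["loaded_at"] = loaded_at.isoformat()
    return doc
```

Because generated IDs are unique, two loads of unchanged data would be stored as two separate documents, distinguished only by `loaded_at`.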


#5

Creating a separate index for every time interval may lead to index maintenance overhead in our case. For example, if we capture data at the end of every month, then by the end of the year we will have 12 indices, and every time we run our aggregation queries we will have to add the new indices.

If I let ES pick the _id within the same index and the same mapping, won't ES replace my earlier document, since the data may not have changed since the earlier timestamp?
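
On the concern about adding new indices to every query: search requests in ES accept wildcard index names, so a shared prefix can cover all monthly indices. A minimal sketch, assuming a hypothetical `history` prefix and a `prefix-YYYY.MM` naming scheme (both are illustrative choices, not a convention from this thread):

```python
from datetime import date


def monthly_index(prefix, day):
    """Name of the index that would hold data captured on `day`."""
    return f"{prefix}-{day:%Y.%m}"


def all_months_pattern(prefix):
    """Wildcard pattern matching every monthly index with this prefix."""
    return f"{prefix}-*"


# The scheduled job would write to the current month's index,
# while aggregation queries target the pattern and never need editing.
write_target = monthly_index("history", date(2016, 1, 31))
search_target = all_months_pattern("history")
```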


(Mark Walkom) #6

So? How are you going to manage removal of old data from the index if you have a single one?

True. How often are they likely to have the same ID though?


#7

Since we are keeping historical data in a single index, we are not sure we need data removal at all.

Every mapping inside my index corresponds to some data in my data source. Because we load data from the data source at predefined times, there will be multiple copies of the same data inside my mapping.


#8

Is there any convention for keeping the same document multiple times within an index, but with a different _id each time?


(Mark Walkom) #9

Just assign your own ID.
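
One way to read that advice, sketched in Python: derive the `_id` from the source record's ID plus the load time, so each scheduled load gets its own document. The helper name and the ID format are assumptions for illustration only.

```python
from datetime import datetime


def snapshot_id(source_id, loaded_at):
    """Build a per-load document ID from the source ID and the load time."""
    return f"{source_id}-{loaded_at:%Y%m%dT%H%M%S}"


# Two loads of the same source record get two distinct IDs.
january = snapshot_id("15454221", datetime(2016, 1, 31, 23, 0, 0))
february = snapshot_id("15454221", datetime(2016, 2, 29, 23, 0, 0))
```

A deterministic ID like this also means that re-running the same load overwrites its own snapshot instead of adding a duplicate.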


(Vionemc) #10

What he meant is an answer like the one in this article:
http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb

In other words, what is the recommended schema for storing historical data in Elasticsearch?


(Mark Walkom) #11

ES isn't a database :wink:

It really depends on what sort of historical data it is: is it time-based, or just "old"?


(Vionemc) #12

I know, but since Elasticsearch no longer supports working together with MongoDB, in the end we are forced to use Elasticsearch as a database.

Storing historical data seems to work differently in MongoDB and Elasticsearch. For example, if we store the string "2016-01-03 00:00:00" in MongoDB, we can process that string as a date directly. I am afraid it won't work that way in Elasticsearch; maybe Elasticsearch prefers a timestamp or a date object.

My specific need is to store a history of how many Twitter followers I have on each date.

This article already covers how to query the history:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_looking_at_time.html

But that article doesn't give a sample of the indexed documents it processes. I want to follow a sample schema and understand it before applying any history storage in my database.
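
On the date-string concern: a `date` field mapping in Elasticsearch can declare which formats it accepts via the `format` parameter. The fragment below is only a sketch and is not taken from the linked article; the field names `sampled_at` and `followers` are hypothetical. The helper shows the alternative of converting the string to epoch milliseconds before indexing, assuming the stored strings are in UTC.

```python
from datetime import datetime, timezone

# Hypothetical mapping fragment: accepts either the string layout used in
# MongoDB above or milliseconds since the epoch.
FOLLOWER_MAPPING = {
    "properties": {
        "sampled_at": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||epoch_millis",
        },
        "followers": {"type": "integer"},
    }
}


def to_epoch_millis(raw):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return epoch milliseconds."""
    parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
```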


(Vionemc) #13

BTW, you should prepare Elasticsearch to be a database as well as a search engine. A lot of people are already forcing Elasticsearch to be a database, including HipChat.


(Vionemc) #14

My current structure in MongoDB:

{
  "_id" : "15454221-2016",    #string
  "follower_history" : {
    "2016-01-01 00:00:00" : {    #date, first day of the month, UTC time
      "values" : {
        "2016-01-03 00:00:00" : 1505,    #key:date, first day of the month, UTC time;value: integer
        "2016-01-07 00:00:00" : 1508,
        "2016-01-08 00:00:00" : 1508
      },
      "num_samples" : 3,    #integer
      "total_follower" : 4521    #integer
    }
  }
}


(Mark Walkom) #15

A time and a date are the same thing in ES, a timestamp, and ES will detect that.

There are two ways of doing what you want:

  1. Have an index per day with a single counter that you update and can simply read
  2. Have an index per day and record all "new follower" events, then run an agg on it.

A lot of people use Redis as a persistent store; that doesn't mean that's what it is, or that it's right.
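
A sketch of what the data might look like, loosely following the first option and assuming the per-sample follower counts from the MongoDB structure earlier in the thread. It flattens them into one small document per sample, each paired with a daily index name. The `followers-YYYY.MM.DD` naming, the field names, and the helper itself are all hypothetical.

```python
from datetime import datetime


def flatten_follower_history(account_id, values):
    """Turn {"2016-01-03 00:00:00": 1505, ...} into flat per-sample documents.

    Returns a list of {"index": ..., "body": ...} dicts in date order.
    Dates are emitted as ISO 8601 strings.
    """
    docs = []
    for raw_date, count in sorted(values.items()):
        sampled_at = datetime.strptime(raw_date, "%Y-%m-%d %H:%M:%S")
        docs.append({
            "index": f"followers-{sampled_at:%Y.%m.%d}",
            "body": {
                "account_id": account_id,
                "sampled_at": sampled_at.isoformat(),
                "followers": count,
            },
        })
    return docs


samples = flatten_follower_history("15454221", {
    "2016-01-03 00:00:00": 1505,
    "2016-01-07 00:00:00": 1508,
    "2016-01-08 00:00:00": 1508,
})
```

Each `body` is a plain JSON-serializable document, so it can be indexed as-is and then aggregated on `sampled_at`.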


(Vionemc) #16

Yup, just suggesting. I think it would be like heaven if Elasticsearch were also a database.


(Vionemc) #17

May I ask for a sample indexed JSON document that I can try to aggregate? Just a simple one will suffice.


(Mark Walkom) #18

I don't have any; those were just some ideas.


(Vionemc) #19

OK, thanks.


(Vionemc) #20

Try this article on how to store the data:


It's more on the recommended data structure.

And this article on how to view the data:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_looking_at_time.html#CO196-2
It mainly uses aggregations.
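
For reference, a request body along the lines of what that guide page describes might look like the sketch below, written as a Python dict. `sampled_at` and `followers` are hypothetical field names, and `interval` is the parameter name used in the ES versions that guide covers; newer releases use `calendar_interval` instead.

```python
import json


def follower_histogram_query(interval="month"):
    """Build a search body that buckets documents by date and averages followers."""
    return {
        "size": 0,  # only the aggregation results are wanted, not the hits
        "aggs": {
            "followers_over_time": {
                "date_histogram": {
                    "field": "sampled_at",
                    "interval": interval,
                },
                "aggs": {
                    "avg_followers": {"avg": {"field": "followers"}},
                },
            }
        },
    }


# Serialized form, ready to send as the body of a search request.
body = json.dumps(follower_histogram_query("month"), indent=2)
```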