Best practice | storage & update | user data


(Piyush Goyal) #1

Hi team,

We have been using Elasticsearch in our production environment for more than 2 years now. Up until now it was more of a permanent store where updates to the data were minimal. Now we have a requirement to segment our user set based on the data already stored in ES. The user data is in one type, and the segmentation rules are based on a time-series data set which is stored in another type.

What does the ES team recommend for updating this kind of data regularly? Should we keep an array against each member ID and keep adding our segment values to it? Or, every time a new segment comes into the picture, should a new field be created to store that segment's values for all those users? Either way, both approaches would require fetching the document at the application layer and updating it. Any help would be appreciated.

Regards
Piyush


(Nik Everett) #2

I can't quite figure out what you want to do, but that is ok! It's hard to describe complex things in general ways. Anyway, the general advice for updates is to use the update API. You can use scripts or doc merging. The two most important things to remember about updates are:

  1. The analysis is redone every time. Expensive analysis makes updates more expensive.
  2. An update marks the old version of the document as deleted and indexes a new one. Deleted documents are removed during the segment merge process, so it takes some time for them to be removed from the disk.

Otherwise you should be ok with updates.
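
A minimal sketch of the two update styles mentioned above, written as plain Python that only builds the request bodies. The index name `members`, the `segments` field, and the helper names are hypothetical; the exact endpoint and script syntax depend on the Elasticsearch version, so treat this as an illustration under those assumptions and not as a reference.

```python
import json


def doc_merge_body(segment_name, segment_value):
    """Partial-document update: the given fields are merged into the stored doc."""
    return {"doc": {segment_name: segment_value}}


def scripted_append_body(segment_name, segment_value):
    """Scripted update: append an entry to a 'segments' array (hypothetical field).

    The Painless source below assumes a recent Elasticsearch version.
    """
    return {
        "script": {
            "lang": "painless",
            "source": (
                "if (ctx._source.segments == null) { ctx._source.segments = [] } "
                "ctx._source.segments.add(params.entry)"
            ),
            "params": {"entry": {"name": segment_name, "value": segment_value}},
        }
    }


def bulk_update_lines(index, values_by_member, segment_name):
    """Build a newline-delimited bulk payload: one action line plus one body per member."""
    lines = []
    for member_id, value in values_by_member.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": member_id}}))
        lines.append(json.dumps(doc_merge_body(segment_name, value)))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_update_lines("members", {"m1": "visual", "m2": "auditory"}, "learning_style"))
```

Sending the updates through the bulk API avoids fetching each document at the application layer first, since both the doc merge and the script are applied on the Elasticsearch side.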


(Piyush Goyal) #3

Thanks @nik9000 for the response. I guess I should explain the use case in a bit more detail. Think of a type "member" with a single field "memberID". The document count in this type is somewhere around 10K. Every now and then, we create a new segment, for example "learning style", calculate the value of this segment for each member, and have to store it in the same type against each member ID. Every few hours we have a new segment whose value is again calculated for each member and stored against the member ID.

What I was looking for was a storage pattern for this kind of data, where user segmentation happens. One option is to add a new field every time a new segment is introduced into the system. Another is to create an array/nested object and store the new segments as part of it. Both would definitely require an update, which as you mentioned can be done through the update API. I don't want a scenario where, after a few days, the documents fetched at the application layer to add a new segment become too large and difficult to maintain, which would eventually become expensive as well.
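
To make the two options concrete, here is a rough sketch of what a member document would look like under each layout. The helper names, the `segments` field, and the `name`/`value` keys are made up for illustration, and the mapping fragment assumes a version of Elasticsearch that has the `keyword` and `nested` field types.

```python
def add_segment_as_field(member_doc, segment_name, value):
    """Layout A: every segment becomes its own top-level field."""
    updated = dict(member_doc)
    updated[segment_name] = value
    return updated


def add_segment_as_nested(member_doc, segment_name, value):
    """Layout B: segments live in one array of {name, value} objects."""
    updated = dict(member_doc)
    # Drop any earlier value for the same segment so each name appears once.
    segments = [s for s in member_doc.get("segments", []) if s["name"] != segment_name]
    segments.append({"name": segment_name, "value": value})
    updated["segments"] = segments
    return updated


# Hypothetical mapping fragment for layout B; it stays the same as segments are added.
NESTED_MAPPING = {
    "properties": {
        "memberID": {"type": "keyword"},
        "segments": {
            "type": "nested",
            "properties": {
                "name": {"type": "keyword"},
                "value": {"type": "keyword"},
            },
        },
    }
}


if __name__ == "__main__":
    member = {"memberID": "m1"}
    print(add_segment_as_field(member, "learning_style", "visual"))
    print(add_segment_as_nested(member, "learning_style", "visual"))
```

With layout A the mapping gains one field per segment, while with layout B the mapping is fixed and only the array inside each document grows.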

The question is which one gives the best performance, and whether there is a storage pattern that ES recommends.

Cheers
Piyush

