I'm quite new to ES and would need some guidance from you, if possible.
The scenario is the following:
I have approximately 20 unique people, all identified by a separate UUID. Each person has around 2000 unique statistics, each with a corresponding value.
An example of raw data would be something like this:
id: UUID
name: John Doe
stats: ["jumps" -> 23, "farts" -> 52, "kilometersWalked" -> 33, ...]
The statistics would be updated when a certain event triggers it, at most once per hour. As of now I'm stuck on how to solve this, as in: what would be a good way to store this?
Side note: The statistic types are the same for all people, i.e. these 2000 types are pre-defined, and all values are integers.
I have one question: are the metrics cumulative, or will each document report the delta since the last update?
I think the best approach would be to have two indices:

- time-based indices, which will contain one document per metric update per user
- one entity-centric index where you can store the latest event you received from each user

The second one is not strictly necessary, but it can make some operations more convenient.
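For illustration, a minimal mapping sketch for the time-based index could look like this (index and field names are placeholders, and only three of the ~2000 metrics are shown); the entity-centric index would have the same shape, with the user UUID as the document _id:

```
PUT stats-2024-06
{
  "mappings": {
    "properties": {
      "user_id":          { "type": "keyword" },
      "timestamp":        { "type": "date" },
      "jumps":            { "type": "integer" },
      "farts":            { "type": "integer" },
      "kilometersWalked": { "type": "integer" }
    }
  }
}
```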
To get the latest stats for a given user, you can query on the user UUID and order by timestamp descending.
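A rough sketch of that query, assuming the field names above and that the time-based indices match `stats-*`:

```
GET stats-*/_search
{
  "size": 1,
  "query": { "term": { "user_id": "the-user-uuid" } },
  "sort": [ { "timestamp": { "order": "desc" } } ]
}
```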
Also, Transform Jobs might help to build up stats over time for each user (e.g. average number of steps).
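A batch pivot transform grouping by user could look roughly like this (the transform name, destination index and averaged field are only illustrative):

```
PUT _transform/user_stats_summary
{
  "source": { "index": "stats-*" },
  "dest":   { "index": "user_stats_summary" },
  "pivot": {
    "group_by": {
      "user_id": { "terms": { "field": "user_id" } }
    },
    "aggregations": {
      "avg_kilometersWalked": { "avg": { "field": "kilometersWalked" } }
    }
  }
}

POST _transform/user_stats_summary/_start
```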
My idea was for it to be cumulative. When a certain event is triggered, all statistics of said person are iterated over and updated (if changed). The plan was to do a bulk update on the existing fields.
You can even overwrite the whole document, indexing it via the bulk API (if you receive all the updates at once) and using the user UUID as _id.
But you would lose the evolution over time.
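A minimal sketch of such a bulk request (index name, UUIDs and values are placeholders); an index action that reuses an existing _id simply replaces the previous document:

```
POST _bulk
{ "index": { "_index": "user-latest-stats", "_id": "b4f1c2d3-0000-0000-0000-000000000001" } }
{ "name": "John Doe", "timestamp": "2024-06-01T10:00:00Z", "jumps": 23, "farts": 52, "kilometersWalked": 33 }
{ "index": { "_index": "user-latest-stats", "_id": "b4f1c2d3-0000-0000-0000-000000000002" } }
{ "name": "Jane Doe", "timestamp": "2024-06-01T10:00:00Z", "jumps": 40, "farts": 12, "kilometersWalked": 21 }
```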
Oh, that sounds neat. Is it better to overwrite than to update? So I'd create a new document using the path <identifier>/<uuid>, but what data types would you recommend for my case? I assume choosing object is a must, or is it possible to use something similar to tuples (for all stats on said person)?
The evolution over time is not necessary in my case, because what I aim for is to reflect the current state only.
I see, thanks a lot. I assume my next question is very subjective and hardware-dependent, but is it feasible to have 2000 entries in my stats array? I've understood that ES uses a flattened data structure, but is there a good rule of thumb for when query time starts to go through the roof?
As soon as you want to introduce a new metric, update the mapping first to specify the new field and its type, then send the data that contains it.
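Something along these lines, where the new metric name is made up:

```
PUT user-latest-stats/_mapping
{
  "properties": {
    "stairsClimbed": { "type": "integer" }
  }
}
```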
I would not try to normalize the data, especially because you will have one document per user.
To optimize, you could even switch off the storage of the _source field, but that has some consequences (for example, the update and reindex APIs stop working and you can no longer see the original JSON of a document).
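For reference, _source is switched off at index creation time, roughly like this (sketch only, with placeholder names and just two fields shown):

```
PUT user-latest-stats
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "user_id": { "type": "keyword" },
      "jumps":   { "type": "integer" }
    }
  }
}
```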
As all your keys are unique, I do not see any benefit in sending a list of key/value pairs.
Send a hashmap/dict directly.
In the end, Elasticsearch will index them as detailed in the mapping anyway, because it flattens arrays internally (unless you use the nested datatype, which is not useful in your case).
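Concretely, one document per user with the UUID as _id could look like this, with each metric simply sent as a field of the document (names and values are placeholders, only a few of the metrics shown):

```
PUT user-latest-stats/_doc/b4f1c2d3-0000-0000-0000-000000000001
{
  "name": "John Doe",
  "jumps": 23,
  "farts": 52,
  "kilometersWalked": 33
}
```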