Would an alternative approach be to denormalize and, for each MCU object, create a document with the same timestamp but with the individual property values set? That way I can aggregate on them and do filtering, but obviously that introduces duplication for the parent data.
Generally, denormalizing is the way to go, and how you denormalize depends on what kind of analysis you want to do. It would work to create multiple documents for each MCU/Stat and it is up to you how many fields you would want to duplicate from the original stat document. Internally, Elasticsearch uses compression when storing the data, and duplicated data compresses well.
A method that has worked for me before was to denormalize nested data by having an array field in my "parent" document for each field in a nested type. For example,
I had an orders.product_details field which was a nested type, and that object had product_name (string), product_aisle (string), and product_reordered (boolean), which were details for each product in the order. I indexed the data with extra fields: orders.product_names (string array), orders.product_aisles (string array), and orders.reorder_count (number). That allowed me to do some simple analysis on the data.
In my case, it was acceptable to lose the linking between some of the inner data fields. If I wanted to answer a question like, "how many times do people reorder yogurt?" I would have to calculate a field for yogurt reorders when I re-index the data.
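To make the shape concrete, a denormalized order document following that pattern might look something like this (a sketch only; the identifiers and values are made up for illustration):

```json
{
  "order_id": 1234,
  "product_details": [
    { "product_name": "yogurt", "product_aisle": "dairy",  "product_reordered": true },
    { "product_name": "bread",  "product_aisle": "bakery", "product_reordered": false }
  ],
  "product_names": ["yogurt", "bread"],
  "product_aisles": ["dairy", "bakery"],
  "reorder_count": 1
}
```

The array fields and the count can be aggregated on directly, at the cost of losing the link between, say, "yogurt" and its own reordered flag.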
My only concern with what I have at the moment, which is a complete duplication of the stats data with the MCU data on that document, is that if someone does a sum of X on the stats data it will, in this case, obviously be multiplied by 3 (because I have 3 MCUs in my nested data).
In your product example, how did you tie up the reorder count for product A, for example? If you had an array of reorder counts, shouldn't that be tied to a product name or id somehow?
UPDATE: Ah, I see you said you lost that linked data and you would have to change the data if you wanted the id linking. Guess I need to predict what analysis customers might want to do with our MCU data.
That is pretty much true; your best bet would be to talk to the customers and find out what they're interested in.
In this case, if that sum is important, then you'll want to add a field for X at a higher level that encompasses the relevant X values, and the customers can do a sum aggregation on that field. That would have to be handled at index time in the data pipeline.
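For instance, assuming the calculated field is called x and the index is called stats (both names are just placeholders for illustration), the customers' sum aggregation would look like:

```json
POST stats/_search
{
  "size": 0,
  "aggs": {
    "total_x": {
      "sum": { "field": "x" }
    }
  }
}
```

Because x is a plain top-level field written at index time, it can also be filtered and searched on, unlike a Kibana scripted field.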
I'm a bit confused on this one. If I have a doc with X = 1 that has MCU1's denormalized data, then a 2nd doc with X = 1, and a third doc with X = 1 (assuming all docs have the same timestamp, as that's what we've derived), then summing X will give 3 for that timestamp, but actually it's only 1, because we have duplicated data. Or, when denormalizing, do we try to keep one document for that timestamp?
Yes, when denormalizing, probably the easiest way to go would be to keep a single document, and add additional calculated fields. When analyzing the data, the calculated fields will be used, not the inner nested fields.
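As a sketch (every field name here is an assumption for illustration), a single document per timestamp might carry both the nested MCU data and the pre-calculated top-level fields:

```json
{
  "timestamp": "2024-01-15T10:00:00Z",
  "x": 1,
  "mcus": [
    { "mcu_id": "mcu1", "value": 10 },
    { "mcu_id": "mcu2", "value": 20 },
    { "mcu_id": "mcu3", "value": 30 }
  ],
  "mcu_value_sum": 60,
  "mcu_value_max": 30
}
```

With one document per timestamp, summing x counts each timestamp exactly once, and mcu_value_sum / mcu_value_max stand in for aggregations that would otherwise need the nested fields.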
So we move the values we are interested in to top-level properties that are created somehow (possibly by the .NET client or some other ES magic), and then create aggregated values as well? Or does Kibana aggregate over multiple fields?
I would go with creating that as a field at index time. You can sum multiple fields with a scripted field in Kibana, but you can't search or aggregate on a scripted field.
Also, if you are able to loop through all the MCUs that are related to a document before indexing, then it doesn't matter how many MCUs there are in the relationship -- you don't have to assume there will always be 3.
Yup, I was just using 3 as an example. My only concern with adding the useful properties and the sum, for example, is that I will have to add other aggregations like count, avg, min, and max, and kind of guess what users might want to do with the data.
I've been playing with POST _ingest/pipeline/_simulate, however I can't seem to create unique fields on the parent object. For example, I can use the foreach processor and the set processor, but I obviously can't create fields that are unique per iteration of the foreach statement, e.g. I can't create fields mcu1, mcu2, etc. The below probably gets executed 3 times, but I end up with 1 property.
I would probably look into using the script processor for this. If you have more questions about setting up this pipeline, opening a new Discuss thread under the Elasticsearch topic would probably get you the best help.
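As a rough sketch of that approach (assuming the nested array is called mcus; adjust the field names to your mapping), a script processor can loop over the array and build the numbered fields mcu1, mcu2, mcu3 in a single pass:

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": "for (int i = 0; i < ctx.mcus.size(); i++) { ctx['mcu' + (i + 1)] = ctx.mcus[i]; }"
        }
      }
    ]
  },
  "docs": [
    { "_source": { "mcus": [ { "id": "a" }, { "id": "b" }, { "id": "c" } ] } }
  ]
}
```

The same loop is also a natural place to compute the calculated fields discussed above (sum, min, max over the MCU values) before the document is indexed.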