Design question: index and mapping types for faceted search on billions of events


I want to do faceted search on events pushed by devices, faceted by device configuration:

I have device objects with several configuration fields; these are the fields I need to query to filter devices.

Each device pushes events containing metric fields, along with a timestamp and the device ID.

What I ultimately need to do is query and aggregate the event metrics by device config: that is, to get all device IDs that match a filter query and aggregate metrics over time for all devices matching those IDs.

This seems like a pretty standard thing to do.

Looking at the docs, I see two ways:

  1. nested fields: put the configuration fields in each event. Possible but expensive; it would also ensure an event keeps matching the config it was produced under, even if the config later changes.
  2. use the `_parent` type: each event is a child of its config document.
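To make option 2 concrete, the kind of query I have in mind looks roughly like this (a sketch only; the index/type names and fields like `config.region` and `cpu_load` are made up, and this assumes the pre-6.x `_parent` mechanism):

```json
POST /telemetry/event/_search
{
  "query": {
    "has_parent": {
      "parent_type": "device",
      "query": { "term": { "config.region": "eu-west" } }
    }
  },
  "aggs": {
    "metrics_over_time": {
      "date_histogram": { "field": "timestamp", "interval": "1h" },
      "aggs": {
        "avg_cpu": { "avg": { "field": "cpu_load" } }
      }
    }
  }
}
```

i.e. filter events by fields of the parent config document, then aggregate the event metrics over time.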

Option 2 seems more efficient, but I have questions and I see several potential problems:

The `_parent` field documentation does not make much sense to me. It says:

"The _parent.type setting can only point to a type that doesn't exist yet. This means that a type cannot become a parent type after it has been created."

I would think you need to create the parent type, then create children referencing the parent.
A) How can a field point to a type that doesn't exist yet?
Maybe I'm confused about the meaning of creating a type.
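From the examples I've seen, my current guess is that the restriction is satisfied by declaring both mappings together in a single create-index request, so the parent type never exists before the child mapping references it. Something like this (a sketch; names and fields are made up):

```json
PUT /telemetry
{
  "mappings": {
    "device": {
      "properties": {
        "config": { "type": "object" }
      }
    },
    "event": {
      "_parent": { "type": "device" },
      "properties": {
        "timestamp": { "type": "date" },
        "cpu_load":  { "type": "float" }
      }
    }
  }
}
```

But I'd like confirmation that this is what "doesn't exist yet" means.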

This leads to another question:
B) If I were to use a `_parent` field, can the mapping of the parent type change later (e.g. to add fields)?

C) Also, what happens if the parent document itself changes (i.e. is updated)?

D) When querying with `has_parent`, does the query see the parent's latest state, or does Elasticsearch internally denormalize the parent into the child at creation time?

I am also concerned about the requirements on `_parent`. The doc says:

"Parent and child documents must be indexed on the same shard. The parent ID is used as the routing value for the child, to ensure that the child is indexed on the same shard as the parent. This means that the same parent value needs to be provided when getting, deleting, or updating a child document."

That seems pretty restrictive, first for ingestion (how would I do this with Logstash, for example?), but more problematic is the fact that there will be billions of children per parent, so how could they all be indexed on the same shard?

Is this use case just not really suited to Elasticsearch? Or is there a better way to achieve this?

Thanks for your input.

I'd just flatten everything rather than trying to add relationships; they just make things complex.
Have each metric document carry all the applicable info you would need to filter on.
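In other words, copy the config fields onto every event as it is indexed. Something like this (illustrative names only):

```json
PUT /events/event/evt-1
{
  "device_id": "device-42",
  "timestamp": "2016-01-01T00:00:00Z",
  "cpu_load": 0.7,
  "config": {
    "region": "eu-west",
    "firmware": "1.2.3"
  }
}
```

Then filtering and aggregating is a single flat query (a `term` on `config.region` plus a `date_histogram`, say), with no joins, no routing constraints, and shards that scale with event volume.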


Seems like a lot of overhead (i.e. inserting hundreds of extra keys into billions of docs), but I can see how it makes life easier on the search side...
Disk is cheap, but I wonder if there is another way.

Anyway... now how would I 'merge' my docs?
I basically have a stream that saves (updates) a doc in a config index (one doc per device, holding its config).
Now I need to merge that existing doc into every incoming event.
Is this feasible directly in ES? I was thinking of pushing the latest config doc to Redis as it comes in and pulling from it on every event, but if ES has a way to do this, that may be better.

Any input on this?


You could have the device data/config in an index, then get the events and merge them in Logstash using, or potentially with a few
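A sketch of one way to do that merge in Logstash (assuming the `elasticsearch` filter plugin; the index, query field, and copied fields are made up, and exact option syntax depends on the plugin version):

```
filter {
  elasticsearch {
    hosts  => ["localhost:9200"]
    index  => "device-config"
    query  => "device_id:%{[device_id]}"
    fields => { "region" => "region", "firmware" => "firmware" }
  }
}
```

The filter looks up the device's config document for each event and copies the listed fields onto the event before it is indexed, so the events end up flattened without the application having to do the merge itself.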