Handling date-based-indexing and parent/child relationships

We are using Elasticsearch for real-time web analytics. Because we don't care about long-term data in ES, we need an easy way to delete old data. The usual solution is a date-based index, so that old indexes can be deleted every X weeks. However, this means that every time the index rolls over, we lose the ability to query parent/child relationships between documents on opposite sides of the rollover. A parent hit that happens at midnight will be in a different index than its child events.

We use parent-child relationships between session/pageview and pageview/event with the following basic relationship:

  • a Session has many Pageviews which have many Events.
  • session/pageview are mostly denormalized
  • pageview/event are completely normalized
  • we need to support querying events using the parent/child queries

We are trying to determine the best solution.

  • Because of the scale of data, we cannot denormalize hits/events. Each hit can have dozens of events. (We have a separate index of raw server logs, and it takes up 5-6 times as much disk space.)
  • Weekly or monthly indexes have the same problem, just less often
  • A single massive index is not a great solution

Right now, it seems our best solution is to store each event in the same index as its parent hit, choosing the index from the hit's timestamp rather than the event's own. Are there any other good solutions?
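To make the idea concrete, here is a minimal sketch of picking the target index from the parent hit's timestamp. The index naming scheme (`analytics-v1-YYYY.MM.DD`), the field names, and the assumption that each event record already carries its parent hit's timestamp are all hypothetical, not something our pipeline does today.

```python
from datetime import datetime, timezone


def index_for(hit_timestamp: datetime, prefix: str = "analytics", version: int = 1) -> str:
    """Return the daily index name for the UTC day on which the parent hit occurred.

    The "<prefix>-v<version>-YYYY.MM.DD" pattern is an assumed naming scheme.
    """
    day = hit_timestamp.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return f"{prefix}-v{version}-{day}"


def index_for_event(event: dict) -> str:
    """Route an event by its parent's timestamp, not by its own timestamp."""
    return index_for(event["hit_timestamp"])


# A hit just before midnight with a child event just after midnight.
event = {
    "pageview_id": "pv-123",  # hypothetical parent id field
    "hit_timestamp": datetime(2017, 9, 30, 23, 59, 50, tzinfo=timezone.utc),
    "event_timestamp": datetime(2017, 10, 1, 0, 0, 5, tzinfo=timezone.utc),
}

print(index_for_event(event))  # analytics-v1-2017.09.30
```

The event lands in the 09.30 index next to its parent, even though its own timestamp falls on 10.01. The trade-off is that an index can keep receiving writes after its day has ended, so it cannot be treated as read-only right at rollover.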


  • Our indexes are version/date based. If we push a version change, a new index is created, but backwards-incompatible version changes are rare
  • PageViews are relatively short lived: measured in minutes/hours.
  • The majority of sessions are measured in hours/days, but some can be weeks/months (users who do not close their browsers.)
  • Because it's a minority of data and we use denormalized data, we are okay losing some parent sessions.
  • We process all of our incoming data to files, then ingest with logstash to handle basic transforms like changing session_id -> _parent etc.
  • The scale of data is "small", 500m documents a month, though this will probably triple once all the kinks are worked out

And as an addendum, with the removal of _type in 6.0, how will this work with the new join syntax in 5.6/6.0? I see joins are still same-shard, but indexes can only have one _type?

The choices here are either to flatten everything, so that each event carries all the info it needs and you can drop parent/child, or to have two indices, one for the events and one for the entities, and then join within your app/code.
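As a rough sketch of the second option, the join in application code could be a two-step lookup: search the entity index first, collect the matching ids, then filter the event index by those ids. The `pageview_id` field name is an assumption for illustration, and the sample response below is hand-written in the shape of a standard search response rather than taken from a real cluster.

```python
def pageview_ids(search_response: dict) -> list:
    """Collect document ids from a search response on the entity index."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


def events_query(ids: list) -> dict:
    """Build a search body for the event index, filtered by parent pageview ids.

    "pageview_id" is a hypothetical field stored on each event document.
    """
    return {"query": {"bool": {"filter": [{"terms": {"pageview_id": ids}}]}}}


# Step 1: hand-written stand-in for the response of a pageview search.
pageview_response = {
    "hits": {
        "hits": [
            {"_id": "pv-123", "_source": {"url": "/pricing"}},
            {"_id": "pv-456", "_source": {"url": "/pricing"}},
        ]
    }
}

# Step 2: this body would be sent as a search against the event index.
body = events_query(pageview_ids(pageview_response))
print(body)
```

Because the two searches are independent, the entities and events no longer have to live in the same index or shard. For large id sets the list would likely need to be split into batches.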

Parent/child will use a special field to join on; that's all _type is anyway.
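For illustration, a join field covering session -> pageview -> event might be declared and used roughly as below. The field name `relation` is made up, and where the `properties` fragment nests inside the create-index body depends on the Elasticsearch version, so treat this as a sketch rather than a drop-in mapping.

```python
import json

# Mapping fragment: one join field describing both parent/child levels.
# The field name "relation" is hypothetical.
JOIN_PROPERTIES = {
    "properties": {
        "relation": {
            "type": "join",
            "relations": {"session": "pageview", "pageview": "event"},
        }
    }
}


def child_doc(source: dict, name: str, parent_id: str) -> dict:
    """Return a copy of the document with the join field pointing at its parent."""
    doc = dict(source)
    doc["relation"] = {"name": name, "parent": parent_id}
    return doc


def events_with_parent(pageview_query: dict) -> dict:
    """Build a has_parent search body: events whose parent pageview matches."""
    return {
        "query": {
            "has_parent": {"parent_type": "pageview", "query": pageview_query}
        }
    }


event = child_doc({"action": "click"}, "event", "pv-123")
print(json.dumps(event, sort_keys=True))
print(json.dumps(events_with_parent({"term": {"url": "/pricing"}}), sort_keys=True))
```

Since joins are still same-shard, each child document has to be indexed with a routing value that places it on the same shard as its parent (and, for events, its session).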
