Handling date-based-indexing and parent/child relationships

We are using Elasticsearch for real-time web analytics. Because we don't care about long-term data in ES, we need an easy way to delete old data. The normal solution is to use date-based indexes, so that old indexes can be deleted every X weeks. However, this means that every time the index increments, we lose the ability to query parent/child relationships between documents on one side of the increment and the other. A parent hit that happens at midnight will be in a different index than its child events.

We use parent-child relationships between session/pageview and pageview/event with the following basic relationship:

  • a Session has many Pageviews which have many Events.
  • session/pageview are mostly denormalized
  • pageview/event are completely normalized
  • we need to support querying events using the parent/child queries

We are trying to determine the best solution.

  • Because of the scale of data, we cannot denormalize hits/events. Each hit can have dozens of events. (We have a separate index of raw server logs, and it takes up 5-6 times as much disk space.)
  • Week-based or monthly indexes cause the same problem less often
  • A single massive index is not a great solution

Right now, it seems our best solution is storing the event into the index of the hit based on the hit timestamp. Are there any other good solutions?
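
To make that idea concrete, here is a minimal sketch of picking the target index from the parent hit's timestamp rather than the event's own. The index prefix, the daily naming pattern, and the field names are all hypothetical, and it assumes the hit timestamp is available when the event is ingested.

```python
from datetime import datetime, timezone

# Hypothetical naming scheme: <prefix>-YYYY.MM.DD (one index per UTC day).
INDEX_PREFIX = "analytics-v1"


def index_for(ts: datetime, prefix: str = INDEX_PREFIX) -> str:
    """Return the daily index name for a timezone-aware timestamp."""
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d}"


def index_for_event(event: dict, hit: dict) -> str:
    """Route the event to its parent hit's index, ignoring the event's own time."""
    return index_for(hit["timestamp"])


# A hit just before midnight and an event just after it.
hit = {
    "id": "pv-1",
    "timestamp": datetime(2017, 9, 30, 23, 59, 50, tzinfo=timezone.utc),
}
event = {
    "id": "ev-1",
    "parent": "pv-1",
    "timestamp": datetime(2017, 10, 1, 0, 0, 5, tzinfo=timezone.utc),
}

print(index_for_event(event, hit))        # index chosen from the hit's time
print(index_for(event["timestamp"]))      # index the event would otherwise land in
```

The trade-off is that an index can keep receiving child events after its day has ended, so it cannot be treated as closed the moment the date rolls over.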

Notes:

  • Our indexes are version/date based. If we push a version change, a new index is created, but backwards-incompatible version changes are rare.
  • PageViews are relatively short lived: measured in minutes/hours.
  • The majority of sessions are measured in hours/days, but some can be weeks/months (users who do not close their browsers.)
  • Because it's a minority of data and we use denormalized data, we are okay losing some parent sessions.
  • We process all of our incoming data to files, then ingest with Logstash, which handles basic transforms like changing session_id -> _parent etc.
  • The scale of data is "small", 500m documents a month, though this will probably triple once all the kinks are worked out

And as an addendum, with the removal of _type in 6.0, how will this work with the new join syntax in 5.6/6.0? I see joins are still same-shard, but indexes can only have one _type?

There are two choices here. One is to flatten everything, so each event has all the info it needs and you can drop parent/child entirely. The other is to have two indices, one for the events and one for the entities, and then join within your app/code.
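
As a rough illustration of the second option, the sketch below joins two result sets in application code. It assumes the pageviews and events have already been fetched from their separate indices as lists of dicts; the field names (`pageview_id`, `url`, `action`) are made up for the example.

```python
def join_events_to_pageviews(pageviews, events):
    """Attach each event to its pageview by pageview_id.

    Returns (joined, orphans): orphans are events whose parent was not
    in the fetched pageviews, e.g. because its index was already deleted.
    """
    by_id = {pv["pageview_id"]: pv for pv in pageviews}
    joined, orphans = [], []
    for ev in events:
        parent = by_id.get(ev["pageview_id"])
        if parent is None:
            orphans.append(ev)
        else:
            joined.append({"pageview": parent, "event": ev})
    return joined, orphans


# Hypothetical query results from the two indices.
pageviews = [
    {"pageview_id": "pv-1", "url": "/pricing"},
    {"pageview_id": "pv-2", "url": "/docs"},
]
events = [
    {"event_id": "ev-1", "pageview_id": "pv-1", "action": "click"},
    {"event_id": "ev-2", "pageview_id": "pv-1", "action": "scroll"},
    {"event_id": "ev-3", "pageview_id": "pv-9", "action": "click"},
]

joined, orphans = join_events_to_pageviews(pageviews, events)
print(len(joined), "joined;", len(orphans), "orphaned")
```

Since the join happens outside Elasticsearch, the two indices can roll over and be deleted on independent schedules, at the cost of an extra query and the memory needed to hold the parent lookup.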

Parent/child will have a special field to join on; that's all _type is anyway.
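
For reference, here is a sketch of what that could look like with the join datatype, written as Python dicts you would send as request bodies. This follows my reading of the 6.x documentation; the field name `relation`, the type name `doc`, and the document IDs are placeholders, so check it against the version you run.

```python
import json

JOIN_FIELD = "relation"  # placeholder name for the join field

# Index creation body: one mapping type, one join field with a
# session -> pageview -> event hierarchy.
index_body = {
    "mappings": {
        "doc": {
            "properties": {
                JOIN_FIELD: {
                    "type": "join",
                    "relations": {"session": "pageview", "pageview": "event"},
                }
            }
        }
    }
}


def root_doc(fields: dict, name: str) -> dict:
    """Body for a top-level document (no parent)."""
    return {**fields, JOIN_FIELD: {"name": name}}


def child_doc(fields: dict, name: str, parent_id: str) -> dict:
    """Body for a child document pointing at its direct parent."""
    return {**fields, JOIN_FIELD: {"name": name, "parent": parent_id}}


session = root_doc({"user_agent": "example"}, "session")
pageview = child_doc({"url": "/pricing"}, "pageview", parent_id="s-1")
event = child_doc({"action": "click"}, "event", parent_id="pv-1")

# All three must live on the same shard, so the pageview and the event
# would both be indexed with routing set to the session id ("s-1" here).
print(json.dumps(index_body, indent=2))
```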
