We are using Elasticsearch (ES) for realtime web analytics. Because we don't care about long-term data in ES, we need a way to delete old data easily. The normal solution to this is to use date-based indexes, so that old indexes are deleted every X weeks. However, this means that every time the index increments, we lose the ability to query parent/child relationships between documents on one side of the increment and documents on the other. A parent hit that happens at midnight will be in a different index than its child events.
We use parent-child relationships between session/pageview and pageview/event with the following basic relationship:
- a Session has many Pageviews which have many Events.
- session/pageview are mostly denormalized
- pageview/event are completely normalized
- we need to support querying events using the parent/child queries
We are trying to determine the best solution.
- Because of the scale of data, we cannot denormalize hits/events. Each hit can have dozens of events. (We have a separate index of raw server logs and it takes up 5-6 times as much disk space.)
- Weekly or monthly indexes have the same problem, just less often
- A single massive index is not a great solution
Right now, it seems our best solution is storing each event in the same index as its parent hit, choosing the index by the hit timestamp rather than the event timestamp. Are there any other good solutions?
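To make the idea concrete, here is a minimal Python sketch of choosing the index from the hit timestamp. The index prefix `analytics-v1`, the daily naming pattern, the function names, and the sample timestamps are all made up for illustration; they are not our real setup.

```python
from datetime import datetime, timezone

# Hypothetical version prefix; stands in for a version/date-based index name.
INDEX_PREFIX = "analytics-v1"


def daily_index(ts: datetime, prefix: str = INDEX_PREFIX) -> str:
    """Return the daily index name for a timestamp, normalized to UTC."""
    day = ts.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return f"{prefix}-{day}"


def index_for_event(event_ts: datetime, hit_ts: datetime) -> str:
    """Route an event to the index of its parent hit.

    The event's own timestamp is deliberately ignored for routing, so the
    child always lands in the same index as the parent.
    """
    return daily_index(hit_ts)


# Illustrative values: a hit just before midnight, its event just after.
hit_ts = datetime(2016, 3, 1, 23, 59, 30, tzinfo=timezone.utc)
event_ts = datetime(2016, 3, 2, 0, 0, 5, tzinfo=timezone.utc)

naive = daily_index(event_ts)                # index chosen by event time
routed = index_for_event(event_ts, hit_ts)   # index chosen by hit time
```

With the naive choice the event ends up one index after its parent hit; with the routed choice both share an index, at the cost of an index still receiving writes for as long as its hits keep producing events.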
Notes:
- Our indexes are version/date based. If we push a version change, a new index is created, but backwards-incompatible version changes are rare
- Pageviews are relatively short-lived: measured in minutes/hours.
- The majority of sessions are measured in hours/days, but some can be weeks/months (users who do not close their browsers.)
- Because it's a minority of data and we use denormalized data, we are okay losing some parent sessions.
- We process all of our incoming data to files, then ingest with Logstash to handle basic transforms like changing session_id -> _parent, etc.
- The scale of data is "small": 500m documents a month, though this will probably triple once all the kinks are worked out
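For reference, the session_id -> _parent transform amounts to a field rename. A rough Python equivalent is below; the real work is done in Logstash, and the function name and the sample record fields are hypothetical.

```python
def to_pageview_doc(record: dict) -> dict:
    """Copy a parsed pageview record and move session_id into _parent."""
    doc = dict(record)                      # leave the input record untouched
    doc["_parent"] = doc.pop("session_id")  # the parent of a pageview is its session
    return doc


# Hypothetical record as it might look after parsing one line of a file.
example = to_pageview_doc({"session_id": "s1", "url": "/"})
```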