We are using Elasticsearch for real-time web analytics. Because we don't care about long-term data in ES, we need an easy way to delete old data. The usual solution is to use date-based indexes, so that old indexes are deleted every X weeks. However, this means that every time the index rolls over, we lose the ability to query parent/child relationships between documents on opposite sides of the rollover. A parent hit that happens right at midnight can end up in a different index than its child events.
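To make the retention side concrete, here is a minimal sketch of how expired date-based indexes could be picked for deletion. The `analytics-v1-` prefix, the `YYYY-MM-DD` suffix and the `expired_indexes` helper are assumptions for illustration, not our actual naming. The returned names would then be dropped as whole indexes.

```python
from datetime import date, timedelta

def expired_indexes(index_names, today, keep_days, prefix="analytics-v1-"):
    """Return the daily indexes whose date suffix falls before the retention cutoff.

    The prefix and the YYYY-MM-DD suffix are hypothetical naming conventions.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of our date-based indexes
        try:
            day = date.fromisoformat(name[len(prefix):])
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```

Deleting a whole index this way is cheap, which is why we want to keep date-based indexes instead of deleting individual documents.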
We use parent-child relationships between session/pageview and pageview/event with the following basic relationship:
- a Session has many Pageviews which have many Events.
- session/pageview are mostly denormalized
- pageview/event are completely normalized
- we need to support querying events using the parent/child queries
We are trying to determine the best solution. Constraints and options we have already considered:
- Because of the scale of data, we cannot denormalize hits/events. Each hit can have dozens of events. (We have a separate index of raw server logs, and it takes up 5-6 times as much disk space.)
- Weekly or monthly indexes cause the same problem, just less often
- A single massive index is not a great solution
Right now, it seems our best solution is to store each event in the same index as its parent hit, choosing the index from the hit's timestamp rather than the event's own timestamp. Are there any other good solutions?
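A minimal sketch of that routing idea, assuming daily indexes named with a hypothetical `analytics-v1-` prefix and timezone-aware UTC timestamps. The function names are made up for illustration.

```python
from datetime import datetime, timezone

def daily_index(ts, prefix="analytics-v1-"):
    """Name of the daily index for a timezone-aware timestamp (hypothetical naming)."""
    return prefix + ts.astimezone(timezone.utc).strftime("%Y-%m-%d")

def index_for_event(event_ts, parent_hit_ts, prefix="analytics-v1-"):
    """Route an event to its parent hit's index.

    event_ts is deliberately ignored for routing, so parent and child
    always land in the same index even across a midnight rollover.
    """
    return daily_index(parent_hit_ts, prefix)

# A hit just before midnight and one of its events just after midnight.
hit_ts = datetime(2016, 1, 1, 23, 59, 30, tzinfo=timezone.utc)
event_ts = datetime(2016, 1, 2, 0, 0, 5, tzinfo=timezone.utc)

naive_index = daily_index(event_ts)                 # index chosen by event time
routed_index = index_for_event(event_ts, hit_ts)    # index chosen by hit time
```

With this approach an index can keep receiving events after its day has ended, for as long as its pageviews stay open. Since pageviews last minutes to hours, that tail should be short.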
- Our indexes are version/date based. If we push a version change, a new index is created, but backwards-incompatible version changes are rare.
- Pageviews are relatively short-lived: measured in minutes/hours.
- The majority of sessions are measured in hours/days, but some can be weeks/months (users who do not close their browsers).
- Because it's a minority of data and we use denormalized data, we are okay losing some parent sessions.
- We process all of our incoming data to files, then ingest with Logstash to handle basic transforms like changing
- The scale of data is "small": 500M documents a month, though this will probably triple once all the kinks are worked out.