Question about modeling data for user retention analytics on time-based events

Hi all,

We are trying to move some of our offline data analytics from the Hadoop/Hive
stack to Elasticsearch, but have run into some issues.

We have daily events. In Hive we use partitions (HDFS directories) to store
them. For instance, the HDFS directory layout of the event table looks like
this:

event/dt=20141112
event/dt=20141113

User retention tracks whether a user who produces an event (activity) on one
day also produces an event on another day. The SQL is like

SELECT count(DISTINCT l.user_id)
FROM `event-log-20141112` AS l
JOIN `event-log-20141113` AS r
ON l.user_id = r.user_id

According to the Elasticsearch documentation, we can build one index per
day, like log-20141112/event and log-20141113/event. But it seems that a join
across different indices can't be as fast as one over documents co-located
through routing. If we instead store all the events in one index, with each
type representing one day's events, there still seems to be no way to run
the user retention query.

Actually, we could collapse all the events by user id: maintain a parent
table that stores users' information, including the user id, and have each
day's event table declare the user information table as its parent. The
layout would look like

event/user
event/log-20141112
event/log-20141113
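
As a rough illustration of what collapsing by user id means here, the
following Python sketch groups events into one record per user and counts
retained users from those records. It only models the idea in memory: the
function names and the sample (user_id, day) pairs are made up for the
example and are not Elasticsearch APIs.

```python
from collections import defaultdict

def collapse_by_user(events):
    """Group (user_id, day) events into one record per user: the set of active days."""
    active_days = defaultdict(set)
    for user_id, day in events:
        active_days[user_id].add(day)
    return dict(active_days)

def retention_count(active_days, first_day, later_day):
    """Count users whose record contains both days."""
    return sum(
        1 for days in active_days.values()
        if first_day in days and later_day in days
    )

# Hypothetical sample data standing in for two daily partitions.
events = [
    ("u1", "20141112"), ("u1", "20141112"), ("u2", "20141112"),
    ("u1", "20141113"), ("u3", "20141113"),
]
per_user = collapse_by_user(events)
retained = retention_count(per_user, "20141112", "20141113")
```

Once the events are keyed by user, the retention check touches a single
record per user and no cross-day join is needed.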

All of those tables can be routed by user_id, so that they are co-located
on the same shard. A join between them would then need no data shuffling.
However, it seems that Elasticsearch currently can't run a query that joins
multiple child tables; it only supports a parent-child join, right?
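
To show why routing by user_id should co-locate the tables, here is a small
Python sketch of hash-based routing, where a document's shard is derived
from its routing key modulo the number of primary shards. The CRC32 hash and
the shard count of 5 are assumptions for illustration only; Elasticsearch
uses its own hash function internally.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards, for illustration only

def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Pick a shard from the routing key (CRC32 is a stand-in hash, not the real one)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards

def place(docs):
    """Map each (doc_type, user_id) document to a shard, routing on user_id only."""
    return {(doc_type, user_id): shard_for(user_id) for doc_type, user_id in docs}

# Hypothetical documents: one parent record and two daily events for user u1.
docs = [
    ("user", "u1"),
    ("log-20141112", "u1"),
    ("log-20141113", "u1"),
    ("log-20141112", "u2"),
]
placement = place(docs)
```

Because the shard depends only on user_id and not on the type, every
document for the same user lands on the same shard, whichever day it
belongs to.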

Can anyone help me with this? Or is there another way to solve this with
Elasticsearch?

Min
