Hi all,
We are trying to move some of our offline data analytics from the Hadoop/Hive
stack to Elasticsearch, but we have run into an issue.
We have daily events. In Hive we use partitions (HDFS directories) to store
them. For instance, the HDFS directory layout of the event table looks like
this:
event/dt=20141112
event/dt=20141113
User retention tracks whether a user who produced an event (activity) on one
day also produced an event on another day. The SQL is like:
SELECT count(*)
FROM `event-log-20141112` AS l
JOIN `event-log-20141113` AS r
ON l.user_id = r.user_id
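
To make the intent concrete, here is a minimal Python sketch of what we want
to compute: the users who appear in the events of two different days. The
function names and the sample data are hypothetical and only illustrate the
idea; they are not taken from our real pipeline.

```python
from typing import Iterable, Mapping, Set


def active_users(events: Iterable[Mapping[str, str]]) -> Set[str]:
    """Collect the distinct user ids that produced at least one event."""
    return {event["user_id"] for event in events}


def retained_users(day_one, day_two) -> Set[str]:
    """Users active on both days: the set-level equivalent of the join."""
    return active_users(day_one) & active_users(day_two)


# Hypothetical sample data standing in for two daily partitions.
events_20141112 = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u2"}]
events_20141113 = [{"user_id": "u2"}, {"user_id": "u3"}]

retained = retained_users(events_20141112, events_20141113)
print(sorted(retained))  # ['u2']
```

Note that this counts distinct users, whereas count(*) over the raw join
counts matching pairs of events.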
According to the Elasticsearch documentation, we can build one index per
day, like log-20141112/event and log-20141113/event. But it seems that a join
across different indices can't be as fast as one over data co-located through
routing. If we instead store all the events in one index, with each type
representing one day's events, there still seems to be no way to run the user
retention query.
Alternatively, we could collapse all the events by user id: maintain a parent
table that stores users' information, including the user id, and have each
day's event table declare the user table as its parent. The layout would look
like:
event/user
event/log-20141112
event/log-20141113
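
As a rough sketch of that layout, the mappings for the "event" index might be
built as below. This assumes the Elasticsearch 1.x mapping syntax, where a
child type names its parent through the `_parent` field; the helper
build_mappings and the user_id field definition are hypothetical, so please
check them against the version you run.

```python
import json


def build_mappings(days, parent_type="user"):
    """Return a mappings body: one parent type plus one child type per day."""
    mappings = {
        # Parent type holding user information (field definition is assumed).
        parent_type: {
            "properties": {
                "user_id": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
    for day in days:
        # Each daily event type declares the user type as its parent.
        mappings["log-" + day] = {"_parent": {"type": parent_type}}
    return {"mappings": mappings}


body = build_mappings(["20141112", "20141113"])
print(json.dumps(body, indent=2, sort_keys=True))
```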
All of those tables can be routed by user_id, so they will be co-located, and
a join between them needs no data shuffling. However, it seems that
Elasticsearch currently can't run a query that joins multiple child tables;
it only supports parent-child joins, right?
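
To illustrate why routing by user_id would avoid shuffling, here is a small
Python simulation. It is only a sketch: the shard count is made up, and the
MD5-based shard_for function is a stand-in for routing, not the hash
Elasticsearch really uses. The point is that every event of a given user lands
on the same shard, so each shard can compute its part of the retention count
locally and the partial counts can simply be summed.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(routing_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Stand-in for routing: stable hash of the key modulo the shard count."""
    digest = hashlib.md5(routing_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def place(docs, num_shards=NUM_SHARDS):
    """Group documents by the shard their user_id routes to."""
    shards = defaultdict(list)
    for doc in docs:
        shards[shard_for(doc["user_id"], num_shards)].append(doc)
    return shards


def local_retention(day_one, day_two, num_shards=NUM_SHARDS) -> int:
    """Intersect user ids shard by shard and sum the per-shard counts."""
    left = place(day_one, num_shards)
    right = place(day_two, num_shards)
    total = 0
    for shard in range(num_shards):
        users_left = {d["user_id"] for d in left.get(shard, [])}
        users_right = {d["user_id"] for d in right.get(shard, [])}
        total += len(users_left & users_right)
    return total


# Hypothetical sample data for two days of events.
day_one = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u2"}]
day_two = [{"user_id": "u2"}, {"user_id": "u3"}]
print(local_retention(day_one, day_two))  # 1
```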
Can anyone help me with this, or is there another way to do it in
Elasticsearch?
Min