Hey Guys,
I am seeking advice on designing a system that maintains a historical view of
a user's activities over the past year. Each user can have different
activities: email_open, email_click, item_view, add_to_cart, purchase, etc.
The kind of query I would like to run is, for example:
Find all customers who browsed item A in the past 6 months and also clicked
an email.
I would like the query to finish in a reasonable time frame (for
example, within 30 minutes to retrieve 10 million such users).
Is ES a good candidate for this kind of problem? I am thinking of creating an
index for each user, but that would mean too many indexes (millions). I also
tried indexing each activity (userid, activity_type, item_id, timestamp, etc.)
as an individual document in ES, but that requires join operations, which turn
out to be not very efficient (I am using parent-child).
Do any of you have experience designing a similar system? I think this
is a rather common problem that needs to be solved. (Of course, we could do it
in MapReduce.)
Any suggestions are appreciated.
The short answer is yes. I've used ES to store events and analyze them
in time-based chunks, and it's a very powerful tool for this type of
application. However, you will have to decide how to model your data to
get the most out of it. The first question I would ask is: why do you
require so many joins? What is the purpose of each join operation, and do
you really need to join everything up front? Can you denormalize
some of the data to get what you need in the first pass, and then drill
down afterwards? Take the user experience into account, and how your
queries and model will support that experience.
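To make the denormalization idea concrete, here is a rough sketch, not a tested production design. It assumes one document per user with the activities stored as a nested array under a field I'm calling "activities" (that field name, the helper names, and the "now-6M" window are my own choices, not something from the original post). denormalize() folds the flat events into per-user documents, build_query() builds a query body using the nested/bool/term/range query types, and matches() just mirrors the same logic locally so the idea can be checked without a cluster.

```python
from collections import defaultdict
from datetime import datetime


def denormalize(events):
    """Group flat activity events into one document per user (no joins needed)."""
    docs = defaultdict(lambda: {"activities": []})
    for e in events:
        doc = docs[e["userid"]]
        doc["userid"] = e["userid"]
        doc["activities"].append({
            "activity_type": e["activity_type"],
            "item_id": e.get("item_id"),  # None for email events
            "timestamp": e["timestamp"],
        })
    return dict(docs)


def build_query(item_id, since="now-6M"):
    """Query body: viewed `item_id` since `since` AND clicked any email.

    Assumes "activities" is mapped as a nested field (hypothetical mapping).
    """
    viewed = {"nested": {"path": "activities", "query": {"bool": {"must": [
        {"term": {"activities.activity_type": "item_view"}},
        {"term": {"activities.item_id": item_id}},
        {"range": {"activities.timestamp": {"gte": since}}},
    ]}}}}
    clicked = {"nested": {"path": "activities", "query": {
        "term": {"activities.activity_type": "email_click"}}}}
    return {"query": {"bool": {"must": [viewed, clicked]}}}


def matches(doc, item_id, since):
    """Local stand-in for the query above, for checking the logic offline."""
    acts = doc["activities"]
    viewed = any(
        a["activity_type"] == "item_view"
        and a["item_id"] == item_id
        and a["timestamp"] >= since
        for a in acts
    )
    clicked = any(a["activity_type"] == "email_click" for a in acts)
    return viewed and clicked
```

The trade-off, as far as I can tell, is that every new event means updating (and reindexing) that user's whole document, so this probably fits best when activities are loaded in batches rather than one at a time.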
(In reply to Chen Wang's post of Monday, January 12, 2015 at 4:51:55 PM UTC-8.)