Design question - relationships across indices


(Ludwig Magnusson) #1

In my application I want to index user profiles and events. Each event is
performed by a specific user at a specific time. I would like to be able to
look at a specific event and see statistics on the users that have
performed it. To make this possible I have done some initial experiments
with parent/child relationships. I index user documents and give each
document searchable attributes such as age. I then map the events to have
the user as parent, referencing the user's id. This has proved to work very
well for my querying requirements: I can extract the statistics I want in
very flexible ways, and I can also update information about users if I need
to. However, the problem is that in a parent/child relationship the parent
and child documents need to be in the same shard, which in this case seems
a bit problematic when it comes to scaling.
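For reference, the parent/child setup described above can be sketched roughly like this (the index, type, and field names are my own illustrative choices, not from the original post):

```
PUT /app
{
  "mappings": {
    "user": {
      "properties": {
        "age": { "type": "integer" }
      }
    },
    "event": {
      "_parent": { "type": "user" },
      "properties": {
        "timestamp": { "type": "date" }
      }
    }
  }
}
```

Child documents are then indexed with `?parent=<user_id>` on the URL, which is also what routes each event to the same shard as its parent user.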

I expect to receive many events, so I would like to place them in
different indices based on time, let's say one index per day, week or
month, to be able to scale out to more servers when the need arises and
to be able to archive old data by removing old indices. This, however,
seems to rule out the parent/child relationship: since the user data is
not time based, and since a user would be referenced by events in
different time-based indices, the user data would need to be stored in
its own index.
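Concretely, the time-based layout I have in mind would look something like this (index names are illustrative):

```
# one index per day; events carry the user id as a plain field
PUT /events-2013.08.01/event/1
{ "user_id": "u42", "timestamp": "2013-08-01T12:00:00Z" }

# queries can span all the daily indices with a wildcard
GET /events-*/_search

# archiving old data is then just a matter of dropping whole indices
DELETE /events-2013.07.01
```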

To sum up, the basic requirements are:

  • Being able to query the data with great flexibility
  • Scalability
  • Being able to archive old data

The solutions that come to mind are:

  1. Keep the model explained above but drop the parent/child mapping and
    put users and events in different indices: do one query to fetch all the
    matching users, then a separate query to get all events that have the
    fetched user ids. However, this does not seem very efficient (my guess),
    and it could send a lot of data across the nodes in the network, since
    one query could match perhaps 50 000 different users.
  2. Wait for this pull request
    https://github.com/elasticsearch/elasticsearch/pull/3278 to be merged
    and use that feature. But would that not be the same thing as solution 1
    in practice?
  3. Model the data in a different way. If there is a better way to do it,
    how would it be modeled?

Thanks in advance for any advice and/or feedback
/Ludwig


