We are planning to use ES to search through almost 2 billion documents
(and growing fast). Each document has one or more social interactions
associated with it. A search should cover the document data as well
as the social interactions linked to it. We would like to have
community feedback on the model we have chosen.
We want to be able to do the following: imagine one document with two
social interactions, one mentioning 'tree' and the other 'house'. A
search on 'tree AND house' should return this document.
We are unsure how to record social interactions. We came up with
this model, which works for our search requirement:
- a unique URL field
- an array of social interactions
- a social interaction consists of several text and integer fields
(See this Gist for a more complete JSON representation:
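
As a rough illustration of the model listed above (not the contents of
the Gist), the sketch below shows why a search on 'tree AND house' can
match when the two words sit in different interactions: an array of
plain objects is indexed as one multi-valued field per property. All
field names (`url`, `interactions`, `user`, `text`, `likes`) are
assumptions, and whitespace splitting is only a stand-in for real text
analysis.

```python
# Hypothetical document following the model above; every field name here
# is an illustrative assumption, not taken from the original Gist.
doc = {
    "url": "http://example.com/some-page",
    "interactions": [
        {"user": "alice", "text": "Look at this tree", "likes": 3},
        {"user": "bob", "text": "Nice house", "likes": 1},
    ],
}


def flattened_terms(document, field):
    """Collect lowercase whitespace tokens of `field` across all interactions.

    This mimics how an array of plain (non-nested) objects ends up as a
    single multi-valued field in the index.
    """
    terms = set()
    for interaction in document["interactions"]:
        terms.update(interaction[field].lower().split())
    return terms


def matches_all(document, field, query_terms):
    """Return True if every query term occurs in at least one interaction."""
    wanted = {term.lower() for term in query_terms}
    return wanted <= flattened_terms(document, field)


print(matches_all(doc, "text", ["tree", "house"]))  # prints True
```
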
The problem is appending social interactions. For every incoming
social interaction we have to do a GET request to check whether this
particular document already exists. If it does, we append the
interaction and POST the document back. If it doesn't, we create a new
record and POST it. Is this a problem in terms of overhead? We think
it is.
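
One way the GET-then-POST round trip might be avoided is the
Elasticsearch Update API with a script plus an `upsert` section, so
that the append-or-create decision happens on the server. The sketch
below only builds the request body. It assumes a recent Elasticsearch
version with Painless scripting; the field names and the
`build_append_request` helper are hypothetical, and older versions use
a different script syntax.

```python
import json


def build_append_request(url, interaction):
    """Build a JSON body for an Update API call (hypothetical helper).

    If the document exists, the script appends `interaction` to its
    `interactions` array. If it does not exist, the `upsert` section is
    indexed as the new document instead.
    """
    return {
        "script": {
            "lang": "painless",
            "source": "ctx._source.interactions.add(params.interaction)",
            "params": {"interaction": interaction},
        },
        # Initial document, used only when nothing is stored under this ID.
        "upsert": {"url": url, "interactions": [interaction]},
    }


body = build_append_request(
    "http://example.com/some-page",
    {"user": "alice", "text": "Look at this tree", "likes": 3},
)
print(json.dumps(body, indent=2))
```

The body would be sent to the update endpoint of the document, with
the document ID derived from the unique URL field (for example a hash
of it, which is again an assumption).
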
Another problem with this is that we want to have multiple processes
updating/inserting documents. If two processes want to update (or
create) the same document at the same time, one write can overwrite
the other and lead to inconsistencies. We know about the versioning
functionality of ES; should we try to harness that?
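
To make the versioning idea concrete, here is a minimal in-memory
sketch of optimistic concurrency control: a write is only accepted if
the version the writer read is still current, otherwise the writer
re-reads and retries. `VersionedStore` and `append_interaction` are
hypothetical stand-ins and not Elasticsearch APIs; they only illustrate
the read, modify, write-if-unchanged loop.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""


class VersionedStore:
    """In-memory stand-in for an index that checks versions on write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        # Version 0 with no document means "does not exist yet".
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, document, expected_version):
        current_version, _ = self._docs.get(doc_id, (0, None))
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self._docs[doc_id] = (current_version + 1, document)
        return current_version + 1


def append_interaction(store, doc_id, interaction, max_retries=5):
    """Append an interaction, retrying if another writer got there first."""
    for _ in range(max_retries):
        version, document = store.get(doc_id)
        if document is None:
            updated = {"url": doc_id, "interactions": [interaction]}
        else:
            updated = {
                **document,
                "interactions": document["interactions"] + [interaction],
            }
        try:
            return store.put(doc_id, updated, expected_version=version)
        except VersionConflict:
            continue  # someone else wrote in between: re-read and retry
    raise VersionConflict(doc_id)


store = VersionedStore()
append_interaction(store, "http://example.com/some-page", {"text": "tree"})
append_interaction(store, "http://example.com/some-page", {"text": "house"})
```
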
Another problem entirely is the potential size of a document. Imagine
a document having tens of thousands of social interactions. Would the
document size grow prohibitively large? We expect to search on users.
A user is recorded in a social interaction. The search would yield the
whole (huge) document (and possibly more documents), rather than
returning only that user's interactions. Can we do something about
this? Trim the document, for example, before returning it?
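
One possible way to trim the response, sketched under assumptions: if
`interactions` were mapped as a `nested` field, a nested query with
`inner_hits` could return only the matching interactions, while
`_source` filtering keeps the rest of the document small. The field
names and the `build_user_query` helper are hypothetical, and
`inner_hits` is only available in newer Elasticsearch versions. Note
that a nested mapping also changes how cross-interaction searches such
as 'tree AND house' have to be written.

```python
import json


def build_user_query(user):
    """Build a search body that returns only one user's interactions.

    Assumes `interactions` is mapped as a nested field and
    `interactions.user` is indexed as an exact (non-analyzed) value.
    """
    return {
        # Return only the URL from the parent document, not the whole array.
        "_source": ["url"],
        "query": {
            "nested": {
                "path": "interactions",
                "query": {"term": {"interactions.user": user}},
                # Ask for the matching nested interactions alongside each hit.
                "inner_hits": {},
            }
        },
    }


print(json.dumps(build_user_query("alice"), indent=2))
```
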
Perhaps we should choose another data model. Your help is greatly