Feedback on data model (over 1 billion documents)

Hi,
We are planning to use ES to search through almost 2 billion documents
(and growing fast). Each document has one or more social interactions
associated with it. A search should be performed on document data as
well as on the social interactions linked to it. We would like to have
community feedback on the model we have chosen.

We want to be able to do the following: imagine one document with two
social interactions, one mentioning 'tree' and the other 'house'. A
search on 'tree AND house' should yield this document.

We are unsure how to record social interactions. We came up with
this model, and it works for our search requirement:

  1. a unique URL field
  2. an array of social interactions
  3. a social interaction consists of several text and integer fields

(See this Gist for a more complete JSON representation:
https://gist.github.com/1751349 )
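
For readers who cannot open the Gist, the three-part model above might
look roughly like the sketch below. The field names (url, rank,
interactions, user, text, likes) are assumptions made for illustration,
not the actual fields from the Gist. The helper only mimics, in plain
Python, why a 'tree AND house' search matches when all interactions
live in one document.

```python
# Hypothetical nested document: one record per URL, with every social
# interaction embedded in an array. All field names are illustrative.
document = {
    "url": "http://example.com/some-page",
    "rank": 7,
    "interactions": [
        {"user": "alice", "text": "a tree in the garden", "likes": 3},
        {"user": "bob", "text": "the house next door", "likes": 1},
    ],
}


def matches_all(doc, terms):
    """True if every term occurs somewhere in the document's interactions."""
    words = set()
    for interaction in doc["interactions"]:
        words.update(interaction["text"].lower().split())
    return all(term.lower() in words for term in terms)
```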

The problem is appending social interactions. For every incoming
social interaction we have to do a GET request to check whether this
particular document already exists. If it does, we append the
interaction and POST. If it doesn't, we create a new record and POST.
Is this a problem in terms of overhead? We think it is.
Another problem is that we want to have multiple processes
updating/inserting documents. If two processes want to update (or
create) the same document, this will lead to inconsistencies. We know
of the version functionality of ES; should we try to harness that?
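
One way to picture how versioning could guard the read-modify-write
cycle is the optimistic concurrency loop sketched below. It runs
against a hypothetical in-memory store (VersionedStore and
append_interaction are made-up names), not the Elasticsearch client.
The idea is that each write carries the version it was based on and is
rejected, then retried, if another process wrote first.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """Toy stand-in for a versioned document store (not the ES API)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        """Return (version, document), or (0, None) if the id is unknown."""
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, document, expected_version):
        """Store the document only if nobody wrote since expected_version."""
        current_version, _ = self.get(doc_id)
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self._docs[doc_id] = (current_version + 1, document)
        return current_version + 1


def append_interaction(store, url, interaction, max_retries=5):
    """Create the document for url or append to it, retrying on conflict."""
    for _ in range(max_retries):
        version, doc = store.get(url)
        existing = doc["interactions"] if doc is not None else []
        new_doc = {"url": url, "interactions": existing + [interaction]}
        try:
            return store.put(url, new_doc, expected_version=version)
        except VersionConflict:
            continue  # another writer won the race; re-read and try again
    raise RuntimeError("gave up after %d conflicting writes" % max_retries)
```

This keeps concurrent writers consistent, but it does not remove the
extra read per interaction, so it speaks to the consistency concern
rather than the overhead.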

Another problem entirely is the potential size of a document. Imagine
a document having tens of thousands of social interactions. Would the
document size grow prohibitively large? We expect to search on users.
A user is recorded in a social interaction. The search would yield the
whole (huge) document (and possibly more documents), rather than
returning only that user's interactions. Can we do something about
this? Trim the document, for example, before returning it?
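
If the trimming had to happen in the application, it could be as
simple as the sketch below, which keeps only one user's interactions
from a returned document. The user and interactions field names are
assumptions. Note that this only reduces what is passed on to the
caller; the whole document still has to be fetched first.

```python
def trim_to_user(document, user):
    """Return a copy of document holding only the given user's interactions."""
    trimmed = dict(document)  # shallow copy; the original is left untouched
    trimmed["interactions"] = [
        interaction
        for interaction in document.get("interactions", [])
        if interaction.get("user") == user
    ]
    return trimmed
```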

Perhaps we should choose another data model. Your help is greatly
appreciated.

Cheers
Nitish

Flattening the data model would solve your problems, assuming you can live
with it.
You can add URL and rank as properties to each interaction and store each
interaction as a separate document. You would get multiple docs per URL in
the results, but it may be feasible to handle that in your application
code.
With this data model, you'd only do a single write to ES for each
interaction, hence you'd get much better performance.
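
A minimal sketch of the flattening described above, assuming a nested
document with url, rank and interactions fields (hypothetical names):
each interaction becomes its own document and carries a copy of the
URL and rank.

```python
def flatten(document):
    """Yield one standalone document per interaction, copying url and rank."""
    for interaction in document["interactions"]:
        flat = dict(interaction)  # copy so the nested original is unchanged
        flat["url"] = document["url"]
        flat["rank"] = document["rank"]
        yield flat
```

Each flat document can then be written on its own, without first
reading the existing record for that URL.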

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Your data model sounds a lot like a graph. You may want to look into
a graph database like Neo4j coupled with Lucene directly rather than
Elasticsearch.

Berkay's suggestion is a good one. What I would like to know more about is what type of searches will be executed, i.e., do you expect to get the URLs back, or specific user interactions?

Hi
Thanks a lot for your replies, folks!
We are already aware that flattening the data model would help us gain
significant indexing performance compared to the graph-like data model
we currently have.
There are two major problems with a flattened data model:

  1. A search would literally return thousands of documents, and most of
    them would point to the same URL (since they are social interactions
    on the same entity). Consequently, filtering out unique documents
    would consume both time and space (memory).
  2. One specific type of search query (the one we described in the
    previous post) cannot be supported with this model. If the social
    interactions mentioning "tree" and "house", respectively, are
    separate documents, then a search on "tree AND house" would not
    yield either of them. We, however, expect this search to return the
    URL, since the entity (pointed to by the URL) has both the "tree"
    and the "house" keyword in its associated interactions. Is it
    possible to perform this type of query even on a flattened data
    model using some Elasticsearch construct (that we are not aware
    of)?
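
To make the two problems concrete, here is roughly what the
application would have to do on top of a flattened index: collapse
hits to unique URLs and evaluate "tree AND house" across all
interactions of a URL. The list of dicts stands in for search hits and
the url and text field names are assumptions. It illustrates the extra
client-side work, not an Elasticsearch feature.

```python
from collections import defaultdict


def urls_matching_all(flat_docs, terms):
    """Return the URLs whose interactions, taken together, mention every term."""
    wanted = {term.lower() for term in terms}
    found_by_url = defaultdict(set)  # url -> wanted terms seen so far
    for doc in flat_docs:
        words = set(doc["text"].lower().split())
        found_by_url[doc["url"]] |= wanted & words
    return sorted(url for url, found in found_by_url.items() if found == wanted)
```

Every flattened hit for the candidate URLs has to pass through this
loop, which is where the time and memory cost comes from.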

What we expect from a search depends on the type of search. Some
searches are required to return only (unique) URLs, while others
should return URLs as well as specific user interactions.

Cheers
Nitish

Neo4j actually uses Lucene as its default backend index:
http://docs.neo4j.org/chunked/snapshot/indexing.html

@Shay: Do you also think we should take a hard look at Neo4j? Its
search aspect is not as powerful, though.
Is there any possible way to store relationships between various
documents? I even tried the new update API to append interactions to
the document as they come in, but that's also really slow. Any
suggestions?

Cheers
N.

You can store relationships between documents using the parent/child feature, but you will need to make sure that a parent and its children exist on a single shard (so they can be joined).
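
As a sketch of how the 'tree AND house' case might be expressed with parent/child, the helper below builds a request body that requires one matching child per term. The child type name "interaction" and the field "text" are assumptions, and the body follows the has_child query syntax (a type plus an inner query); verify it against the documentation for the version in use.

```python
def has_child_all_terms(child_type, field, terms):
    """Build a query matching parents that have, for each term, at least
    one child of child_type whose field contains that term."""
    clauses = [
        {"has_child": {"type": child_type, "query": {"term": {field: term}}}}
        for term in terms
    ]
    return {"query": {"bool": {"must": clauses}}}


# Example body for the thread's use case (names are hypothetical).
body = has_child_all_terms("interaction", "text", ["tree", "house"])
```

Because the join is done per shard, each child has to be indexed with a reference to its parent so that both end up on the same shard.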

I have not used Neo4j, so I can't comment. What I can say is that with highly connected data, you still need to somehow partition the data at some point (or stay with a single server).

@Shay: Thanks very much. This parent/child feature may just do the
trick for us. We've experimented a bit with it, and it seems to fit
our requirements. There are, though, a few more things we need from it,
namely:

  1. Getting all children of a parent. I suppose there is no official
    API call for that. Is it even possible to do that?
  2. While doing a parent/child search, is it possible to specify that
    the results should contain only parent documents, only child
    documents, or both?

Cheers
Nitish

@Shay: I have another issue with parent/child search. I've been trying
to get filtered search to work with the parent/child structure, but I
get an error: "Parse Failure [No parser for element [filtered]".
Here is the gist: Parent/child Filter Search · GitHub
Can you point out what's wrong with it?

Cheers
Nitish

Regarding the query, you need to wrap the filtered part in a "query" element as well.

Getting back the children for the parents will require an additional call; you can only get the parents matching the query back.
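
The Gist is not reproduced here, so the following is only a guess at the shape of the fix: the first helper wraps a filtered query in the top-level "query" element, and the second builds the body for the additional call that fetches children. It assumes each child document carries its parent's URL in a field named "url" (a hypothetical name, and one that would need to be indexed for exact matching).

```python
def filtered_search_body(query, filter_):
    """Wrap a filtered query in the top-level "query" element."""
    return {"query": {"filtered": {"query": query, "filter": filter_}}}


def children_request_body(parent_urls, url_field="url"):
    """Body for the follow-up request that fetches the children of the
    parents returned by the first search (assumes children carry the URL)."""
    return {"query": {"terms": {url_field: list(parent_urls)}}}
```

The first request returns the matching parents; their URLs are then passed to the second helper to retrieve the interactions.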
