Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges

Hey guys,
We're currently storing entities and edges in Cassandra. The entities
are JSON, and edges are directed edges with a source---type-->target.
We're using Elasticsearch for indexing and I could really use a hand with
design.

What we're doing currently is taking an entity and turning its JSON into a
document. We then create multiple copies of our document and change its
type to match the index. For instance, imagine the following use case.

bob(user) -- likes --> Duo (restaurant) ===> Document Types =
bob(user) + likes + restaurant ; bob(user) + likes

bob(user) -- likes --> Root Down (restaurant) ===> Document Types =
bob(user) + likes + restaurant ; bob(user) + likes

bob(user) -- likes --> Coconut Porter (beer) ===> Document Types =
bob(user) + likes + beer ; bob(user) + likes

When we index using this scheme we create 3 documents based on the
restaurants Duo and Root Down, and the beer Coconut Porter. We then store
each document 2x, once for its specific type, and once in the "all" bucket.

Essentially, the document becomes a node in the graph. For each incoming
directed edge, we're storing 2x documents and changing the type. This
gives us fast seeks when we search by type, but a LOT of data bloat. Would
it instead be more efficient to keep an array of incoming edges in the
document, then add it to our search terms? For instance, should we instead
have a document like this?

docId: Duo(restaurant)

edges: [ "bob(user) + likes + restaurant", "bob(user) + likes" ]

When searching where edges = "bob(user) + likes + restaurant"?

I don't know what specifying a type actually does internally: whether it is
just treated as a field, or whether it changes the routing of the response. In
a social situation millions of people can be connected to any one entity,
so we have to have a scheme that won't fall over when we get to that case.

Any help would be greatly appreciated!

Thanks,
Todd

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/84b745c3-686a-4e9e-a02a-2816f90a23a1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

So clearly I need to RTFM. I missed this in the documentation the first
time.

Will filters at this scale be fast enough?

Not sure if this helps, but I use a variant of graphs in ES called
Linked Data (JSON-LD).

By using JSON-LD, you can index something like

doc index: graph
doc type: relations
doc id: ...

{
"user" : {
"id" : "...",
"label" : "Bob",
"likes" : "restaurant:Duo"
}
}

for the statement "Bob likes restaurant Duo"

and then you can run ES queries on the field "likes", or better "user.likes",
to find the users that like a restaurant etc. By referencing the "id", it
is possible to look up another document in another index about "Bob".

Just to give an idea how you can model relations in structured ES JSON
objects.
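
A minimal sketch of the idea above, in Python, with the document and the search body built as plain dictionaries. The helper names `relation_doc` and `users_who_like` and the id "u1" are placeholders for illustration only, and the `term` query assumes `user.likes` is indexed as an exact, not-analyzed value.

```python
def relation_doc(user_id, label, target):
    """Build one relation document for the statement '<label> likes <target>'."""
    return {"user": {"id": user_id, "label": label, "likes": target}}


def users_who_like(target):
    """Build a search body matching relation docs whose user.likes equals target."""
    return {"query": {"term": {"user.likes": target}}}


# "u1" is a made-up id; the original example leaves the id elided.
doc = relation_doc("u1", "Bob", "restaurant:Duo")
body = users_who_like("restaurant:Duo")
```

Each statement becomes one small document, so adding or removing a relation is an ordinary index or delete operation on that document.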

Jörg

Hi Jorg,
Thanks for the response. I don't actually need to model the relationship
per se; it's more that I need to filter on whether a document is used in a
relationship, then search on its properties. See the example below for more
clarity.

Restaurant: => {name: "duo"}

Now, let's say I have 3 users,

George, Dave and Rod

George, Dave and Rod all "like" the restaurant Duo. These are directed
edges from the user, of type "likes", to the "duo" document. We store these
edges in Cassandra. Envision the document looking something like this.

{
name: "duo",
openTime: 9,
closeTime: 18,
_in_edges: [ "george/likes", "dave/likes", "rod/likes" ]
}

Then when searching, the user Dave would search something like this.

select * where closeTime < 16

Which we translate into a query, which is then also filtered by _in_edges
= "dave/likes".
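
As a rough sketch, the translated search could look like the following request body, built here as a Python dictionary. It assumes the `bool` query with a `filter` clause from more recent Elasticsearch releases (2014-era releases expressed the same thing with a `filtered` query), and that `_in_edges` is indexed as exact, not-analyzed values. The function name `search_for_user` is made up for illustration.

```python
def search_for_user(edge, close_before):
    """Build a search body: items reachable via `edge` that close before `close_before`."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Restrict to documents that carry this incoming edge.
                    {"term": {"_in_edges": edge}},
                    # The user's own condition: closeTime < close_before.
                    {"range": {"closeTime": {"lt": close_before}}},
                ]
            }
        }
    }


body = search_for_user("dave/likes", 16)
```

Both clauses sit in filter context, so neither contributes to scoring.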

Our goal is to only create 1 document per node in our graph (in this
example, the restaurant), then possibly use the scripting API to add and remove
elements in the _in_edges field and update the document. My only concern
around this is document size. It's not clear to me how to go about this
when we start getting millions of edges to that same target node, as our
_in_edges field could grow to be millions of elements long. At that point,
is it more efficient to de-normalize and just turn "dave/likes",
"rod/likes", and "george/likes" into document types and store multiple
copies?

Thanks,
Todd

Unfortunately, adding edges via update scripts will soon become expensive.
Note that updating is a multistep process of reading the doc, looking up the
field (often by fetching _source), and reindexing the whole(!) document
(not only the new edge), plus the version conflict management in case you
run concurrent updates. Removing an edge follows the same procedure. This is
a huge difference from graph algorithms, where it is very cheap
to add/remove edges. Script updates will work quite satisfactorily up to a
certain extent, but you are right: if you want to add millions of edges to an
ES doc one by one, this will not be efficient.

So I would suggest avoiding the overhead of updating fields by
script, and instead adding / removing relations by their "relation id", i.e.
treating relations as first-class citizen docs. Adding millions of docs to an ES
index is cheaper than a million scripted updates on a single field.
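
To make the cost argument concrete, here is a toy cost model in Python. It is not a benchmark and does not talk to Elasticsearch: it only counts how many edge entries get written under each approach, assuming each scripted update rewrites the whole document and each relation doc is written once. The function names are illustrative.

```python
def scripted_update_writes(num_edges):
    """Edge entries written when edges are appended one by one to a single doc.

    Each update reindexes the whole document, so the i-th update rewrites
    all i edges stored so far.
    """
    total = 0
    for stored in range(1, num_edges + 1):
        total += stored
    return total


def relation_doc_writes(num_edges):
    """Edge entries written when every edge is indexed as its own small doc."""
    return num_edges


for n in (1_000, 1_000_000):
    print(n, scripted_update_writes(n), relation_doc_writes(n))
```

Under these assumptions the single-document approach grows quadratically with the number of edges, while the relation-doc approach grows linearly.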

Jörg

Hey Jorg,
Thanks for your response, it's very helpful. We're taking a similar
approach now, however it is slightly different. We're using Cassandra as
our edge storage, and we won't be moving to Elasticsearch for a bit.
We're very operationally familiar with Cassandra, and Elasticsearch is a
new beast to us. We're only going to use it for secondary indexing until
we become more comfortable with it, then we're going to transition more use
cases to it.

In our current implementation, entities contained within types of edges are
indexed by target type and edge type into Elasticsearch. In the
example I gave above, we actually index 3 documents, all of which are the
same except for the type, to make seeks more efficient.

{
docId: e1-v1,
name: "duo",
openTime: 9,
closeTime: 18,
_type: "george/likes"
}

{
docId: e1-v1,
name: "duo",
openTime: 9,
closeTime: 18,
_type: "dave/likes"
}

{
docId: e1-v1,
name: "duo",
openTime: 9,
closeTime: 18,
_type: "rod/likes"
}

We then search within the type of "dave/likes" for all restaurants Dave
likes. We end up with a lot more documents this way, but we won't hit the
document size issues we've discussed. What recommendations do you
have for shard size? Right now we're just sticking with the
default 10, with a replica count of 2, and we seem to be doing well.
Ultimately we're going to change our client to point to an alias, and add
more indexes behind the alias as we expand. Most apps will never need to
be more than 10 shards, a handful will need to expand into a few indexes.

Thoughts on this implementation?

Thanks,
Todd

If you create item-centric documents (in your case, venues) and maintain an
exhaustive list of users who like that item then this can be a problem for
very popular items, e.g. movies that can be liked by large numbers of
people. The document would need constant updating.
By contrast, a user-centric document may be more manageable - each user
will have a finite list of the items they like held in their profile. There
are other benefits to a user-centric model in that you can use aggregations
to do item recommendations e.g. "people who liked item X also liked item Y".
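
A rough sketch of how the "people who liked item X also liked item Y" idea might look as a search request body. It assumes a hypothetical user-centric document whose `likes` field holds exact item ids (not analyzed text); the field name, the aggregation name `also_liked`, and the helper function are all illustrative and not taken from this thread.

```python
import json


def also_liked_request(item_id, size=10):
    """Build a search body: match users who like item_id, bucket their other likes."""
    return {
        "size": 0,  # only the aggregation buckets are wanted, not the user docs
        "query": {"term": {"likes": item_id}},
        "aggs": {
            "also_liked": {
                "terms": {
                    "field": "likes",      # assumed field name on the user document
                    "size": size,
                    "exclude": [item_id],  # item X itself would otherwise top the list
                }
            }
        },
    }


body = also_liked_request("restaurant:Duo")
print(json.dumps(body, indent=2))
```

If globally popular items swamp the buckets, a `significant_terms` aggregation on the same field may be worth trying instead of `terms`.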

To answer the question of testing the properties of the items liked by a user
(e.g. the opening times of the venues liked by dave), you have 2 options:

  1. At the point of "liking", copy the item's properties into the
    user-centric document. This may be costly to alter if opening times change
    frequently and will probably require the use of nested docs on user profiles.
  2. Index docs of the type "user-y-likes-item-x" and turn your query into a
    2-step operation - first retrieve all the items the user likes and then use
    this list in a filter of a query on item docs with the required opening
    times. The list of items liked by a single user is hopefully small/capped.
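
Option 2 could be sketched roughly as below. This is only an illustration under assumptions: the edge documents are imagined to have `user`, `relation` and `item` fields, the item docs a numeric `closeTime`, and the request bodies use the `bool`/`filter` form of the query DSL, which may differ from what your Elasticsearch version expects. The helper names and the stand-in response are made up.

```python
def liked_items_request(user_id, max_items=1000):
    """Step 1: find the "user likes item" edge documents for one user."""
    return {
        "size": max_items,     # relies on the per-user list being small/capped
        "_source": ["item"],   # only the liked item id is needed
        "query": {"bool": {"filter": [
            {"term": {"user": user_id}},
            {"term": {"relation": "likes"}},
        ]}},
    }


def item_ids_from_hits(response):
    """Pull the liked item ids out of a step-1 search response."""
    return [hit["_source"]["item"] for hit in response["hits"]["hits"]]


def items_closing_before_request(item_ids, hour):
    """Step 2: restrict item docs to the liked ids, then test their properties."""
    return {
        "query": {"bool": {"filter": [
            {"ids": {"values": list(item_ids)}},
            {"range": {"closeTime": {"lt": hour}}},
        ]}},
    }


# Stand-in for what a real step-1 search might return.
fake_response = {"hits": {"hits": [
    {"_source": {"item": "duo"}},
    {"_source": {"item": "root-down"}},
]}}

step1 = liked_items_request("dave")
step2 = items_closing_before_request(item_ids_from_hits(fake_response), 16)
```

The two bodies would be sent as separate searches, the first against the edge docs and the second against the item docs.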

On Tuesday, October 7, 2014 12:24:03 AM UTC+1, Todd Nine wrote:

Hi Jorg,
Thanks for the response. I don't actually need to model the
relationship per se; it's more that a relationship is used to filter
documents, and then we search on their properties. See the example below
for more clarity.

Restaurant: => {name: "duo"}

Now, let's say I have 3 users,

George, Dave and Rod

George, Dave and Rod all "like" the restaurant Duo. These are directed
edges from the user, of type "likes", to the "duo" document. We store these
edges in Cassandra. Envision the document looking something like this.

{
  name: "duo",
  openTime: 9,
  closeTime: 18,
  _in_edges: [ "george/likes", "dave/likes", "rod/likes" ]
}

Then when searching, the user Dave would search something like this.

select * where closeTime < 16

Which we translate into a query, which is then also filtered by _in_edges
= "dave/likes".
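
As a rough illustration of that translation, the sketch below turns a single comparison plus the caller's edge into one request body. The helper, the operator table and the `bool`/`filter` form are assumptions for the example, not our actual translator. It also assumes `_in_edges` is indexed as exact, un-analyzed values, since an analyzer would likely split "dave/likes" into separate tokens and a term filter would then not match.

```python
# Maps comparison operators in the user's query to range-query keys.
RANGE_OPS = {"<": "lt", "<=": "lte", ">": "gt", ">=": "gte"}


def edge_filtered_query(field, op, value, user, edge_type):
    """Combine a property comparison with a filter on the incoming edge."""
    return {
        "query": {"bool": {"filter": [
            {"range": {field: {RANGE_OPS[op]: value}}},
            {"term": {"_in_edges": f"{user}/{edge_type}"}},
        ]}},
    }


# "select * where closeTime < 16", issued by dave over his "likes" edges
body = edge_filtered_query("closeTime", "<", 16, "dave", "likes")
```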

Our goal is to create only 1 document per node in our graph (in this
example, the restaurant), then possibly use the scripting API to add and
remove elements in the _in_edges field and update the document. My only
concern with this is document size. It's not clear to me how this holds up
when we start getting millions of edges to the same target node, since our
_in_edges field could grow to be millions of elements long. At that point,
is it more efficient to de-normalize and just turn "dave/likes",
"rod/likes", and "george/likes" into document types and store multiple
copies?
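
For reference, a sketch of what the scripted add/remove might look like as update request bodies. The scripts are written in Painless, which is an assumption: it is the scripting language in recent Elasticsearch releases and may not be what the version discussed here supports. The helper name is made up.

```python
# Append the edge only if it is not already present; create the list if missing.
ADD_EDGE = (
    "if (ctx._source._in_edges == null) { ctx._source._in_edges = [params.edge] } "
    "else if (!ctx._source._in_edges.contains(params.edge)) "
    "{ ctx._source._in_edges.add(params.edge) }"
)

# Remove the edge if it is present.
REMOVE_EDGE = (
    "if (ctx._source._in_edges != null && ctx._source._in_edges.contains(params.edge)) "
    "{ ctx._source._in_edges.remove(ctx._source._in_edges.indexOf(params.edge)) }"
)


def edge_update_body(script_source, user, edge_type):
    """Build the body of an update request that runs one of the scripts above."""
    return {
        "script": {
            "lang": "painless",  # assumption, see the note above
            "source": script_source,
            "params": {"edge": f"{user}/{edge_type}"},
        }
    }


add_dave = edge_update_body(ADD_EDGE, "dave", "likes")
remove_rod = edge_update_body(REMOVE_EDGE, "rod", "likes")
```

A scripted update still re-indexes the whole document, so every added or removed edge rewrites the full _in_edges list. That is where the size concern bites.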

Thanks,
Todd

On Sat, Oct 4, 2014 at 2:52 AM, joerg...@gmail.com wrote:

Not sure if this helps, but I use a variant of graphs in ES. It is called
Linked Data (JSON-LD).

By using JSON-LD, you can index something like

doc index: graph
doc type: relations
doc id: ...

{
  "user" : {
    "id" : "...",
    "label" : "Bob",
    "likes" : "restaurant:Duo"
  }
}

for the statement "Bob likes restaurant Duo"

and then you can run ES queries on the field "likes" or better
"user.likes" for finding the users that like a restaurant etc. By referencing
the "id", it is possible to look up another document in another index about
"Bob".

Just to give an idea of how you can model relations in structured ES JSON
objects.
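
A small sketch of the "user.likes" lookup described above, written as a request body plus a helper that collects the matching user ids. It assumes "user.likes" is indexed as an exact value so that a term query on "restaurant:Duo" matches. The helper names and the stand-in response are illustrative only.

```python
def users_who_like(target, size=100):
    """Search body: relation documents whose user.likes equals the target."""
    return {
        "size": size,
        "_source": ["user.id", "user.label"],
        "query": {"term": {"user.likes": target}},
    }


def user_ids(response):
    """Collect the ids that could be used to look up the user documents."""
    return [hit["_source"]["user"]["id"] for hit in response["hits"]["hits"]]


# Stand-in for a real search response.
fake_response = {"hits": {"hits": [
    {"_source": {"user": {"id": "u1", "label": "Bob"}}},
]}}

body = users_who_like("restaurant:Duo")
ids = user_ids(fake_response)
```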

Jörg

On Fri, Oct 3, 2014 at 7:59 PM, Todd Nine <tn...@apigee.com> wrote:

So clearly I need to RTFM. I missed this in the documentation the first
time.

Elasticsearch Platform — Find real-time answers at scale | Elastic

Will filters at this scale be fast enough?

On Friday, October 3, 2014 11:48:40 AM UTC-6, Todd Nine wrote:

Hey guys,
We're currently storing entities and edges in Cassandra. The
entities are JSON, and edges are directed edges with a
source---type-->target. We're using Elasticsearch for indexing and I could
really use a hand with design.

What we're doing currently is taking an entity and turning its JSON
into a document. We then create multiple copies of our document and change
its type to match the index. For instance, imagine the following use case.

bob(user) -- likes -- > Duo (restaurant) ===> Document Type =
bob(user) + likes + restaurant ; bob(user) + likes

bob(user) -- likes -> Root Down (restaurant) ===> Document Type =
bob(user) + likes+ restaurant ; bob(user) + likes

bob(user) -- likes --> Coconut Porter (beer). ===> Document Types =
bob(user) + likes + beer; bob(user) + likes

When we index using this scheme we create 3 documents based on the
restaurants Duo and Root Down, and the beer Coconut Porter. We then store
each document 2x, once for its specific type, and once in the "all" bucket.

Essentially, the document becomes a node in the graph. For each
incoming directed edge, we're storing 2x documents and changing the type.
This gives us fast seeks when we search by type, but a LOT of data bloat.
Would it instead be more efficient to keep an array of incoming edges in
the document, then add it to our search terms? For instance, should we
instead have a document like this?

docId: Duo(restaurant)

edges: [ "bob(user) + likes + restaurant", "bob(user) + likes" ]

When searching where edges = "bob(user) + likes + restaurant"?

I don't know what specifying type actually does internally: does it just
treat it as a field, or does it change the routing of the response? In
a social situation millions of people can be connected to any one entity,
so we have to have a scheme that won't fall over when we get to that case.

Any help would be greatly appreciated!

Thanks,
Todd
