Opinions on ES with highly related data

Hi guys,

I've just discovered the potential of ES as a scalable multi-purpose cache
or even as the sole data store. So far I've been using an RDBMS with
memcached or Redis (for simple queries in the application layer). I've
decided to give ES a try by building a prototype, but before I dive in I'd
much appreciate your opinions on how I plan to get the data out of ES.

The issue might be that my data is highly related and I need to work mainly
with large structures. ES's main task in this regard would be to support a
server process that collects all data items belonging to those large
structures. The items are then sent to a rich client, where the actual
structured views are built.

Here's some example data:

root = {"id": 1,
        "name": "Plane",
        "subassemblies": [2, 3, 4]}

body = {"id": 2,
        "name": "Body",
        "subassemblies": [5, 6]}

left_wing = {"id": 3,
             "name": "Wing",
             "subassemblies": []}

right_wing = {"id": 4,
              "name": "Wing",
              "subassemblies": []}

upper_body_structure = {"id": 5,
                        "name": "Upper Body",
                        "subassemblies": []}

lower_body_structure = {"id": 6,
                        "name": "Lower Body",
                        "subassemblies": []}
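
To make that concrete, my plan is to index each item as its own ES document,
keyed by its id. A rough sketch of what I have in mind (assuming the
elasticsearch-py client and an index/type name I just made up; the exact
parameters surely depend on the ES version):

from elasticsearch import Elasticsearch  # assumption: official Python client

es = Elasticsearch()

items = [root, body, left_wing, right_wing,
         upper_body_structure, lower_body_structure]

for item in items:
    # one document per item, addressed by its id so exact lookups stay cheap
    es.index(index="assemblies", doc_type="item", id=item["id"], body=item)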

So, I would query ES iteratively to get all items, starting with the root
item. Roughly like this in Python pseudocode:

all_item_ids = []
current_root_id = 1
all_item_ids.append(current_root_id)
current_item_ids = [current_root_id]

while len(current_item_ids) > 0:
    # fetch the "subassemblies" field for the current level of items;
    # here would come some more advanced query options
    current_item_ids = query_ES_for_items_by_given_ids_and_return_given_field(
        current_item_ids, "subassemblies")
    all_item_ids.extend(current_item_ids)

# there's a client cache for the item data, so I send the ids only
send_ids_to_client(all_item_ids)
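
For reference, here's a minimal sketch of how I imagine that helper: an
exact lookup by document id via mget ("assemblies" and "item" are just the
made-up names from the indexing sketch above, nothing official):

def query_ES_for_items_by_given_ids_and_return_given_field(item_ids, field):
    # es is the Elasticsearch() client from the indexing sketch above
    response = es.mget(index="assemblies", doc_type="item",
                       body={"ids": item_ids})
    child_ids = []
    for doc in response["docs"]:
        if doc.get("found"):
            # collect the ids listed in the requested field, e.g. "subassemblies"
            child_ids.extend(doc["_source"].get(field, []))
    return child_ids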

The amount of data is quite large: up to 100,000 rows with up to 50 levels.
So I could end up with queries carrying 10,000 arguments (though only exact
matches need to be considered). Those could be split up into batches (a
rough sketch is below), but that's where I hope to get your opinions.
(Hitting ES 50 times wouldn't make me nervous, but a couple of thousand
round trips seems wrong. Then again, if it only took a couple of seconds
overall, I wouldn't complain :-)).
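
If batching turns out to be necessary, I'd expect something as simple as
this to do (the batch size of 1000 is an arbitrary guess on my part):

def batched(ids, batch_size=1000):
    # split a long id list into chunks so no single request carries 10,000 arguments
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

# usage inside the while loop above, one level at a time
child_ids = []
for batch in batched(current_item_ids):
    child_ids.extend(
        query_ES_for_items_by_given_ids_and_return_given_field(batch, "subassemblies"))
current_item_ids = child_ids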

Is this the right approach to handle large structures? Do you see any
general showstoppers or flaws? (Like limits in query-size ...)

Another question is about storing Thrift- or Protocol Buffers-encoded data.
How would you store it for simple get/mget operations? (Those formats are
used for transport and in the client cache, which is basically a key-value
store.)
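
My own vague idea, and this is only a guess, is to base64-encode the blobs
and keep them in ES's binary field type, which is not analyzed and simply
comes back in _source on get/mget. Roughly (field and index names made up,
client calls assumed from elasticsearch-py, version-dependent):

import base64

# mapping sketch: "payload" holds the base64-encoded Thrift/Protobuf blob
es.indices.put_mapping(index="assemblies", doc_type="item", body={
    "item": {
        "properties": {
            "payload": {"type": "binary"}
        }
    }
})

encoded_blob = b"..."  # the serialized Thrift/Protobuf bytes for this item

# attach the encoded blob when indexing an item
es.index(index="assemblies", doc_type="item", id=2,
         body={"name": "Body", "subassemblies": [5, 6],
               "payload": base64.b64encode(encoded_blob).decode("ascii")})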

On top of that I would use full-text search and general combined searches
across the whole data set. But I have no doubt that ES is the right choice
there. So, if I were able to retrieve the structured data in a performant
way, ES would be an awesome, powerful all-in-one solution.

Cheers and thanks for any comments and opinions,

Jan


ES is for storing documents. It appears that what you are trying to do is
store a hierarchy the way you would represent it in an RDBMS. Try to build
the document that you want to retrieve, because that is what you will get
back: one or more documents. I doubt you ever want a query to return just

body = {"id": 2,
        "name": "Body",
        "subassemblies": [5, 6]}
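
If the client always needs the whole plane, consider indexing it as one
nested document, roughly like this (just a sketch built from your sample
data):

plane = {
    "id": 1,
    "name": "Plane",
    "subassemblies": [
        {"id": 2, "name": "Body",
         "subassemblies": [
             {"id": 5, "name": "Upper Body", "subassemblies": []},
             {"id": 6, "name": "Lower Body", "subassemblies": []}]},
        {"id": 3, "name": "Wing", "subassemblies": []},
        {"id": 4, "name": "Wing", "subassemblies": []}]}

Then a single get by id returns the entire structure in one round trip
instead of one query per level.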

-Paul
