Hi,
I've tested ES for a while. It meets almost all my needs.
However, I have a particular set of JSON documents that form almost a
tree but in fact a DAG (directed acyclic graph), which means one child may
have more than one parent, though only a few (and of course one parent may have
many children). The "nodes" are all of the same type, and there are
billions of them, while the roots of the tree (DAG) are few (around
100).
- I already understand that "_parent" is not a suitable solution, since
it allows only one parent per child.
- I cannot use embedded documents (children nested within the parent document):
since it is a tree, doing this would lead to the root containing "all"
elements (billions), which is obviously not satisfactory.
ES is really impressive for complex requests (involving almost all nodes),
with a real speed-up compared to the same kind of queries on
the primary data backend I use, especially when the request searches across
the whole tree while almost ignoring the tree structure; think of
it as a "google" search.
(The data backend is MongoDB, but since I want to use ES for the complex
queries, I am trying to find the way to get the best performance with ES.)
However, in some cases I have to go from the root to the leaf, step by
step, because a query applies at each step of the depth traversal. This is really
different from a "google" search, since it is a step-by-step query.
Let's say each node has some fields like "name", "property", "number".
When sitting on one node, going to the next one (a child) amounts to asking:
among my children, which ones satisfy the query on "name",
"property", "number"?
In fact, the most efficient query is done in reverse:
query = which nodes satisfy the query on "name", "property",
"number"; filter = children having this_parent in the "parents" list field
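To make the "reverse" form concrete, here is a minimal sketch of the search body for one step, written as a Python dict. It assumes the ES 0.90-era "filtered" query (newer versions would use a "bool" query with a "filter" clause) and assumes "parents" is indexed as a plain list of parent ids; build_step_query is a hypothetical helper name, not part of any client library.

```python
def build_step_query(parent_id, name, prop, number):
    """Build the search body for one step of the descent.

    Assumptions: ES 0.90-era `filtered` query; `parents` is a list field
    holding the ids of the node's parents.
    """
    return {
        "query": {
            "filtered": {
                # Scored part: which nodes match the per-step criteria?
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"name": name}},
                            {"match": {"property": prop}},
                            {"match": {"number": number}},
                        ]
                    }
                },
                # Unscored part: keep only children of the current node.
                "filter": {"term": {"parents": parent_id}},
            }
        }
    }
```

The resulting dict would be sent as the request body of a search call; one such body is built per level of the descent.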
With 1 million nodes for testing and 6 levels (the depth of the tree is 6),
it took 40ms, while doing the same on the data backend with "simple"
requests only (exact-match equality on the data backend) takes 9ms (with exactly
the same request logic, so 6 requests each). Note also that on the
backend there is no index on the queried fields, except the "parents" field.
When increasing the number of concurrent clients (from 1 to 32 in stress
mode, meaning no wait between 2 queries), the time on ES goes up to 100ms,
while on the data backend it goes up to 12ms.
So I kept thinking about how to deal with this.
I even tried to split the search into two alternating parts:
a) at one step, get the list of children,
b) use "idsFilter" on those ids to find the right ones with the "field
query",
a) then get the list of children again, as in (a), before continuing
to (b),
and repeat this 5 times (5 levels, since the first level is done at once on
the root directly, getting the children list).
The time was even worse (50 to 250 ms).
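The level-by-level descent itself (one request per level, each restricted to the children of the previous level's hits) can be sketched independently of any backend. This is only an in-memory illustration of the logic; the node layout and the name descend are assumptions made for the example, and each loop iteration stands in for one search request.

```python
def descend(nodes, root_ids, criteria_per_level):
    """Walk down level by level; return the ids matching at the last level.

    nodes: dict id -> {"name": ..., "parents": [parent ids]}
    criteria_per_level: one predicate per level below the roots.
    """
    current = set(root_ids)
    for matches in criteria_per_level:
        # One "request" per level: nodes that satisfy the predicate
        # AND have at least one parent among the previous level's hits.
        current = {
            node_id
            for node_id, node in nodes.items()
            if matches(node) and current.intersection(node["parents"])
        }
        if not current:
            break  # dead end: nothing deeper can match
    return current


nodes = {
    "r": {"name": "root", "parents": []},
    "a": {"name": "alpha", "parents": ["r"]},
    "b": {"name": "beta", "parents": ["r"]},
    "c": {"name": "gamma", "parents": ["a", "b"]},  # two parents: a DAG
}
hits = descend(
    nodes,
    ["r"],
    [lambda n: n["name"] == "alpha", lambda n: n["name"] == "gamma"],
)
```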
Then I realized that what I would like to do is a request with an inner
request. Let me show the idea:
Note: the parents field is: parents : { id1 : depth1, id2 : depth2, ... },
which means that for the current node, id1 is a parent at distance
depth1.
Level 1: query { match : { name : value1 } } / filter { isroot : 1 }
Level 2: query { match : { name : value2 } } / filter { bool : { should : [
{ id1 : 1 }, { id2 : 1 }, ... ] } } where the idX are the ids of the parents
found at Level 1
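As a sketch of how the Level 2 filter could be generated from the Level 1 hits, the helper below ORs together one term filter per candidate parent. It assumes that "parents" is an object field mapping parent id to depth, so that each parent is addressable as the sub-field "parents.<id>"; build_parent_filter is a hypothetical name used only for this example.

```python
def build_parent_filter(parent_ids, depth=1):
    """OR together one term filter per candidate parent id.

    Assumption: `parents` is an object field {parent_id: depth}, so each
    parent id can be addressed as the sub-field `parents.<id>`.
    """
    return {
        "bool": {
            "should": [
                {"term": {"parents.%s" % pid: depth}} for pid in parent_ids
            ]
        }
    }
```

With depth=1 this selects direct children only; a larger depth would select descendants at that distance, which is what the depth values stored in "parents" make possible.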
Note: maybe there is a better way to express this filter, with
something like "multi_match" applied as a filter, because as a query it
would be expressed as:
multi_match : { query : 1, fields : [ list of ids from parent ] }, which is
much shorter.
*Question 1)* Would it be possible to have a filter that expresses
the same thing as multi_match, but as a filter (named multi_match_filter?)?
So let's say we now use a query and not a filter:
Level 1: query { match : { name : value1, isroot : 1 } }
Level 2: query { bool : { must : [ { match : { name : value2 } }, { multi_match
: { query : 1, fields : [ list of ids from parent L1 ] } } ] } }
Would it be possible to express the same in one step, as
follows?
query { bool : { must : [ { match : { name : value6 } }, { multi_match : { query
: 1, fields : [ subquery { results : _uid, match : { name : value1,
isroot : 1 } } ] } } ] } }
The idea is to introduce something named subquery: an inner query
(same rules as a standard query) but with one added mandatory item,
"results", which states which field to retrieve for the outer request
(here the "_uid").
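To make the intended semantics explicit, here is a minimal client-side emulation of one level of the proposed subquery, using two round trips. Everything in it is hypothetical: "subquery" does not exist in ES, and search stands for any caller-supplied function that takes a query body and returns a list of hit dicts.

```python
def run_with_subquery(search, inner_query, result_field, build_outer):
    """Emulate the proposed `subquery` with two round trips.

    search:       hypothetical callable, query body -> list of hit dicts
    inner_query:  body of the inner request
    result_field: hit field handed to the outer request (e.g. "_uid")
    build_outer:  callable, list of inner values -> outer query body
    """
    # First round trip: run the inner query and keep only the "results" field.
    inner_values = [hit[result_field] for hit in search(inner_query)]
    if not inner_values:
        return []  # no parent matched, so no child can match
    # Second round trip: the outer query built from the inner values.
    return search(build_outer(inner_values))
```

The feature asked for above would do the same thing inside ES, saving the intermediate round trip; deeper levels would nest the same pattern.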
*Question 2)* Do you think it might be a good idea as a new feature?
I saw at:
https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/DE9bKaCRgGw
that people suggest implementing this through a plugin. However, when I
looked into the plugin architecture, I really did not see how to do it.
*Question 3)* Do you have any suggestions?
Best regards,
Frederic