Inner query within ES query through plugin?

Hi,

I have been testing ES for a while, and it meets almost all my needs.
However, my data is a set of JSON documents that forms almost a tree but is
in fact a DAG (directed acyclic graph): a child may have more than one
parent, though only a few, and of course a parent may have many children.
The nodes are all of the same type. There are billions of them, while the
roots of the tree (DAG) are few (around 100).

  1. I already understand that "_parent" is not a suitable solution, since
    it can only represent 1 parent per child.

  2. I cannot use embedded documents (children stored inside the parent
    document): since the data is a tree, the root would end up containing
    all elements (billions), which is obviously not acceptable.

ES is really impressive for complex requests (involving almost all nodes),
with a real improvement in response time compared to the same kind of
queries on my primary data backend, especially when the request searches
across the whole tree without really considering the tree structure; think
of it as a "google" search.
(The data backend is MongoDB, but since I want to use ES for the complex
queries, I am trying to find the way to get the best performance from ES.)

However, in some cases I have to go from the root to a leaf step by step,
because a query applies at each step of the depth traversal. This is very
different from the "google" search.

Let's say each node has some fields like "name", "property", "number". When
sitting on one node, going to the next one (a child) amounts to asking:
among my children, which ones satisfy the query on "name", "property",
"number"?

In fact, the most efficient form is the reverse:
query = which nodes satisfy the query on "name", "property", "number";
filter = nodes having this_parent in their "parents" list field.

With 1 million nodes for testing and 6 levels (the depth of the tree is 6),
the traversal took 40 ms on ES, while the same traversal on the data
backend, using "simple" requests only (equality matches), took 9 ms (with
exactly the same request logic, so 6 requests each). Note also that on the
backend there is no index on the queried fields, only on the "parents"
field. When increasing the number of concurrent clients (from 1 to 32 in
stress mode, meaning no wait between 2 queries), the time goes up to 100 ms
on ES, while on the data backend it goes up to 12 ms.

So I kept thinking about how to deal with this.
I even tried to split the search into two alternating parts:
a) at one step, get the list of children,
b) use "idsFilter" on those ids to find the right one with the field
query,
a) then get the list of children again, as in (a), before continuing
to (b),
and repeat this 5 times (5 levels, since the first level is done at once on
the root directly, getting the children list).

The time was even worse (50 to 250 ms).

Then I realized that what I would like to do is a request with an inner
request. Let me show the idea:

Note: the parents field has the form parents : { id1 : depth1, id2 : depth2,
... }, meaning that for the current node, id1 is a parent (ancestor) at
distance depth1.
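
To make the layout concrete, here is a minimal in-memory sketch of one descent step in the reversed form described earlier (field conditions as the "query", the parent as the "filter"). It only models the logic and does not talk to ES; the toy `nodes` data, the node ids and the helper name `step` are assumptions made for illustration.

```python
# Toy nodes: "parents" maps each ancestor id to its distance (1 = direct parent).
nodes = {
    "root": {"name": "a", "property": "p", "number": 1, "parents": {}},
    "n1": {"name": "b", "property": "p", "number": 2, "parents": {"root": 1}},
    "n2": {"name": "b", "property": "q", "number": 3, "parents": {"root": 1}},
    # A DAG node: two direct parents, and the root at distance 2.
    "n3": {"name": "c", "property": "p", "number": 4,
           "parents": {"n1": 1, "n2": 1, "root": 2}},
}


def step(nodes, parent_id, **conditions):
    """Ids of direct children of parent_id whose fields equal all conditions."""
    return sorted(
        node_id
        for node_id, node in nodes.items()
        # "query" part: field conditions on the candidate node itself
        if all(node.get(field) == value for field, value in conditions.items())
        # "filter" part: parent_id must be listed at distance 1
        and node["parents"].get(parent_id) == 1
    )


print(step(nodes, "root", name="b"))                # ['n1', 'n2']
print(step(nodes, "root", name="b", property="q"))  # ['n2']
print(step(nodes, "n2", name="c"))                  # ['n3']
```

A full root-to-leaf traversal is then one such step per level, each step using the ids returned by the previous one.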

Level 1: query { match : { name : value1 } } / filter { isroot : 1 }
Level 2: query { match : { name : value2 } } / filter { bool : { should : [
{ id1 : 1 }, { id2 : 1 }, ... ] } } where idx are the ids of the parents
found at Level 1

Note: maybe there is a better way to write this filter, with something like
"multi_match" but applied to a filter, because as a query it can be
expressed as:
multi_match : { query : 1, fields : [ list of ids from parent ] }, which is
much shorter.

*Question 1)* Would it be possible to have a filter that expresses the same
thing as multi_match, but as a filter (named multi_match_filter?)?
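
In the meantime, the long bool/should form can at least be generated from the list of parent ids instead of being written by hand. The following is a minimal sketch, assuming the `filtered` query, the `bool` filter and the `term` filter of the ES query DSL of that period, and assuming each parent id is addressable as a field `parents.<id>` holding its distance. The helper names `parents_filter` and `level_body` are hypothetical.

```python
import json


def parents_filter(parent_ids):
    """Filter passing nodes that have at least one of parent_ids as direct parent."""
    if not parent_ids:
        # An empty "should" list would not express "no candidate parent".
        raise ValueError("no parent ids: the previous level matched nothing")
    return {
        "bool": {
            # One clause per candidate parent (assumed field path: parents.<id>).
            "should": [{"term": {"parents." + pid: 1}} for pid in parent_ids]
        }
    }


def level_body(name, parent_ids):
    """Request body for one level: field condition as query, parents as filter."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"name": name}},
                "filter": parents_filter(parent_ids),
            }
        }
    }


print(json.dumps(level_body("value2", ["id1", "id2"]), indent=2))
```

This does not shorten what is sent to the server; it only keeps the client code short.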

So let's say we now use a query and not a filter:

Level 1: query { match : { name : value1, isroot : 1 } }
Level 2: query { bool : {must : { match : { name : value2 } }, multi_match
: { query : 1, fields : [ list of ids from parent L1 ] } } }

Would it be possible to express the same thing in one step, as follows?

query { bool : {must : { match : { name : value6 } }, multi_match : { query
: 1, fields : [ subquery { results : _uid, match : { name : value1,
isroot : 1 } } ] } } }

The idea is to introduce something named subquery: an inner query (same
rules as a standard query) with one additional mandatory item, "results",
which names the field to retrieve for the outer request (here "_uid").
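
To illustrate the intended semantics, here is a minimal sketch that resolves such a `subquery` element on the caller's side before the outer query is sent. It is not an existing ES feature. The function `resolve`, the `search(query, field)` callback and the stand-in `fake_search` are hypothetical, the outer query is one possible reading of the example above, and the inner query is copied from it as-is.

```python
def resolve(node, search):
    """Return a copy of node where each {"subquery": {...}} is replaced by its results.

    search(query, field) is supplied by the caller: it runs query and returns
    the list of field values taken from the hits.
    """
    if isinstance(node, dict):
        if set(node) == {"subquery"}:
            spec = dict(node["subquery"])
            field = spec.pop("results")    # mandatory: field handed to the outer query
            inner = resolve(spec, search)  # nested subqueries are resolved first
            return search(inner, field)
        return {key: resolve(value, search) for key, value in node.items()}
    if isinstance(node, list):
        out = []
        for item in node:
            resolved = resolve(item, search)
            # A subquery inside a list contributes its values in place.
            if isinstance(item, dict) and set(item) == {"subquery"}:
                out.extend(resolved)
            else:
                out.append(resolved)
        return out
    return node


def fake_search(query, field):
    # Stand-in for a real search call: pretend the inner query matched two roots.
    return ["id1", "id2"]


outer = {"bool": {"must": [
    {"match": {"name": "value6"}},
    {"multi_match": {"query": 1, "fields": [
        {"subquery": {"results": "_uid",
                      "match": {"name": "value1", "isroot": 1}}}
    ]}},
]}}

resolved = resolve(outer, fake_search)
print(resolved["bool"]["must"][1])
# {'multi_match': {'query': 1, 'fields': ['id1', 'id2']}}
```

Resolved this way, the inner query is still a separate search, so the sketch shows the semantics only and says nothing about the performance of a server-side version.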

*Question 2)* Do you think this would be a good idea as a new feature?

I saw on:
https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/DE9bKaCRgGw
that people suggest implementing this through a plugin. However, when I
looked into the plugin architecture, I really did not see how to do it.

Question 3) Do you have any suggestions?

Best regards,
Frederic

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Just as Otis and Igor noted, you have to invent a kind of query template
language. In that template language, you have to define clear semantics
for substituting variables with query results, and then you can execute
nested queries.

The plugin implementation would be twofold: on the one hand the template
language implementation, and on the other hand the ES implementation of
the new "action". You can extend ES with a plugin that adds
request/response classes which can operate at index/shard/custom level.
The most feasible route would be to adapt the MoreLikeThis action, because
it already runs one query after another, though without templates: it
takes the source of a document to construct the query.
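
As an illustration of what the substitution semantics of such a template language might look like, here is a minimal sketch that works on the parsed query rather than on its text. The `{"$var": name}` placeholder syntax and the function `substitute` are invented for this example and are not part of ES.

```python
def substitute(template, bindings):
    """Replace every {"$var": name} node of a parsed template by bindings[name].

    An unbound variable raises KeyError, so a template never runs half-filled.
    """
    if isinstance(template, dict):
        if set(template) == {"$var"}:
            return bindings[template["$var"]]
        return {key: substitute(value, bindings) for key, value in template.items()}
    if isinstance(template, list):
        return [substitute(item, bindings) for item in template]
    return template


template = {"multi_match": {"query": 1, "fields": {"$var": "parent_ids"}}}
print(substitute(template, {"parent_ids": ["id1", "id2"]}))
# {'multi_match': {'query': 1, 'fields': ['id1', 'id2']}}
```

Substituting on the parsed structure means a variable can only replace a whole value, which avoids accidental matches inside field names or strings.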

Jörg

(In reply to Frederic Brégier's message of 28.03.13, 22:49.)


Hi,

Thanks, I will follow this suggestion and experiment. If it works, I could
come back with a plugin...

On question 1, did I miss something that acts like multi_match but on the
filter side?

Best regards,
Frederic

(In reply to Jörg Prante's message of Friday, 29 March 2013, 00:57:26 UTC+1.)


Hi,

Before I go deeper into ES coding, which as usual is not so easy at first
(I need to understand the overall logic), I have one question (at the
end):

  • considering 2 queries from the client side:
    a) { query { condition1 } }
    b) { query { condition2a, condition2b using [idx] } } where idx are the
    ids of the parents found by query a

  • changing to the following idea in one step:

{ query { condition2a, condition2b using [ subquery { results : _uid,
condition1 } ] } }

where subquery could be a new option of query, which is in fact a query
itself, but returns the fields specified in "results" (here _uid) as an
array.
I understand that this is not the way you propose (correct?).
In fact, it looks like "select from index where condition2a AND
condition2b using ( select _uid from index where condition1 )".

  • But more like the following :

_subq?subq_out=_uid&subq_in=result&index=idx

with body close to multi_search

{ query { condition1 } }
{ query { condition2a, condition2b using result } }

where condition2b could be for instance
multi_match : { query : 1, fields : result }

using the idea of creating a new Action/Language based on the principle of
MultiSearch, more than on MoreLikeThis, for a query named subq with the
following parameters:

  • subq_out = name of the field whose values form the final result of the
    current request (default = _uid, but could be any field, only one)
  • subq_in = name of the "value" that will be replaced in the next query
    (almost a string.replace) by the array of results ("result" replaced by
    the _uid array by default, or by an array of the values of the field
    named in the previous subq_out)

If more than 2 queries are specified, the same logic still applies (the
next query will replace the "result" string by the _uid list of the 2nd
query, and so on). We can of course imagine specifying index, subq_out and
subq_in between each request, in order to change the out/in relation.

{ index : idx1, subq_out : _uid }       // index and out apply to first request
{ query { condition1 } }
{ subq_in : result }                    // in applies to second request
{ index : idx2, subq_out : fieldname }  // index and out apply to second request
{ query { condition2a, condition2b using result } }
{ subq_in : result2 }                   // in applies to third request
{ query { condition3a, condition3b using result2 } }
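
The chaining described above can be sketched as a small client-side executor. This is a sketch under assumptions, not an ES action: the step layout (one dict per request, with the header fields flattened into it), the function `run_chain` and the `search(index, body, out_field)` callback are all hypothetical, and `fake_search` only stands in for a real search call.

```python
import json


def run_chain(steps, search):
    """Run steps in order, replacing each step's subq_in token by the previous results.

    search(index, body, out_field) is supplied by the caller and returns the
    list of out_field values taken from the hits.
    """
    previous = []
    for step in steps:
        text = step["body"]
        token = step.get("subq_in")
        if token is not None:
            # Plain text replacement, as described: the quoted token "result"
            # becomes a JSON array. Any other string equal to the token in the
            # body would be replaced as well.
            text = text.replace(json.dumps(token), json.dumps(previous))
        previous = search(step["index"], json.loads(text),
                          step.get("subq_out", "_uid"))
    return previous


calls = []


def fake_search(index, body, out_field):
    # Stand-in for a real search: record the call and return canned ids.
    calls.append((index, body, out_field))
    return ["id1", "id2"] if len(calls) == 1 else ["id7"]


steps = [
    {"index": "idx1", "subq_out": "_uid",
     "body": '{"query": {"match": {"name": "value1"}}}'},
    {"index": "idx2", "subq_out": "_uid", "subq_in": "result",
     "body": '{"query": {"multi_match": {"query": 1, "fields": "result"}}}'},
]

print(run_chain(steps, fake_search))  # ['id7']
print(calls[1][1])
# {'query': {'multi_match': {'query': 1, 'fields': ['id1', 'id2']}}}
```

Each step here is still one complete search that waits for the previous one, which is exactly the cost the question below is about.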

So my question is: will such a request be more efficient than the 2
previous ones?

If I understand how this could work, it will execute the first request,
then the second, so it takes exactly the same time as making 2 queries
sequentially from the client.
Or will those sequential requests be more efficient because of less
traffic?
I believe not, since ES will potentially have to query all nodes (to fetch
the data), so there is still network traffic. OK, this traffic will only be
between nodes, but I am not sure there is any improvement to gain here.

What do you think ?

Also, I know you said that using MoreLikeThis is probably better, but I
believe the logic is closer to MultiSearch from an external point of view.
I agree that internally it could work like MLT with sequential queries...
so it is probably a mix of the two. But I may have totally misunderstood.

Cheers,
Frederic

(Follow-up to Frederic Brégier's message of Friday, 29 March 2013, 08:38:38 UTC+1.)


Hi,

Unless I really misunderstood, I believe there is no gain in that approach,
since it is entirely up to the client to do the job.
Let me explain what I understood:

  • With MoreLikeThis (using the Java client), the client makes all the
    queries (the first to get the reference object, then the second to get
    all the mlt objects), which means all traffic is between the client and
    the servers, not between the servers (other than to gather the result
    of each request, of course).
  • With the approach I tried, using the MultipleSearch model but issuing
    sequential requests instead of parallel ones (since each request
    depends on the previous one), the model is the same as with MLT: all
    requests go out from the client, one by one.

Therefore, doing it myself in my own code will give the very same result as
implementing a MultipleSearch with additional parameters.

For your information, I was almost at the point of getting it working. I
extended MultipleSearch with 2 parameters: Out (the field to retrieve, by
default _uid) and In (the value to replace in the next request by the list
of Out values). It almost works, except that I struggle with the
information available at the point where In should be replaced by the Out
values, since I only have the "source" object, which is already prepared to
be sent (so not modifiable). But doing this showed me that only the client
was acting; the server was totally unaware of those constructions. I was
about to add more information to the MultipleSearch in order to keep the
native "String" source of the request, not the binary format.

So I still have 3 questions:
a) Did I understand correctly that, with MLT or MS, all requests are
generated on the client side and not within the server?

b) Is there a way to move the request building to the server side instead
of the client side?

c) Is there a way to retrieve the source of the request, so that I can
change the request dynamically (here string.replace(In,
Out.values_in_list_format))?

Cheers,
Frederic

(Follow-up to Frederic Brégier's message of Saturday, 30 March 2013, 00:49:06 UTC+1.)
