Sorting Plugin Development


(davrob) #1

Hi,

I would like help in creating a plugin for sorting Nested properties, like
below:

"fullName": { "type": "string" },
"customColumns": {
  "type": "nested",
  "properties": {
    "ccId": { "type": "string", "index": "not_analyzed" },
    "value": { "type": "string", "analyzer": "customized" },
    "valueLC": { "type": "string", "index": "not_analyzed" }
  }
}

There are zero or one occurrences of value (or valueLC for sorting) for
each ccId per parent document.

So the process will be:

i) Do the search on other top-level fields, e.g. "fullName".
ii) Sort the results according to the value associated with a particular
ccId (0 or 1 per parent document).

In pseudo-code the process would be:

results.sort("customColumns.value").nestedWhereClauseEquals("ccId", "520").

I haven't seen any sorting plugins; the nearest I have seen are faceting
plugins. Any examples would be very welcome.

Since there are 5 shards per index, I can see this must be some kind of
map/reduce process to collect the sorted results from each shard, then do a
final sort.
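
The per-shard sort followed by a final merge could be sketched roughly as below. This is only an illustration in Python, not Elasticsearch code: it assumes each shard has already pulled the value for the requested ccId into a hypothetical "sortValue" field (None when the parent document has no entry for that ccId), and the function names are made up for the example.

```python
import heapq
from itertools import islice

def key(hit):
    # Hits with no value for the requested ccId (None) sort after all others.
    value = hit["sortValue"]
    return (value is None, value or "")

def sort_shard(hits):
    """'Map' step: each shard sorts its own hits."""
    return sorted(hits, key=key)

def merge_shards(shard_results, size):
    """'Reduce' step: merge the pre-sorted shard lists and keep the top `size`."""
    return list(islice(heapq.merge(*shard_results, key=key), size))

# Two toy shards; "sortValue" is a hypothetical, pre-extracted field.
shard_a = [{"fullName": "Ann", "sortValue": "pear"},
           {"fullName": "Bob", "sortValue": None}]
shard_b = [{"fullName": "Cy", "sortValue": "apple"},
           {"fullName": "Di", "sortValue": "zebra"}]

top = merge_shards([sort_shard(s) for s in (shard_a, shard_b)], size=3)
names = [hit["fullName"] for hit in top]
print(names)
```

Because every shard list is already sorted, the merge step only has to compare the heads of the lists, which is why the final sort is cheap compared with the per-shard work.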

  • David.

--


(phill) #2

On 8/23/2012 7:56 AM, davrob2 wrote:

[...]

I've been looking into various needs that include sorting. What I know
that I think is related:
You can both sort on multiple fields and use a script to generate a
value to sort by.
http://www.elasticsearch.org/guide/reference/api/search/sort.html
See the section "Script Based Sorting",
but as it says there, using a custom_score_query can be quicker.
http://www.elasticsearch.org/guide/reference/query-dsl/custom-score-query.html
which can be used to generate just the right value to sort by, assuming
that would help.

In either case you can also register a Java "script", if you're not able
to tackle the problem of generating a special sort value with MVEL.

But what I see is a query for documents where fullName=x and
customColumns.ccId=520, right?

I think your "nested where" must be an idea from SQL. I'm not enough of
an SQL expert to understand a where nested in a sort and how it differs
from a general where for the selection or, in the ES case, just in the
query.

And I don't get what you mean when you specify ccId=520 and in step ii)
you say you'll sort by that. Aren't you only getting ccId with a value of
520? I must be missing the use-case here.

Getting to the fields in a nested object is done with a nested query,
which I haven't used.

But if you just have one complex object, you can use a path to it with
dots (which I find inadequately described on the site); see
http://www.elasticsearch.org/guide/reference/query-dsl/field-query.html
which includes "name.first".

Is that all of your document? If so, I might consider storing the _source,
so that you can just use _source.custom.ccId, but I don't know if
this is appropriate for you.
I don't store source myself, but that might be an option if you just
want to specify the field in a script_field.

-Paul

--


(davrob) #3

Hi Paul,

Yes - I'm familiar with script sorting - this is my script for sorting
custom columns https://gist.github.com/3486870

Unfortunately, as the JavaDoc on the class explains, each document needs its
own score and is totally unaware of other docs, so I can't do a
Comparator-style sort. 32 or, effectively, 24 bits' worth of float data does
not go very far when trying to discriminate one string from all possible
other strings in the universe. In fact, for 2-column sorting I only got 2
letters from each column before running out of float numbers.
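
To illustrate the precision limit, here is a rough sketch in Python. The encoding is an assumption made for this example (letters a..z packed as base-27 digits), not what the gist does. A 32-bit float has a 24-bit significand, and 27^5 < 2^24 < 27^6, so five letters fit exactly but a sixth can be lost.

```python
import struct

def to_float32(x):
    """Round a number to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def prefix_score(s, width):
    """Encode the first `width` characters as a base-27 integer, then as float32.
    a..z map to 1..26; padding and any other character map to 0 (assumed scheme)."""
    n = 0
    for ch in s.lower()[:width].ljust(width):
        digit = ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0
        n = n * 27 + digit
    return to_float32(n)

# Five letters still produce distinct, correctly ordered scores.
five_ok = prefix_score("zzzzy", 5) < prefix_score("zzzzz", 5)

# With six letters, strings that differ only in the last letter can collide.
six_collide = prefix_score("zzzzzy", 6) == prefix_score("zzzzzz", 6)

print(five_ok, six_collide)
```

Splitting the same 24 bits across two sort columns leaves only about two letters per column under this scheme, which matches the behaviour described above.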

The Lucene Collector that collects the results from scripts is, from my
brief look at the source code, a ScoreDocCollector, which just gets the
score (a float) from each doc and sorts according to the score.

I really need to be able to compare each string to the other and sort the
docs using a comparator - or some equivalent map-reduce style comparator,
that will be able to compare one document's value to another.

When you say "Aren't you only getting ccId with a value of 520?" - the
answer, really, is no. What I am effectively doing is the equivalent of an
outer join on this value, so my results contain the value associated
with ccId=520 plus lots of other data. I can't just use a nested
filter, because that would get rid of all my other data.
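
The difference between the two behaviours can be sketched in plain Python as follows. This is an illustration of the semantics only, with made-up function names and sample data, not Elasticsearch code.

```python
def nested_filter(docs, cc_id):
    """Inner-join style: keep only docs that have an entry for cc_id."""
    return [d for d in docs
            if any(c["ccId"] == cc_id for c in d["customColumns"])]

def outer_join_sort(docs, cc_id):
    """Outer-join style: keep every doc; sort by the matching valueLC, missing last."""
    def key(doc):
        values = [c["valueLC"] for c in doc["customColumns"] if c["ccId"] == cc_id]
        return (not values, values[0] if values else "")
    return sorted(docs, key=key)

# Hypothetical search hits (already matched on top-level fields such as fullName).
hits = [
    {"fullName": "Ann", "customColumns": [{"ccId": "520", "valueLC": "pear"}]},
    {"fullName": "Bob", "customColumns": [{"ccId": "7", "valueLC": "zebra"}]},
    {"fullName": "Cy", "customColumns": [{"ccId": "520", "valueLC": "apple"}]},
]

filtered = [d["fullName"] for d in nested_filter(hits, "520")]
joined = [d["fullName"] for d in outer_join_sort(hits, "520")]
print(filtered, joined)
```

The filter drops Bob, who has no entry for ccId 520, while the outer-join sort keeps him and places him after the documents that do have a value.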

  • David

On Thursday, August 23, 2012 7:49:28 PM UTC+1, P Hill wrote:

[...]

--


(phill) #4

On 8/27/2012 2:35 AM, davrob2 wrote:

Hi Paul,

Yes - I'm familiar with script sorting - this is my script for sorting
custom columns https://gist.github.com/3486870

Unfortunately, as the JavaDoc on the class explains, each document
needs its own score and is totally unaware of other docs, so I can't do
a Comparator-style sort.

[...]

I really need to be able to compare each string to the other and sort
the docs using a comparator - or some equivalent map-reduce style
comparator, that will be able to compare one document's value to another.

Your description of AbstractFloatSearchScript,
AbstractDoubleSearchScript and AbstractLongSearchScript seems correct to
me, but the definition of a sort script shows a type:

"sort" : {
  "_script" : {
    "script" : "doc['field_name'].value * factor",
    "type" : "number",
    "params" : {
      "factor" : 1.1
    },
    "order" : "asc"
  }
}

-- http://www.elasticsearch.org/guide/reference/api/search/sort.html

That, and the fact that any query can sort by other string fields,
suggests that you ought to be able to sort by a calculated "string" value.
Have you tried that?
NativeScriptEngineService calls ExecutableScript.run() directly, so I'm
thinking there might be a way forward for you there.
Note: I'm just exploring this area myself, so don't take my word for it,
but maybe there is a way forward for you (and maybe me too).
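
As an untested sketch of what such a request might look like, the Python below just builds the JSON body. The script name "customColumnValue" is hypothetical (it would have to be a native script registered on the nodes), and whether this Elasticsearch version accepts "type": "string" together with "lang": "native" in a script sort is an assumption that nothing in this thread confirms.

```python
import json

def script_sort_body(cc_id, order="asc"):
    """Build a search body that sorts by a string returned from a script."""
    return {
        "query": {"match_all": {}},
        "sort": {
            "_script": {
                # Hypothetical native script that returns valueLC for params.ccId.
                "script": "customColumnValue",
                "lang": "native",
                # Assumption: "string" is accepted here in place of "number".
                "type": "string",
                "params": {"ccId": cc_id},
                "order": order,
            }
        },
    }

body = json.dumps(script_sort_body("520"), indent=2)
print(body)
```

If a string type is honoured, the comparison would happen on the returned strings rather than on a float score, which is the property being looked for here.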

-Paul

--


(phill) #5

Maybe the way out of your conundrum is to redesign your document.

Looking at it another way, if you want a result
"fullName", "ccId", "valueLC" sorted by valueLC
where
"There is one or zero occurrences of value (or ValueLC for sorting) for
each ccId per parent document."
design a flattened document that repeats some of the data
("fullName" for each combination of ccId+value) instead of a nested
customColumns of ccId, value, valueLC. That is, instead of:
|
"fullName": { "type": "string" },
"customColumns": {
  "type": "nested",
  "properties": {
    "ccId": { "type": "string", "index": "not_analyzed" },
    "value": { "type": "string", "analyzer": "customized" },
    "valueLC": { "type": "string", "index": "not_analyzed" }
  }
}
|

use:

|
{
  ...
  "fullName": { "type": "string" },
  "ccId": { "type": "string", "index": "not_analyzed" },
  "value": { "type": "string", "analyzer": "customized" },
  "valueLC": { "type": "string", "index": "not_analyzed" }
}
|
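
The flattening could be sketched like this in Python. It is an illustration with made-up sample data and helper names, not indexing code; it only shows how one nested document becomes several flat ones that can be filtered and sorted with ordinary field operations.

```python
def flatten(doc):
    """Yield one flat document per customColumns entry, repeating fullName."""
    for col in doc.get("customColumns", []):
        yield {
            "fullName": doc["fullName"],
            "ccId": col["ccId"],
            "value": col["value"],
            "valueLC": col["valueLC"],
        }

def search_sorted(flat_docs, cc_id):
    """Filter on ccId, then sort on valueLC as an ordinary top-level field."""
    hits = [d for d in flat_docs if d["ccId"] == cc_id]
    return sorted(hits, key=lambda d: d["valueLC"])

# Hypothetical nested documents.
nested_docs = [
    {"fullName": "Ann", "customColumns": [
        {"ccId": "520", "value": "Pear", "valueLC": "pear"},
        {"ccId": "7", "value": "X", "valueLC": "x"},
    ]},
    {"fullName": "Cy", "customColumns": [
        {"ccId": "520", "value": "Apple", "valueLC": "apple"},
    ]},
]

flat_docs = [flat for doc in nested_docs for flat in flatten(doc)]
result = [d["fullName"] for d in search_sorted(flat_docs, "520")]
print(len(flat_docs), result)
```

Note that in this sketch a parent with no entry for the requested ccId produces no flat document for it, so it would not appear in the sorted result at all.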

Of course, if the top-level doc (the one containing fullName) contains
"megs" of data that you don't want to repeat in each flattened document,
that would not be the simplest solution.
For a very similar problem I'm having (after a suggestion on this list
and discussion of child docs and nested docs), I'm wondering if the right
solution for me is to filter, query and sort a smaller document that
contains only a reference to the "megs" of data, while everything
else is repeated in all occurrences for easier filtering and sorting.

|
"denormalizedCustomerColumns": {
  ...
  "fullName": { "type": "string" },
  "ccId": { "type": "string", "index": "not_analyzed" },
  "value": { "type": "string", "analyzer": "customized" },
  "valueLC": { "type": "string", "index": "not_analyzed" },
  "megsOfData": {
    "_parent": {
      "type": "hugeType"
    }
  }
}
|

I've shown it here as a parent reference, even though in your case it
looks to be 1:1. It could be some other field that can be used to get to
the rest of the data, for example if fullName is actually a unique key.

Some discussion here:
http://www.spacevatican.org/2012/6/3/fun-with-elasticsearch-s-children-and-nested-documents/
This article's second sentence is:
"Sometimes, you’re better off denormalizing all data into the child
documents." which is a shorter way to say all I've suggested above.

-Paul

--

