Sorting docs as input to facet phase


(Tikitu de Jager) #1

Hi folks,

I'm building a custom facet that would benefit greatly if I could feed it
its documents in a predefined order: the space requirements are much smaller
if I can guarantee that all documents that share the same value on a
particular field pass through the facet collector in one bunch.

I.e., this ordering is cheap (grouping by "tweet"):

{"tweet": 3, "label": 5}
{"tweet": 3, "label": 7}
{"tweet": 4, "label": 3}
{"tweet": 4, "label": 5}

but this is expensive:

{"tweet": 3, "label": 5}
{"tweet": 4, "label": 3}
{"tweet": 3, "label": 7}
{"tweet": 4, "label": 5}

(I'm only interested in aggregate statistics across all "tweet" values, but
I can't calculate the per-tweet value until I'm sure no more labels are
coming -- the actual case is somewhat more complicated, and involves some
timestamp calculations as well, but I think that's irrelevant.)
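
To make the space argument concrete, here is a minimal sketch in plain Java (no Elasticsearch types) of a collector that relies on the grouped ordering. Everything in it is an assumption for illustration: the class name GroupedStreamAggregator, the per-tweet statistic (the spread between the largest and smallest label) and the final aggregate (the mean of that spread) are hypothetical stand-ins for the real calculation. The point is only that state is kept for one tweet at a time and flushed as soon as the tweet value changes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: aggregates a stream of (tweet, label) pairs,
// assuming all pairs with the same tweet value arrive consecutively.
final class GroupedStreamAggregator {
    private boolean hasCurrent = false;
    private long currentTweet;
    private final List<Integer> currentLabels = new ArrayList<>();

    private long tweetCount = 0;
    private double sumOfPerTweetValues = 0.0;

    // Stand-in per-tweet statistic (an assumption): max label minus min label.
    static double perTweetValue(List<Integer> labels) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int label : labels) {
            min = Math.min(min, label);
            max = Math.max(max, label);
        }
        return max - min;
    }

    // Called once per document, in the order documents reach the collector.
    void collect(long tweet, int label) {
        if (hasCurrent && tweet != currentTweet) {
            flush(); // tweet value changed, so the previous group is complete
        }
        currentTweet = tweet;
        hasCurrent = true;
        currentLabels.add(label);
    }

    // Folds the finished group into the running aggregate and frees its state.
    private void flush() {
        sumOfPerTweetValues += perTweetValue(currentLabels);
        tweetCount++;
        currentLabels.clear();
    }

    // Flushes the last open group and returns the mean per-tweet value.
    double finish() {
        if (hasCurrent) {
            flush();
            hasCurrent = false;
        }
        return tweetCount == 0 ? 0.0 : sumOfPerTweetValues / tweetCount;
    }

    // Number of groups seen so far (only complete after finish()).
    long groupsSeen() {
        return tweetCount;
    }

    public static void main(String[] args) {
        GroupedStreamAggregator agg = new GroupedStreamAggregator();
        agg.collect(3, 5);
        agg.collect(3, 7);
        agg.collect(4, 3);
        agg.collect(4, 5);
        System.out.println(agg.finish());
    }
}
```

With the interleaved ordering the same collector would close each group too early and treat every document as its own tweet, so handling that case needs a map with one entry per tweet value. That map is the space cost I'd like to avoid.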

Is there any way to achieve this? I'm thinking maybe using nested documents
("tweet" is actually a parent-doc ID, but I was hoping to use parent/child
docs to avoid the reindexing requirement).

Regards,
Tikitu

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Tikitu de Jager) #2

In case anyone finds this in the archives: as far as I can see this is
basically not possible at all with parent/child docs (at least without
reimplementing sorting yourself).

Nested docs, on the other hand, guarantee that the parent and its children
are adjacent in the same segment (the children sit immediately before the
parent); it's quite easy to write a custom Collector that hands off the
parent and child docIds to separate (specialised versions of) doCollect():

/**
 * Modified from
 * org.elasticsearch.index.search.nested.NestedChildrenCollector.java
 *
 * Collect root doc first, then all nested docs; send them to different
 * methods.
 */
@Override
public void collect(int parentDoc) throws IOException {
    // Doc 0 has no docs before it, and without parentDocs there is nothing to walk.
    if (parentDoc == 0 || parentDocs == null) {
        return;
    }
    doCollectParent(parentDoc);
    // Nested docs occupy the docIds between the previous parent and this one.
    int prevParentDoc = parentDocs.prevSetBit(parentDoc - 1);
    for (int i = (parentDoc - 1); i > prevParentDoc; i--) {
        // Skip deleted docs and docs that are not in childDocs.
        if (!currentReader.isDeleted(i) && childDocs.get(i)) {
            doCollectChild(i);
        }
    }
}
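
For anyone who wants to see the docId walk without a Lucene index at hand, the loop above can be simulated in plain Java. This is only a sketch under assumptions: java.util.BitSet stands in for the Lucene bit sets (its previousSetBit() plays the role of prevSetBit()), a third BitSet stands in for currentReader.isDeleted(), and the class name NestedBlockWalk and the events list are hypothetical helpers that just record the order of the calls.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical simulation of the collector's walk over one segment.
// Assumed layout: each block is [child, child, ..., parent].
final class NestedBlockWalk {
    private final BitSet parentDocs; // docIds that are root (parent) docs
    private final BitSet childDocs;  // docIds that match the child filter
    private final BitSet deleted;    // stand-in for currentReader.isDeleted(i)
    final List<String> events = new ArrayList<>();

    NestedBlockWalk(BitSet parentDocs, BitSet childDocs, BitSet deleted) {
        this.parentDocs = parentDocs;
        this.childDocs = childDocs;
        this.deleted = deleted;
    }

    // Mirrors the collect() method above: parent first, then its children,
    // walking backwards until the previous parent is reached.
    void collect(int parentDoc) {
        if (parentDoc == 0 || parentDocs == null) {
            return;
        }
        events.add("parent:" + parentDoc);
        // previousSetBit returns -1 when there is no earlier parent.
        int prevParentDoc = parentDocs.previousSetBit(parentDoc - 1);
        for (int i = parentDoc - 1; i > prevParentDoc; i--) {
            if (!deleted.get(i) && childDocs.get(i)) {
                events.add("child:" + i);
            }
        }
    }

    // Builds a BitSet with the given bits set.
    static BitSet bits(int... indexes) {
        BitSet set = new BitSet();
        for (int index : indexes) {
            set.set(index);
        }
        return set;
    }

    public static void main(String[] args) {
        // Two blocks: docs 0,1 are children of 2; docs 3,4 are children of 5.
        NestedBlockWalk walk = new NestedBlockWalk(bits(2, 5), bits(0, 1, 3, 4), bits(3));
        walk.collect(2);
        walk.collect(5);
        System.out.println(walk.events);
    }
}
```

Each parent is followed directly by its own children and nothing else, which is exactly the "one bunch" ordering I was after. Doc 3 is marked deleted in the example, so it is skipped.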

On Tuesday, 3 September 2013 16:31:39 UTC+3, Tikitu de Jager wrote:

Hi folks,

I'm building a custom facet that would benefit greatly if I could feed it
its documents in a predefined order: the space requirements are much smaller
if I can guarantee that all documents that share the same value on a
particular field pass through the facet collector in one bunch.

I.e., this ordering is cheap (grouping by "tweet"):

{"tweet": 3, "label": 5}
{"tweet": 3, "label": 7}
{"tweet": 4, "label": 3}
{"tweet": 4, "label": 5}

but this is expensive:

{"tweet": 3, "label": 5}
{"tweet": 4, "label": 3}
{"tweet": 3, "label": 7}
{"tweet": 4, "label": 5}

(I'm only interested in aggregate statistics across all "tweet" values,
but I can't calculate the per-tweet value until I'm sure no more labels are
coming -- the actual case is somewhat more complicated, and involves some
timestamp calculations as well, but I think that's irrelevant.)

Is there any way to achieve this? I'm thinking maybe using nested
documents ("tweet" is actually a parent-doc ID, but I was hoping to use
parent/child docs to avoid the reindexing requirement).

Regards,
Tikitu


