How to find docs containing a large number of nested docs?

Hi folks,

We're using nested docs to maintain a record of actions that were performed
on our primary docs: a field "actions" contains a list of action docs which
have some metadata (what action was performed, by whom, timestamp, etc.), and
they're indexed as nested docs.

Recently we had an automated process run amok, and it's applied huge
numbers of actions to some of the docs. I'd like to find which docs those
are, to clean up the mess. Ideally I'd like to rank (primary) docs by
number-of-nested-actions, but a simple filter ("all docs with more than
1000 actions") would be good enough too. But I don't see an obvious way to
do either. Any tips?

Cheers,
Tikitu

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/acafe2f5-57de-4397-a2aa-4ebd637e62fb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You could use a script on the nested array field that reads the length of
the array itself ("... actions.size()"), and use that value either for
sorting or in a script facet. Both will be slow and CPU-intensive, but at
least you could get the set out. Is that what you need?

Anne

--
Anne Veling
BeyondTrees.com
+31 6 50 969 170
@anneveling

Thanks for the suggestion Anne.

In fact I found a slightly tricky workaround: with score_mode: "sum" on a
match_all nested query the score on the main document effectively counts
the nested docs! (I'm not sure this is literally true, but certainly the
score grows monotonically with the number of nested docs, which is all I
need.) Putting a min_score on the top-level query gives me a
rough-and-ready selection of the worst offenders:

{
  "min_score": 3000,
  "query": {
    "nested": {
      "path": "actions",
      "score_mode": "sum",
      "query": {
        "match_all": {}
      }
    }
  }
}

This doesn't perform very well (a wait of several seconds on our 700M-doc
index), but it's acceptable for this quick-and-dirty investigation.
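
For anyone who wants to try other thresholds, here is a rough Python sketch
of the same idea. It builds the request body shown above for a given path and
min_score, and includes a toy model of why the summed score tracks the number
of nested docs. The helper names (build_nested_count_query, toy_sum_score,
worst_offenders) are made up for illustration, and the toy model assumes that
every nested doc matched by match_all contributes the same constant score; as
noted above, that may not be literally how the real scoring works.

```python
import json


def build_nested_count_query(path, min_score):
    """Build a search body like the one above for a given nested path.

    Hypothetical helper, not part of any Elasticsearch client library.
    """
    return {
        "min_score": min_score,
        "query": {
            "nested": {
                "path": path,
                "score_mode": "sum",
                "query": {"match_all": {}},
            }
        },
    }


def toy_sum_score(nested_docs, per_doc_score=1.0):
    """Toy model of score_mode "sum": add one score per nested doc.

    ASSUMPTION: each nested doc gets the same constant score. This is an
    illustration only, not Elasticsearch's actual scoring code.
    """
    return sum(per_doc_score for _ in nested_docs)


def worst_offenders(docs, min_score, per_doc_score=1.0):
    """Return (doc_id, score) pairs scoring at least min_score, highest first."""
    scored = [
        (doc_id, toy_sum_score(actions, per_doc_score))
        for doc_id, actions in docs.items()
    ]
    kept = [(doc_id, score) for doc_id, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Print the request body so it can be pasted into a search request.
    print(json.dumps(build_nested_count_query("actions", 3000), indent=2))

    # Made-up primary docs: id -> list of nested action docs.
    sample = {"a": [{}] * 5, "b": [{}] * 4000, "c": [{}] * 3000}
    print(worst_offenders(sample, 3000))
```

Under that constant-score assumption the threshold acts as a minimum count of
nested docs, so raising or lowering min_score widens or narrows the set of
offenders returned.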

Cheers,
Tikitu
