Access nested/children doc values with Lucene from FilterScript plugin

Hi,
I'm currently trying to prototype a filter plugin. The objectives are:

  • Optimize access to the nested documents. Currently (in a Painless script, for example) the only way to process nested entities is to process the _source of the parent. I would like to measure whether accessing the doc values of the nested entities gives better performance.

  • Make it possible to access the source and doc values of children documents (which currently seems impossible by any means, e.g. from a Painless script).

In a filter plugin, we don't have access to the SearchContext, so I run low-level Lucene queries directly against the LeafReader.

I'm able to query children documents. Due to the parent/child data structure, I have to loop over all the segments of the shard. Something like:

// Read the id of the current (parent) document from its doc values.
SortedSetDocValues sortedSetDocValues = context.reader().getSortedSetDocValues("id");
// advanceExact returns false when the document has no value for the field.
if (sortedSetDocValues != null && sortedSetDocValues.advanceExact(doc)) {
    BytesRef bytesRef = sortedSetDocValues.lookupOrd(sortedSetDocValues.nextOrd());
    String id = bytesRef.utf8ToString();
    // Select the "activity" documents whose join field points to this case.
    BooleanQuery booleanQuery = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("esType", "activity")), org.apache.lucene.search.BooleanClause.Occur.FILTER)
            .add(new TermQuery(new Term("esJoin#case", id)), org.apache.lucene.search.BooleanClause.Occur.FILTER)
            .build();
    TopDocs theChildren = searcher.search(booleanQuery, 10);
}

Q: are there ways to optimize this query, e.g. can I find a way to exploit the global ordinals cache at this level?

For nested documents, I understand that what defines a nested document is its position in the segment relative to its parent. The nested documents of a parent are all the documents with a certain _type that appear before the parent and after the previous parent document.

Elasticsearch optimizes nested queries with a Lucene ToParentBlockJoinQuery, which uses a bitset marking all the parent documents in the segment.

Q: in a filter plugin, I only have the id of the parent document that I'm currently scoring, and the LeafReaderContext.
Is there a good way to find the nested documents? Currently I have the feeling that the only way would be to store a reference to the parent on each nested document and to query on this reference.
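
The block layout described above can be illustrated without Lucene. The following is a minimal sketch, assuming the parent bitset is available as a plain java.util.BitSet and that every document between two parents belongs to the later parent. The class and method names (NestedBlockSketch, nestedDocsOf) are hypothetical and not part of any Elasticsearch or Lucene API. In a real segment the bitset would have to be built from a parent filter, and with several nested levels the range would presumably need an extra filter on the nested path.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class NestedBlockSketch {

    /**
     * Returns the segment-local doc ids stored between the previous parent
     * and parentDoc, i.e. the nested documents of parentDoc in a block layout.
     * Assumption: `parents` has one bit set per parent document of the segment.
     */
    static List<Integer> nestedDocsOf(BitSet parents, int parentDoc) {
        if (!parents.get(parentDoc)) {
            throw new IllegalArgumentException("doc " + parentDoc + " is not a parent");
        }
        // previousSetBit returns -1 when there is no earlier parent,
        // so the block then starts at doc 0.
        int previousParent = parents.previousSetBit(parentDoc - 1);
        List<Integer> nested = new ArrayList<>();
        for (int doc = previousParent + 1; doc < parentDoc; doc++) {
            nested.add(doc);
        }
        return nested;
    }

    public static void main(String[] args) {
        // Hypothetical segment: docs 0-1 nested under 2, doc 3 has no nested docs,
        // docs 4-6 nested under 7.
        BitSet parents = new BitSet();
        parents.set(2);
        parents.set(3);
        parents.set(7);

        System.out.println(nestedDocsOf(parents, 2)); // [0, 1]
        System.out.println(nestedDocsOf(parents, 3)); // []
        System.out.println(nestedDocsOf(parents, 7)); // [4, 5, 6]
    }
}
```

With such a bitset, the nested documents of a parent are found with one backwards bit scan instead of a query on a stored parent reference.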

(Message to Adrien & Jim, if you come by: at es-on Paris you told me that it wasn't a good idea to try to process children in a filter, but I'm stubborn and I'm not sure it won't be useful for our use case.)


Based on your objectives, I don't think you should pursue developing a custom plugin. It is way too complicated, and I think you can achieve your objectives without writing one.

Optimize access to the nested documents. Currently (in a Painless script, for example) the only way to process nested entities is to process the _source of the parent. I would like to measure whether accessing the doc values of the nested entities gives better performance.

This is possible via the search API: you can access doc values fields of nested documents, provided that you're in the right context. For example, you can access them via inner hits with docvalue fields: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-inner-hits.html#nested-inner-hits

Also, from a script query or a function score query, for example, you can access the doc values fields of nested documents, provided that the query is wrapped in a nested query.

Make it possible to access the source and doc values of children documents (which currently seems impossible by any means, e.g. from a Painless script).

Also, with inner hits it is possible to access the nested document's part of the _source. However, ES needs to perform a trick to include only the part of the source that is relevant to the nested document.

If you just plan to access doc values fields, I recommend you take a look at this:
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-inner-hits.html#nested-inner-hits-source

Hi Martijn, nice to hear from you.

Our use case involves "activity" entities. A bunch of them (hundreds or thousands) can be related to the same "case" entity. In the index this is materialized by a child (activity) -> parent (case) relation.
"activity" entities also carry a nested collection of "metadata" (name -> value).

Our analysis consists of selecting cases or activities with criteria involving other activities of the same case.
For example, we want to find all the cases (scrolling over case entities) for which an activity of type A (defined by some metadata pattern) is compared (e.g. a delay between two dates) with another activity of type B related to the same case and selected with another pattern.
Another example: we need to find all the activities (scrolling over activity entities) that are related to a case which contains other activities selected with different patterns.

The only way we found to do this kind of analysis with the current ES toolset involves a mix of composite aggregation, children, bucket_script and bucket_selector, but it's not satisfactory.

I don't think that what you suggest meets our needs. In a nested filter context we can indeed access the doc values of the nested entities, but what we need is to compare different nested (or children) entities selected with distinct filters.

Can you provide a search request that you tried to build but that didn't return the results you needed? Please also specify the results you expect to be returned. I think this way it is easier for me to see what you're trying to achieve and whether it is possible with features that come out of the box with ES.

Our mappings look like these pseudo-mappings (a small part of the real structure):
(in ES 6, it's almost the same, with a join field instead of a _parent)

index cases
    case
        action:keyword
        action.raw:text
        start:date
        end:date
    activity
        _parent:type:case
        action:keyword
        action.raw:text
        start:date
        end:date
        metas:nested_array
            metas.name:keyword
            metas.name.raw:text
            metas.value:keyword
            metas.value.raw:text

Now, for example, I want to select all the cases that:
- have at least one child activity with action.raw:action_a
- have at least one child activity with action.raw:action_b
- meet this criterion: activities(action.raw:action_a).min(start) - activities(action.raw:action_b).min(start) >= 1d

(This example is very simplistic. In reality, activities often have to be selected with criteria on the metas nested array, and cross-children criteria can also be very complex.)
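
To make the criterion concrete, here is a minimal in-memory sketch in plain Java, outside of Elasticsearch. The Activity class, the method names and the sample dates are hypothetical. It only shows the comparison that should be evaluated per case, not how to run it efficiently on a shard.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class CrossActivityCriterion {

    /** Hypothetical in-memory view of one child activity of a case. */
    static final class Activity {
        final String action;
        final Instant start;

        Activity(String action, Instant start) {
            this.action = action;
            this.start = start;
        }
    }

    /** Earliest start among the activities with the given action, if any. */
    static Optional<Instant> minStart(List<Activity> activities, String action) {
        return activities.stream()
                .filter(a -> a.action.equals(action))
                .map(a -> a.start)
                .min(Instant::compareTo);
    }

    /** True when both actions exist and min(start of action_a) - min(start of action_b) >= minDelay. */
    static boolean matches(List<Activity> activities, Duration minDelay) {
        Optional<Instant> a = minStart(activities, "action_a");
        Optional<Instant> b = minStart(activities, "action_b");
        if (!a.isPresent() || !b.isPresent()) {
            return false; // the case must have at least one activity of each kind
        }
        return Duration.between(b.get(), a.get()).compareTo(minDelay) >= 0;
    }

    public static void main(String[] args) {
        // Sample activities of one case (made-up dates).
        List<Activity> activities = Arrays.asList(
                new Activity("action_a", Instant.parse("2018-01-05T00:00:00Z")),
                new Activity("action_a", Instant.parse("2018-01-03T00:00:00Z")),
                new Activity("action_b", Instant.parse("2018-01-01T00:00:00Z")),
                new Activity("action_b", Instant.parse("2018-01-02T00:00:00Z")));

        // min(action_a) = Jan 3, min(action_b) = Jan 1, so the delay is 2 days.
        System.out.println(matches(activities, Duration.ofDays(1))); // true
    }
}
```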

If case.activities were a nested array, we could use a Painless script on the case _source field to implement activities(action.raw:action_a).min(start) - activities(action.raw:action_b).min(start) >= 1d. But activities have to be children, because we also have to find activities (/cases/activity/_search) with the same kind of complex filters.

The only way we found to implement this kind of cross-children filter is with a terms aggregation whose terms are the case ids.
Pseudo-query:

POST /cases/case/_search {
    query
        bool.filter
            has_child(activity, action.raw:action_a)
            has_child(activity, action.raw:action_b)
    aggs
        terms(id, size:1000)
            action_a: children(activity)
                filter(action.raw:action_a)
                    min_start: min(start)
            action_b: children(activity)
                filter(action.raw:action_b)
                    min_start: min(start)
            bucket_selector(action_a.min_start - action_b.min_start >= 1d)
}

But this approach has flaws:
- the terms() approach does not scale with a big case population (though we had some success with the composite agg at this level)
- the selected cases can't be aggregated further (except with pipeline aggs). For example, on the cases selected this way, we would need a terms agg on case.action

I have the feeling that the core ES dev team members are not big fans of child/parent joins, because they are a bottleneck (compared to other features) in terms of performance. But please don't forsake them, they're a damn good tool for some use cases :wink:
