Low-level Lucene queries in Scripts Plugins

Hi,

During the last Elastic On in Paris, I met Adrien & Jim to talk about our use case.
In our platform we use ES to store hierarchical data, and neither ES nor Lucene is very good with this kind of document.
So I wanted their advice about:

  • creating a "reverse-children" aggregation, which should help us
  • creating an expression script language that can handle a document's "children" with good performance (relying on doc values instead of the document source)

Since Elastic On, I have had some success with these tasks, but I had to dig deep into the Lucene level to get good performance.
In particular, I had to rely on global ordinal maps and/or BitSets, which are supposed to be cached.

I didn't find any clean way to use the ES infrastructure that is in charge of these global ordinals and BitSet filter caches.

So I had to patch Elasticsearch (v6.2.3) to benefit from these caches.
My patch consists of:

A modification of the FilterScript & SearchScript Factories to access QueryShardContext:

    /** A factory to construct stateful {@link SearchScript} factories for a specific index. */
    public interface Factory {
        LeafFactory newFactory(Map<String, Object> params, SearchLookup lookup, QueryShardContext queryShardContext);
    }

This makes the QueryShardContext (and thus the global ordinals and BitSet filter caches) accessible from scripts. It impacts about 15 places in the server and modules code.

A map, and its accessor, on QueryShardContext:

    public class QueryShardContext extends QueryRewriteContext {
        private final Map<String, Object> freeMap;

I use this map as a cache for scripts that are used in several contexts of the same query (e.g. sorting, filtering and aggregation).
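To illustrate the caching idea only, here is a minimal sketch of a per-query map where the first script context that needs a value builds it and later contexts reuse it. `PerQueryCache`, `getOrBuild` and `builds` are hypothetical names for this example, not Elasticsearch classes or methods; in the patch described above the map lives on QueryShardContext. The sketch assumes the map is only touched by one thread during a shard request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for the "freeMap" held by one QueryShardContext. */
final class PerQueryCache {
    private final Map<String, Object> freeMap = new HashMap<>();
    private int builds = 0; // counts how many values were actually computed

    /** Returns the cached value for key, computing it on first access only. */
    @SuppressWarnings("unchecked")
    <T> T getOrBuild(String key, Supplier<T> builder) {
        Object cached = freeMap.get(key);
        if (cached == null) {
            cached = builder.get();
            freeMap.put(key, cached);
            builds++;
        }
        return (T) cached;
    }

    int builds() {
        return builds;
    }
}
```

With this shape, a sort script and a filter script that ask for the same key during the same query share one instance, so the expensive build step runs once.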

Patching the ES server like this will be a pain for us (in terms of maintenance, upgrades, etc.).

Do you ES folks think these modifications should be integrated into ES?
Or maybe it's a deliberate design choice that scripts can't access QueryShardContext (maybe for security reasons)?
Or maybe there is a better way to use the global ordinals caches in scripts?

Thanks
Franck

Pinging @jimczi and @jpountz as mentioned here.

Or maybe it's a deliberate design choice that scripts can't access QueryShardContext (maybe for security reasons)?
Or maybe there is a better way to use the global ordinals caches in scripts?

I don't see how you would use the global ordinals; can you explain the use case and what your script does?
Also, even if you have access to global ordinals, you cannot access the children in a script, so what is the purpose here? Regarding the map to share state between scripts executed in different contexts (sorting, aggs, ...), I think it's dangerous, and again I'd like to understand the real use case before looking at the solution. Sorry if you already explained your use case during Elasticon Paris, but it was some time ago ;).
I think we also discussed the possibility to denormalize your data; this would duplicate the information but would also greatly simplify the handling of relationships within a document. Is it something you tried?

Hi Jim,
Thanks for your answer.

I will try to explain my use case and to be concise (not so easy).
Our main dataset consists of low-level events (logs, DB records...), basically the kind of data that is usually injected into ES when it is used as a log centralizer.

The purpose of our platform is to do process mining on these logs (if you google "process mining", you will find it's a way to infer business processes from low-level events).

So these events, which we call "Activities", are ingested into our pipeline and come back correlated and enriched. In particular:

  • each activity is now linked to a "parent" activity, up to a "top" activity
  • each activity also references directly its "top-Activity"

A top-Activity and all its descendants are called a Case. The top activities are often events related to high-level applications (sales frontend, public webapp, etc.).

Activities are stored in ES; the Activity -> Case relation is modeled as an ES child-parent join.

So we can do real-time analytics on these Cases.
With standard ES queries/aggregations we can, for example, answer the following questions:

  • for a given process (a population of Cases that have the same kind of top-Activity), which platforms (middleware, backends, etc.) are used and how much time is spent in each of them
  • find all cases for a given customer
  • find all cases that have a given activity but don't have another activity X hours after the beginning of the case
  • compute the e2e time distribution (histogram) for a given process
  • find e2e time anomalies in some process (we plan to enhance this feature using XPack ML)

What I am trying to do now is to query/aggregate metrics that involve several Activities of the same Case.
For example, I want to answer this kind of question:

  • for a given process, what is the average time between the first occurrence of activity A from platform 1 and the last occurrence of any activity coming from platform 2
  • same query, get the histogram of the delay
  • considering a population of Activity B, what is the distribution of its parents in a given platform C.

So we have implemented a custom script engine (something that looks like your ES expression lang) that implements this kind of grammar:

"<some query on children>.<sort criteria>.<child property> <operator> <some query on children>.<sort criteria>.<child property>"
example :
"activities(role:some_backend).firstBy(end).end - activities(role:some_middleware).lastBy(start).start"
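As a rough illustration of how one member of this grammar could be split into its parts, here is a minimal sketch using a regular expression. The `Member` class, its field names and the exact pattern are assumptions made for this example, not the actual script engine; it only accepts the `firstBy` / `lastBy` spelling used in the example above and treats the child query as an opaque string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parsed form of one member, e.g. activities(role:x).firstBy(end).end */
final class Member {
    // activities(<query>).<firstBy|lastBy>(<sort field>).<property>
    private static final Pattern PATTERN = Pattern.compile(
            "activities\\(([^)]*)\\)\\.(firstBy|lastBy)\\((\\w+)\\)\\.(\\w+)");

    final String query;     // child query, kept as raw text
    final boolean first;    // true for firstBy, false for lastBy
    final String sortField; // field used to order the matching children
    final String property;  // child field whose value the member yields

    private Member(String query, boolean first, String sortField, String property) {
        this.query = query;
        this.first = first;
        this.sortField = sortField;
        this.property = property;
    }

    /** Parses one member; throws IllegalArgumentException if the text does not match. */
    static Member parse(String text) {
        Matcher m = PATTERN.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a member: " + text);
        }
        return new Member(m.group(1), m.group(2).equals("firstBy"), m.group(3), m.group(4));
    }
}
```

A full expression would then be two such members joined by an operator; a real implementation would more likely use a proper parser than a single regular expression.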

We can use this grammar:

  • in filter context ("activities(role:some_backend).first(end).end - activities(role:some_middleware).last(start).start > 1h")
  • in script_field / sort context

I had some difficulty implementing this grammar efficiently.
Ultimately, during script construction, for each member ("activities(role:some_backend).first(end).end" is one member) I construct and run a Lucene query to build a Map<CaseGlobalOrdinal -> targetPropertyGlobalOrdinal>
(in the example, the target properties are "start" and "end").
In the final FilterScript.execute(), which matches the Case docs, I use these maps to evaluate the expression.
If the same expression is used in several contexts (for example script_field and sort), the maps are built only once, because they are stored on and reused from the QueryShardContext.
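The two steps described above (build one map per member, then evaluate the expression per Case) can be sketched in plain Java. This is a simplified in-memory model under stated assumptions, not the real implementation: `Activity` and `CaseMaps` are hypothetical names, a `Predicate` stands in for the Lucene query, a plain int stands in for the Case global ordinal, and the maps hold the property value itself rather than its ordinal.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical in-memory stand-in for an indexed Activity document. */
final class Activity {
    final int caseOrd; // stands in for the global ordinal of the parent Case
    final String role;
    final long start;
    final long end;

    Activity(int caseOrd, String role, long start, long end) {
        this.caseOrd = caseOrd;
        this.role = role;
        this.start = start;
        this.end = end;
    }
}

final class CaseMaps {
    /** firstBy(end).end: for each Case, the smallest end among matching activities. */
    static Map<Integer, Long> firstEnd(List<Activity> docs, Predicate<Activity> query) {
        Map<Integer, Long> map = new HashMap<>();
        for (Activity a : docs) {
            if (query.test(a)) {
                map.merge(a.caseOrd, a.end, Long::min);
            }
        }
        return map;
    }

    /** lastBy(start).start: for each Case, the largest start among matching activities. */
    static Map<Integer, Long> lastStart(List<Activity> docs, Predicate<Activity> query) {
        Map<Integer, Long> map = new HashMap<>();
        for (Activity a : docs) {
            if (query.test(a)) {
                map.merge(a.caseOrd, a.start, Long::max);
            }
        }
        return map;
    }

    /** Evaluates "left - right > threshold" for one Case; false if either side is missing. */
    static boolean matches(int caseOrd, Map<Integer, Long> left, Map<Integer, Long> right, long threshold) {
        Long l = left.get(caseOrd);
        Long r = right.get(caseOrd);
        return l != null && r != null && l - r > threshold;
    }
}
```

The maps are built once, up front; the per-Case check is then two lookups and a subtraction, which is why reusing the maps across filter, sort and script_field contexts matters.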

I think we also discussed the possibility to denormalize your data

Yes, we discussed it. The idea was to duplicate all the Activity docs as nested docs of the Case doc, to get better performance.
But:

  • We need to query the Activity docs, scroll & scan, etc., so we need to keep the Activity docs as separate entities too. So it's a pure denormalization, and it would double the volume of our ES clusters.
  • In the previous (very simplified) example, we query Case docs using Activity-level criteria. But we also need to query Activity docs using Case-level criteria, and it's totally unthinkable to denormalize the whole Case on each Activity (it would lead to an exponential volume increase).

Franck
