ES 2.0 - Pipeline aggregations for logging users?


(Horst Birne) #1

Hey guys,

I want to give some feedback on pipeline aggregations and the examples that have been presented so far.

When I first read about pipeline aggregations I thought: "Oh, very cool - now I don't need shell scripts for more complex searches".
In my mind a pipeline aggregation was: "take the results of the first aggregation and put them into a parameter of the next aggregation", something like

agg1 | agg2 | agg3

but then I read that your concept is that pipeline aggregations don't perform any further searches on the shards, but rather parse the results of the initial aggregation and do some processing on them (at least in the examples this mostly involved fields with numeric values).
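For reference, a pipeline aggregation in that documented style might look roughly like this (a sketch - the index layout and field names `@timestamp`/`bytes` are made up for illustration). The `avg_bucket` step runs no new search; it only post-processes the buckets its sibling aggregation already produced:

```python
# Sketch of an ES 2.0-style request body with a sibling pipeline aggregation.
# The avg_bucket agg reads the numeric output of per_day>daily_bytes via
# buckets_path instead of querying the shards again.
request_body = {
    "size": 0,
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "@timestamp", "interval": "day"},
            "aggs": {
                "daily_bytes": {"sum": {"field": "bytes"}}
            }
        },
        "avg_daily_bytes": {
            # Pipeline aggregation: averages the daily sums computed above
            "avg_bucket": {"buckets_path": "per_day>daily_bytes"}
        }
    }
}
```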

Let me give an example of what I thought you could do with pipeline aggregations:

A sysadmin wants to know which IP addresses have accessed port x, port y and port z.

If you want to achieve this now, you could do an aggregation on the IP field with port:x as the query string, take the resulting IP addresses and put them into a filter for the next aggregation with port:y as the query string, then do the same again with the results of the second aggregation and feed them into a last aggregation with port:z as the query string.

This method works, but it requires scripting on the client side.
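The client-side narrowing described above can be sketched in plain Python (in-memory records stand in for documents here; in practice each step would be a terms aggregation on the IP field, with the previous step's IPs fed into a filter - the field and port values are just illustrative):

```python
# Client-side sketch of the three-step port:x -> port:y -> port:z narrowing.
logs = [
    {"ip": "10.0.0.1", "port": 22},
    {"ip": "10.0.0.1", "port": 80},
    {"ip": "10.0.0.1", "port": 443},
    {"ip": "10.0.0.2", "port": 22},
    {"ip": "10.0.0.2", "port": 80},
    {"ip": "10.0.0.3", "port": 443},
]

def ips_for_port(port, candidates=None):
    """One 'aggregation' step: distinct IPs that accessed `port`,
    optionally restricted to the candidate set from the previous step."""
    return {r["ip"] for r in logs
            if r["port"] == port and (candidates is None or r["ip"] in candidates)}

step1 = ips_for_port(22)           # aggregation with port:x query
step2 = ips_for_port(80, step1)    # filter to step1's IPs, port:y query
step3 = ips_for_port(443, step2)   # filter to step2's IPs, port:z query
print(step3)  # only 10.0.0.1 accessed all three ports
```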

I know that ES was made for various types of use cases, and I do understand that the current concept of pipeline aggregations is very suitable for many of them.

I can only speak from the perspective of an "ES logging" user, so what's your opinion on this?


(Colin Goodheart-Smithe) #2

Doing this in a single request would not scale well with a large number of IP addresses, since we would have to keep a lot of interim state (as you are presumably doing in your client-side scripts). Also, if you are asking for all IP addresses each time (size: 0 on the terms agg) then you will be putting a lot of memory pressure on Elasticsearch as the number of IP addresses grows. To make this more scalable you may want to look at this video on entity-centric indexing by @Mark_Harwood. By also maintaining an index which stores one document per IP address, with fields for attributes such as ports_accessed, you can get your list of IPs which have accessed ports x, y, and z with a simple boolean filter.
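With an entity-centric index like that, the whole question collapses into one filtered query along these lines (a sketch only - the ports_accessed field and the concrete port values are assumptions for illustration):

```python
# Sketch of a bool filter query against an entity-centric index that holds
# one document per IP address with a multi-valued ports_accessed field.
# A document matches only if it contains all three port values.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"ports_accessed": port}}
                for port in (22, 80, 443)  # port x, y, z
            ]
        }
    }
}
```

No interim state is needed client-side; the three conditions are evaluated together in a single request.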

HTH


(Horst Birne) #3

Thanks for your reply, I'm going to have a look at it.

That said, the example with the accessed ports was just one of many use cases where a pipeline aggregation that allows chaining terms aggregations would be quite useful (I am thinking of the possibility for Kibana 4 to have a powerful query syntax, again just like query1 | query2).

As I understand it, this sort of query would be very expensive and might run for a few minutes depending on the amount of data, but just having such an option would, in my opinion, increase the usability of Kibana.

As I said earlier, this is just our perspective, with a limited view of all the other use cases for ES/Kibana.


(system) #4