Multiple pipelines or one pipeline with a lot of filters - Performance


#1

I have one Logstash server that listens for some tweets.
Let's say that I have a "global" twitter index.

Twitter input -> filter -> output: elasticsearch (twitter-%{+YYYY.MM.dd})

I now want to add a second twitter input with other keywords, a bounding box, etc. I want to save its events to another index pattern so that the index name can act as a pre-query, for performance.

Twitter 2 -> filter -> output: elasticsearch (twitter-specific-%{+YYYY.MM.dd})

With regard to Logstash performance (JVM, resources, etc.), is it better to have multiple pipelines in one Logstash instance (one pipeline per index pattern output) or one big pipeline with a lot of "if" conditionals in the filter and output sections of the Logstash config?


(Guy Boertje) #2

In my mind there are some considerations at play here.

A. The bulk index performance of the elasticsearch cluster.
B. The effective batch size reduction in conditional branches.
C. Whether the "global" index is a superset of the other specific indexes.

I'll explain batch fragmentation first, using a convenient batch size of 100.

  1. A single LS pipeline with 2 conditional blocks
    1. Assume 6 workers
    2. Two inputs fetch and create events distinguishable by type '-X' | '-Y'
    3. A batch of 100 of mixed events is pulled from the queue by a worker
    4. The batch arrives at the type == '-X' conditional, 50 matching events are selected and sent through the filters in this branch
    5. The batch arrives at the type == '-Y' conditional, 50 matching events are selected and sent through the filters in this branch
    6. The batch arrives at the general filters, 100 events are handled here
    7. One elasticsearch output is needed, the index is "twitter%{[type]}-%{+YYYY.MM.dd}"
    8. On the ES side, 100 events are bulk imported per worker, 6 X 100 = 600 events - in 6 bulk requests
  2. Two separate pipelines
    1. Assume 3 workers per pipeline
    2. One pipeline for "X" and one for "Y"
    3. Each pipeline has its own input
    4. A batch of 100 homogeneous events is pulled from the queue by a worker
    5. The batch arrives at the filters, 100 events are handled here
    6. Each pipeline has its own elasticsearch output, the index is "twitter-X-%{+YYYY.MM.dd}" and "twitter-Y-%{+YYYY.MM.dd}"
    7. In ES 100 events are bulk imported per worker, 2 X 3 X 100 = 600 events - in 6 bulk requests
  3. Three pipelines (using a feature coming in LS 6.3.0)
    1. Add a common 'collector' pipeline having a single ES output and a pipeline input
    2. The common 'collector' pipeline can have a batch size of 600, 1 worker and its own queue
    3. One pipeline for "X" and one for "Y" having a pipeline output pointing to the common pipeline's pipeline input
    4. All of 2.3 to 2.5 above occur here as well
    5. Each pipeline sends 100 events to the common pipeline but each has prepared a "[@metadata][es-index]" field with "twitter-X-%{+YYYY.MM.dd}" or "twitter-Y-%{+YYYY.MM.dd}"
    6. The common 'collector' pipeline takes 600 events from its queue and, using an index of "%{[@metadata][es-index]}", sends a single bulk request of 600.
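
As a rough illustration of solution 3, the pipeline configs might look something like the sketch below. The file names, the `collector` address, the keyword and the host are placeholders rather than values from this thread, and the twitter credentials are left out. The pipelines themselves (ids, config paths, workers and batch size per pipeline) would be declared in `pipelines.yml`, which is not shown here.

```logstash
# twitter-x.conf (hypothetical file name). The "Y" pipeline has the same
# shape, with its own input settings and "twitter-Y-..." as the index name.
input {
  twitter {
    # consumer_key, consumer_secret, oauth_token, oauth_token_secret omitted
    keywords => ["placeholder-keyword"]
  }
}
filter {
  # ... filters specific to "X" events go here ...
  mutate {
    # prepare the target index so the collector needs no conditionals
    add_field => { "[@metadata][es-index]" => "twitter-X-%{+YYYY.MM.dd}" }
  }
}
output {
  pipeline { send_to => ["collector"] }
}

# collector.conf (hypothetical file name), loaded as a separate pipeline
input {
  pipeline { address => "collector" }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]            # placeholder host
    index => "%{[@metadata][es-index]}"
  }
}
```

Solution 2 has the same shape without the collector: each of the two pipelines ends in its own elasticsearch output with a fixed index name instead of the pipeline output.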

Consideration A: ES Cluster bulk request performance.
In any of 1, 2, 3 above, any drop in ES bulk request performance will affect LS performance far more than the small differences in performance between solutions 1, 2 and 3.

Consideration B: The effective batch size reduction in conditional branches.
There is a minor overhead in preparing a subset batch that matches a conditional expression. More complex conditionals have a slightly higher hit on performance.
The two-pipeline solution (2) is easier to develop and test, as each pipeline can be built and tested separately before bringing them together in one LS instance. Any change to the event structure in one will not need to be considered in the other pipeline. On the downside, any common ongoing changes to the elasticsearch output (certs, passwords etc.) will need to be changed in two configs.
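
For comparison, the conditional layout of solution 1 that consideration B refers to could be sketched roughly as follows. The keyword, bounding box, tags and host are placeholders, not values from this thread, and the twitter credentials are left out.

```logstash
input {
  twitter {
    # credentials omitted
    keywords => ["placeholder-keyword"]
    type => "-X"
  }
  twitter {
    # credentials omitted
    locations => "-122.75,36.8,-121.75,37.8"   # placeholder bounding box
    type => "-Y"
  }
}
filter {
  if [type] == "-X" {
    # only the matching subset of the batch passes through this branch
    mutate { add_tag => ["x-branch"] }         # stand-in for the X filters
  }
  if [type] == "-Y" {
    mutate { add_tag => ["y-branch"] }         # stand-in for the Y filters
  }
  # general filters placed here see the whole batch
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]                # placeholder host
    index => "twitter%{[type]}-%{+YYYY.MM.dd}"
  }
}
```

Each `if` block is where the subset batch gets prepared, so the more branches and the more complex the expressions, the more of that overhead is paid per batch.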

Consideration C: Whether the "global" index is a superset of the other specific indexes.
Here I am wondering if the specific tweets are always seen in the global set - do they need to be fetched twice? If not, and it is possible to detect that a global tweet needs specific handling, then you can choose solution 1 with one input: clone that event and use conditionals to process the clone.
You get two increases in efficiency here: 1) fewer calls to twitter, 2) the bulk index size increases by the number of events cloned, i.e. you start with 100 global events then clone, say, 50 specific events; the batch size is now 150 events, which is what the elasticsearch output sends to the cluster.
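
A minimal sketch of the clone approach, under two assumptions that are not from this thread: the detection rule is a made-up `specific` tag set by some earlier filter, and the clone filter marks each copy by setting its `type` field to the clone name (understood to be its behaviour when ECS compatibility is disabled; with ECS enabled the name is reported to go into `tags` instead, so check the plugin documentation for your version).

```logstash
filter {
  # hypothetical rule: an earlier filter tagged tweets that need specific handling
  if "specific" in [tags] {
    clone { clones => ["specific"] }
  }

  if [type] == "specific" {
    # only the cloned copy reaches this branch
    mutate { add_field => { "[@metadata][es-index]" => "twitter-specific-%{+YYYY.MM.dd}" } }
  } else {
    # every original event still goes to the global index
    mutate { add_field => { "[@metadata][es-index]" => "twitter-%{+YYYY.MM.dd}" } }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]            # placeholder host
    index => "%{[@metadata][es-index]}"
  }
}
```

With this layout one twitter input feeds both indexes, and the clones ride along in the same bulk request as the originals.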

I built a performance measuring output for my own use when I made the dissect filter. You are welcome to use it but you need to clone the repo locally and build and install the gem/plugin.

