I want to add a new twitter input with different keywords, a bounding box, etc. I want to save its events to another index pattern, so that the index name acts as a pre-query for performance.
In terms of Logstash performance (JVM, resources, etc.), is it better to have multiple pipelines in one Logstash instance (one pipeline per specific index pattern output), or one big pipeline with a lot of "if" conditionals in the filter and output sections of the Logstash config?
In my mind there are some considerations at play here.
A. The bulk index performance of the elasticsearch cluster.
B. The effective batch size reduction in conditional branches.
C. Whether the "global" index should be a superset of the other specific indexes.
I'll explain batch fragmentation first, using a convenient batch size of 100.
1. A single LS pipeline with two conditional blocks
1.1. Assume 6 workers.
1.2. Two inputs fetch and create events distinguishable by type '-X' or '-Y'.
1.3. A batch of 100 mixed events is pulled from the queue by a worker.
1.4. The batch arrives at the type == '-X' conditional; 50 matching events are selected and sent through the filters in this branch.
1.5. The batch arrives at the type == '-Y' conditional; 50 matching events are selected and sent through the filters in this branch.
1.6. The batch arrives at the general filters; all 100 events are handled here.
1.7. One elasticsearch output is needed; the index is "twitter%{[type]}-%{+YYYY.MM.dd}".
1.8. On the ES side, 100 events are bulk indexed per worker: 6 x 100 = 600 events, in 6 bulk requests.
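As a sketch, the single-pipeline solution above could look like this. The credentials, keywords, and hosts are placeholders, not values from the question:

```
input {
  twitter {
    consumer_key       => "<key>"
    consumer_secret    => "<secret>"
    oauth_token        => "<token>"
    oauth_token_secret => "<token-secret>"
    keywords           => ["keyword-for-x"]
    type               => "-X"
  }
  twitter {
    # second input with its own keywords / bounding box
    consumer_key       => "<key>"
    consumer_secret    => "<secret>"
    oauth_token        => "<token>"
    oauth_token_secret => "<token-secret>"
    keywords           => ["keyword-for-y"]
    type               => "-Y"
  }
}

filter {
  if [type] == "-X" {
    # branch filters: only the matching subset of the batch passes through
  }
  if [type] == "-Y" {
    # branch filters for the other subset
  }
  # general filters: every event in the batch is handled here
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "twitter%{[type]}-%{+YYYY.MM.dd}"
  }
}
```

Note that a Logstash date reference in a sprintf string needs the leading plus: `%{+YYYY.MM.dd}`.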
2. Two separate pipelines
2.1. Assume 3 workers per pipeline.
2.2. One pipeline for "X" and one for "Y".
2.3. Each pipeline has its own input.
2.4. A batch of 100 homogeneous events is pulled from the queue by a worker.
2.5. The batch arrives at the filters; all 100 events are handled here.
2.6. Each pipeline has its own elasticsearch output; the indexes are "twitter-X-%{+YYYY.MM.dd}" and "twitter-Y-%{+YYYY.MM.dd}".
2.7. In ES, 100 events are bulk indexed per worker: 2 x 3 x 100 = 600 events, in 6 bulk requests.
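The two-pipeline solution is wired up in pipelines.yml; the pipeline ids and config paths below are hypothetical:

```
# pipelines.yml
- pipeline.id: twitter-x
  pipeline.workers: 3
  path.config: "/etc/logstash/twitter-x.conf"
- pipeline.id: twitter-y
  pipeline.workers: 3
  path.config: "/etc/logstash/twitter-y.conf"
```

Each config file then has its own twitter input, its own filters, and an elasticsearch output with a fixed index, e.g. `index => "twitter-X-%{+YYYY.MM.dd}"` in twitter-x.conf.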
3. Three pipelines (using pipeline-to-pipeline communication, a feature coming in LS 6.3.0)
3.1. Add a common 'collector' pipeline having a single ES output and a pipeline input.
3.2. The common 'collector' pipeline can have a batch size of 600, 1 worker, and its own queue.
3.3. One pipeline for "X" and one for "Y", each having a pipeline output pointing to the common pipeline's pipeline input.
3.4. All of 2.3 to 2.5 above occurs here as well.
3.5. Each pipeline sends 100 events to the common pipeline, but each has prepared a "[@metadata][es-index]" field with "twitter-X-%{+YYYY.MM.dd}" or "twitter-Y-%{+YYYY.MM.dd}".
3.6. The common 'collector' pipeline takes 600 events from its queue and, using "%{[@metadata][es-index]}" as the index, sends a single bulk request of 600 events.
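A sketch of the collector arrangement, assuming pipeline-to-pipeline communication is available and that `@metadata` is carried across the in-process transfer; the address and hosts are placeholders:

```
# twitter-x pipeline (twitter-y is analogous, preparing "twitter-Y-...")
filter {
  mutate {
    add_field => { "[@metadata][es-index]" => "twitter-X-%{+YYYY.MM.dd}" }
  }
}
output {
  pipeline { send_to => ["collector"] }
}

# collector pipeline: batch size 600, 1 worker, single ES output
input {
  pipeline { address => "collector" }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "%{[@metadata][es-index]}"
  }
}
```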
Consideration A: ES Cluster bulk request performance.
Whichever of 1, 2, or 3 you choose, any drop in ES bulk request performance will affect LS performance far more than the small performance differences between the solutions themselves.
Consideration B: The effective batch size reduction in conditional branches.
There is a minor overhead in preparing the subset batch that matches a conditional expression, and more complex conditionals have a slightly higher performance cost.
The two-pipeline solution (2) is easier to develop and test, as each pipeline can be built and tested separately before bringing them together in one LS instance. Any change to the event structure in one pipeline will not need to be considered in the other. On the downside, any common ongoing changes to the elasticsearch output (certs, passwords, etc.) will need to be changed in two configs.
Consideration C: Whether the "global" index should be a superset of the other specific indexes.
Here I am wondering whether the specific tweets are always seen in the global set: do they need to be fetched twice? If they do not, and it is possible to detect that a global tweet needs specific handling, then you can choose solution 1 with a single input, clone that event, and use conditionals to process the clone.
You get two efficiency gains here: 1) fewer calls to twitter, and 2) the bulk index size increases by the number of cloned events. That is, you start with 100 global events, clone, say, 50 specific events, and the batch size is now 150 events, which is what the elasticsearch output sends to the cluster.
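That clone approach could be sketched like this; the detection condition is a placeholder for whatever marks a global tweet as needing specific handling:

```
filter {
  # placeholder condition: detect that this global tweet needs specific handling
  if "some-marker" in [message] {
    # the clone filter emits a copy of the event with type set to "specific";
    # the original continues through the global path unchanged
    clone { clones => ["specific"] }
  }
  if [type] == "specific" {
    # extra filters applied only to the cloned events
  }
}
```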
I built a performance-measuring output for my own use when I made the dissect filter. You are welcome to use it, but you will need to clone the repo locally and build and install the gem/plugin.