Scalable Logstash Config Generator


(Justin Bovee) #1

I have been using Logstash for over 2 years and love it.

One frustration I have had is scaling Logstash across multiple (100+) different data types. (These are not just dissimilar syslog types but completely different kinds of data ingestion.)

  • Logstash configuration (even when split across multiple files) becomes unwieldy.
    • My current config is nearing 10,000 lines.
  • Bleed-over between data types can easily occur when data set code is mislabeled.
  • Utilizing Kafka topics and multiple separate Logstash pipelines doesn't scale well. I need auto-scaling because of my current data set.

My current solution is a mix of multiple Kafka topics and combining multiple data sets in the same config.

Here is what I want to do (I have written a little code for this but have been distracted by other ideas as well), and I am curious whether anyone else in the community has already attempted this so that I can maybe tie in to help.

A Logstash / Docker configuration runner that allows auto scaling.

Basically, the following...

  • Utilize a "Logstash Module" for different data sets.
    • A module would contain all configuration, dictionaries, and grok expressions for a single data set. A plugin manifest would be included to list the Logstash plugins needed to run the config.
  • A REST service that takes a JSON object specifying data input and output methods, modules for config, and environment variables
  • A config generator based on requested Logstash modules
  • A service that tests generated configs from the modules.
  • A Docker orchestrator that spins up one or more Logstash containers with the necessary plugins and applies the generated config from the generator. This will allow me to autoscale Logstash containers based on, for instance, the queue in Kafka.
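
As a rough illustration of the REST payload and config generator items above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the payload field names (`input`, `output`, `modules`), the `MODULE_FILTERS` stand-in for per-module filter snippets, and the choice to guard each module's filters with an `if [type] == "..."` conditional to limit bleed-over. Environment variables are left out to keep it short.

```python
import json

# Hypothetical request body, as it might look after JSON decoding.
REQUEST = {
    "input": {"plugin": "kafka", "options": {"topics": ["firewall"]}},
    "output": {"plugin": "elasticsearch", "options": {"hosts": ["es:9200"]}},
    "modules": ["firewall"],
}

# Stand-in for the filter sections that would be read from each module.
MODULE_FILTERS = {
    "firewall": 'grok { match => { "message" => "%{SYSLOGLINE}" } }',
}


def render_value(value):
    """Render a Python value as a Logstash config literal."""
    if isinstance(value, list):
        return "[" + ", ".join(render_value(v) for v in value) + "]"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    return json.dumps(str(value))  # double-quoted string


def render_plugin(spec):
    """Render one input or output plugin block, indented inside its section."""
    options = "\n".join(
        f"    {key} => {render_value(val)}" for key, val in spec["options"].items()
    )
    return f"  {spec['plugin']} {{\n{options}\n  }}"


def generate_config(request, module_filters):
    """Assemble input, per-module filters, and output into one pipeline config."""
    blocks = []
    for name in request["modules"]:
        if name not in module_filters:
            raise KeyError(f"unknown module: {name}")
        body = "\n".join("    " + line for line in module_filters[name].splitlines())
        # Assumption: events carry a `type` field that names their module.
        blocks.append(f'  if [type] == "{name}" {{\n{body}\n  }}')
    parts = [
        "input {", render_plugin(request["input"]), "}",
        "filter {", *blocks, "}",
        "output {", render_plugin(request["output"]), "}",
    ]
    return "\n".join(parts) + "\n"


CONFIG = generate_config(REQUEST, MODULE_FILTERS)
print(CONFIG)
```

Because each module's filters are wrapped by the generator rather than by hand, a mislabeled conditional in one data set cannot leak into another.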

Thoughts??
Are there any similar or related Logstash projects?


(Mark Walkom) #2

Sounds like an awesome idea, I've pointed this thread to a few people internally :slight_smile:


(Alvin Chen) #3

Hi Justin,

Thanks for your contributions in the past with the language filter.

You raise a set of important topics, and we do indeed have ideas around them.

  1. First is around the LS config getting difficult to manage as it grows. Using LS with an assortment of different data types can get tough given the number of conditionals required to facilitate proper data flows. In the future, we plan to introduce a multiple pipelines feature, which would enable you to run multiple isolated pipelines in a single LS instance. This could help you break up very large configs into smaller ones, potentially segregated by data type, that run in isolation while still sharing the same underlying resources.

  2. The concept of "Logstash Modules" is interesting, and we've thought about it. We are actually working on something called Filebeat Modules (see here), which would be a prepackaged solution across the stack (Filebeat config, LS/ingest node config, and Kibana dashboard) for specific industry log files.

  3. Config management is still often done today with the aid of an external tool like Puppet or Chef. How are you doing this today? I don't have any specifics around Docker, but we are working on a couple items on the config management front. It's more around managing LS configs in a central location and creating a workflow to facilitate config changes.

Happy to chat further if you have additional thoughts or requests.

Alvin


(Justin Bovee) #4

Hi Alvin,

Always great to have further conversations. I am not really a dev. Just a Systems guy who develops on need so please correct any problems you may see. :wink:

As to your question in 3.
We currently use Puppet to run the environment.
We ingest using Kafka to separate data types by topic.
Currently, config generation is a manual process.

  • Each data type with a heavy load pattern has its own Logstash config, containing no other data types, and is deployed to multiple workers. It has its own Kafka topic.
  • Multiple data types, each with a light load, share a combined config that is then deployed to multiple workers pulling from Kafka.
  • I test new data sources with a single config build and then, based on expected load, add each one where it fits in the mix.
  • Depending on time scales, source, or load, I may not include the Kafka pipeline in the data type config.

The need to autoscale based on load patterns, the problems with code mistakes creating bleed-over, and large configs that are difficult to review are the reasons I started down this path. My original post has a small breakdown of what I have partially built.

Below is what I was thinking the module would look like:

  • /module-name
    • /configs
      • All filter sections for processing a specific data type.
    • /dictionaries
      • Any file dictionaries utilized by logstash-filter-translate or other plugins that utilize a local file for data translation. (I have been working on a system to move these to a centralized system and querying them on need)
    • /templates
      • This is kinda up in the air. I have been playing with it as a schema store for the module; it currently includes Elastic templates for the data type.
    • /patterns
      • Custom grok patterns specific to the data type
    • plugins.json
      • In this file I list any necessary plugins as well as sources. This way it can pull from external and internal repos.
    • {module_name}.rb
      • This is the module class file. It includes specific module functions for
        • Setting up the config
        • Building the docker container
        • Handling any variables passed to the module
        • Testing the module. :slight_smile:
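
To make the layout above concrete, here is a minimal Python sketch of loading one module directory. It is only an illustration, not the Ruby class file described in the list: the `LogstashModule` and `load_module` names, the `*.conf` file naming, and the `plugins.json` shape (a `plugins` list of `name`/`source` entries) are all assumptions. The demo builds a throwaway module tree in a temporary directory so that it runs on its own.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class LogstashModule:
    name: str
    filters: str                               # concatenated configs/*.conf
    plugins: list = field(default_factory=list)
    patterns_dir: Optional[Path] = None        # custom grok patterns, if present


def load_module(root):
    """Read one module directory laid out as in the list above."""
    root = Path(root)
    # Sorting gives a predictable filter order, e.g. 10-parse before 20-tag.
    config_files = sorted((root / "configs").glob("*.conf"))
    if not config_files:
        raise FileNotFoundError(f"no filter configs found under {root / 'configs'}")
    filters = "\n".join(path.read_text().rstrip() for path in config_files)

    manifest_path = root / "plugins.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}

    patterns = root / "patterns"
    return LogstashModule(
        name=root.name,
        filters=filters,
        plugins=manifest.get("plugins", []),
        patterns_dir=patterns if patterns.is_dir() else None,
    )


# Demo: build a hypothetical "firewall" module on disk and load it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "firewall"
    (root / "configs").mkdir(parents=True)
    (root / "patterns").mkdir()
    (root / "configs" / "10-parse.conf").write_text(
        'grok { match => { "message" => "%{SYSLOGLINE}" } }\n'
    )
    (root / "configs" / "20-tag.conf").write_text(
        'mutate { add_tag => ["firewall"] }\n'
    )
    (root / "plugins.json").write_text(json.dumps({
        "plugins": [
            {"name": "logstash-filter-translate", "source": "https://rubygems.org"}
        ]
    }))
    module = load_module(root)

print(module.name, [plugin["name"] for plugin in module.plugins])
```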

I played with including ingestion and output in a module but decided those are better set by the person setting up the config in the POST payload.

The service would utilize a Dockerfile that would be customized on each build with the necessary Logstash plugins, based on the needs of the requested modules. Whether the returned payload should be a Dockerfile with a generated config file, or whether the service should spin up the container with the config on our Docker platform, is still a debate. Once I had this, I could then set up another service to monitor queueing and spin up more containers per data type, based on incoming or expected load, in an automated way.
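
The two pieces described above (a per-build Dockerfile and a queue-based scaling decision) could be sketched roughly as follows. This is a hedged illustration only: `render_dockerfile` and `desired_replicas` are hypothetical names, the paths assume the layout of the official Logstash image (`/usr/share/logstash`), the base image tag is left as a placeholder, and how the Kafka consumer lag is measured is out of scope.

```python
import math


def render_dockerfile(base_image, plugins, config_name="logstash.conf"):
    """Render a Dockerfile that installs the given plugins and copies the config.

    Assumes the official image layout, where Logstash lives in
    /usr/share/logstash and reads pipelines from /usr/share/logstash/pipeline.
    """
    lines = [f"FROM {base_image}"]
    for plugin in sorted(set(plugins)):  # de-duplicate plugins shared by modules
        lines.append(f"RUN /usr/share/logstash/bin/logstash-plugin install {plugin}")
    lines.append(f"COPY {config_name} /usr/share/logstash/pipeline/logstash.conf")
    return "\n".join(lines) + "\n"


def desired_replicas(consumer_lag, events_per_replica, min_replicas=1, max_replicas=10):
    """Pick a container count from the current queue depth, within fixed bounds."""
    if events_per_replica <= 0:
        raise ValueError("events_per_replica must be positive")
    wanted = math.ceil(consumer_lag / events_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


dockerfile = render_dockerfile(
    "docker.elastic.co/logstash/logstash:<version>",  # placeholder tag
    ["logstash-filter-translate", "logstash-filter-translate"],
)
print(dockerfile)
print(desired_replicas(consumer_lag=2500, events_per_replica=1000))
```

Clamping between a minimum and a maximum keeps a lag spike from starting an unbounded number of containers, and keeps one worker alive when the queue is empty.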

This should give you a bit more on what I have been playing with.


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.