Best way to visualize health and structure of data processing pipeline

Dear Forum,

In one of our projects we want to monitor the components of several data processing pipelines.
Each of these pipelines has the following structure:

  • Connects to at least one data source
  • Connects to at least one data sink
  • Consists of several applications which consume from / publish to one or more Solace topics / queues

The monitoring shall be based on two data sources:

  • A repository with the configuration data of the pipelines as they should be (ideal world)
  • Application logs from each element with the real configuration (real world)

The goal is to visualize each pipeline as a kind of flow chart and also to show the health of each element in the pipeline (whether it is up and running or down / broken). The visualization should be embeddable in a dashboard.
Is there a Kibana app or a plugin which is suitable for this use case?

Hello Florian,

Welcome to this forum!

Are you using the open-source version of Elastic or the Basic (or higher) license? All non-open-source versions of the stack include a monitoring component. Basically, it stores data about load and status in an Elasticsearch index, which can then be shown in the Kibana menu "Stack Monitoring" or used as the source for a dashboard or a Canvas.

Best regards
Wolfram

Hi Wolfram,

thank you for your response. We own a license and can use the monitoring component. I went through the documentation, and this is what I took from it:

  • In order to make components available for monitoring, we have to provide the data via a Beats module.
  • We could use this to monitor the health of every connected application

Please correct me if I'm wrong in some aspects.

I do not yet know how we can use this for what we want to achieve. Maybe to clarify: when I mentioned data processing pipelines (DPPs) above, I did not mean Logstash pipelines. These DPPs are already running, and we want to have as little impact on them as possible. So the idea was to run a dockerized Logstash close to the DPPs. This Logstash instance in turn runs several Logstash pipelines which offer endpoints to which the applications of the DPPs can send their logs.
We are interested in two different kinds of logs:

  • Conventional application logging (for which our solution works fine so far)
  • Configuration data (component name, host, list of input channels, list of output channels)
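The log-ingestion idea above could be sketched as a minimal Logstash pipeline. This is only an illustration, not our actual configuration: the port, the added field, and the index name are made up, and the `http` input, `mutate` filter, and `elasticsearch` output are standard Logstash plugins.

```
input {
  http {
    port  => 8085         # hypothetical endpoint the DPP applications POST to
    codec => json         # applications send their configuration data as JSON
  }
}
filter {
  # Tag the event so the real-world configuration can later be compared
  # against the repository (ideal-world) data.
  mutate { add_field => { "log_kind" => "dpp_config" } }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "dpp-config-%{+YYYY.MM.dd}"
  }
}
```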

From the configuration data we then want to create a graph of the network structure, compare it to the ideal state, and see whether some components that should show up are missing, or vice versa.
I can see that we can get data about the host and maybe the component name using Beats modules, but how could we collect and visualize the input / output configuration?
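The ideal-vs-real comparison itself is just a set difference. A plain-Java sketch with made-up component names (in practice, the two sets would be filled from the configuration repository and from the parsed configuration logs):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PipelineDiff {
    public static void main(String[] args) {
        // Hypothetical component names; in practice these come from the
        // configuration repository (ideal) and the configuration logs (real).
        Set<String> ideal = new HashSet<>(Arrays.asList("ingest-app", "enrich-app", "sink-app"));
        Set<String> observed = new HashSet<>(Arrays.asList("ingest-app", "sink-app", "debug-app"));

        // Components that should report but have not been seen (possibly down).
        Set<String> missing = new HashSet<>(ideal);
        missing.removeAll(observed);

        // Components reporting that are not part of the ideal configuration.
        Set<String> unexpected = new HashSet<>(observed);
        unexpected.removeAll(ideal);

        System.out.println("missing=" + missing);
        System.out.println("unexpected=" + unexpected);
    }
}
```

The same idea extends to edges: model each (component, channel) pair as a set element and diff the input / output channel lists the same way.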

Some additional information:
The DPPs consist of the following components:

  • dockerized applications which send their logs via Filebeat
  • Spark applications
  • Solace topics / queues

Best,

Florian

Hello Florian,

I am sorry - when I read about pipelines I immediately thought of Logstash pipelines.

If I understand you correctly, you have applications loading data from various sources and transforming it before sending it to multiple destinations. You do not want to monitor the log pipelines but the data pipelines themselves, right?

In this case, my proposal does not help you in any way. Have you already looked at the APM solution? If your application is written in one of the supported languages (here), this could help you. Basically, you add an agent to your application - e.g. in Java it is likely that you do not even have to change the application - and it will send information about the application to an APM server, which transforms the data and stores it in an Elasticsearch index. This data can be shown in the Kibana APM app or be used to create custom dashboards. Depending on the technology used, it can, out of the box:

  • show all failed transactions
  • create a transaction for each incoming document
  • trace distributed transactions if there are multiple services in a chain
  • …
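For a JVM application, attaching the agent is typically just a matter of JVM flags, so the application code stays untouched. A sketch with placeholder paths, names, and URLs (the `elastic.apm.*` system properties are the agent's standard configuration options):

```
java -javaagent:/opt/elastic-apm-agent.jar \
     -Delastic.apm.service_name=my-dpp-app \
     -Delastic.apm.server_urls=http://apm-server:8200 \
     -Delastic.apm.environment=production \
     -jar my-dpp-app.jar
```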

Best regards
Wolfram

Hi Wolfram,

thanks, I didn't know about APM, but it does indeed look promising. Actually, a service map comes very close to what we want, as it visualizes the current state and structure of our DPPs. We would then only need to find a way to compare it to our target / ideal situation.
However there are two potential obstacles:

  • We use Scala and it does not yet seem to be amongst the supported APM agents
  • We run our jobs on Spark, and I'm not sure how APM agents will work on distributed systems. Will each element of the cluster be visible as a separate component in the service map?

Do you have experience with any of these?
Thanks again for pointing me into the right direction and have a nice weekend,
Florian

Hello Florian,

Unfortunately, I have experience with neither Spark nor Scala.

From what I have read, Scala is compiled to Java bytecode, so the Java agent might work, even if you may have to manually instrument the code if the libraries you use are not yet supported (see here).
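Manual instrumentation uses the agent's public API (the `co.elastic.apm:apm-agent-api` dependency, so this sketch only compiles with that on the classpath); the transaction name here is made up:

```java
import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Transaction;

// Wrap one unit of work in an APM transaction so it shows up in Kibana
// even when the surrounding framework is not auto-instrumented.
Transaction transaction = ElasticApm.startTransaction();
try {
    transaction.setName("process-record");        // hypothetical name
    transaction.setType(Transaction.TYPE_REQUEST);
    // ... actual processing of the record ...
} catch (Exception e) {
    transaction.captureException(e);
    throw e;
} finally {
    transaction.end();
}
```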

I do not know Spark, but distributed environments are where APM shines, as it shows the relations in a service map, shows calls to other services within the same trace, and so on. For each application installation you can configure the service name and the environment, so you could have X installations of the same application shown as one entry in APM for Y different environments like DEV/QA/PROD. I cannot say how this shows up in the service maps, as we do not have the Platinum license that is required for them.

Best regards
Wolfram

Hi Wolfram,

We will go for the APM & Service Map solution. Thank you again for your help, it was highly valuable.
Best,

Florian