Dec 16th, 2023: [EN] Symphony of Efficiency: AIOps Orchestrating Operational Excellence

This post is also available in Portuguese.

Before we explore AIOps, let’s clarify a few key concepts behind some (not all 😅) of the different Ops:

  • DevOps: DEV + OPS

You've probably already heard of DevOps. It is a methodology that integrates the work of the software development team (Dev) and the Operations team (Ops) by facilitating delivery through collaboration and automation.

  • DataOps: DATA + OPS

As data keeps growing and brings new challenges with it, DataOps can be understood as the application of DevOps principles and practices to data. It is a process that gets the right data to the right place by managing the entire data lifecycle.

  • MLOps: ML + DEV + OPS

Similarly, MLOps (Machine Learning Operations) can be viewed as the application of DevOps principles to machine learning pipelines, where cross-functional collaboration operationalizes machine learning, ensuring the reliability and performance of ML models.

  • AIOps: AI + DEVOPS

AIOps (Artificial Intelligence for IT Operations) joins this mix of terms and is also related to AI/ML. However, while MLOps focuses on the development and deployment of ML models, AIOps concentrates on managing IT operations, applying artificial intelligence (AI) capabilities to optimize business outcomes.

AIOps is not a replacement for DevOps! It is an evolution within the same cycle. AIOps uses Artificial Intelligence to automate, simplify, accelerate and optimize IT operations processes.

So, what can I do with AIOps?

It all starts with data…

This can include:

  • Logs, metrics and traces
  • Performance and event data
  • Infrastructure and network data
  • Application data
  • Incident-related data
  • Historical data

To get started, open Kibana and go to Analytics -> Machine Learning, where you will see different AI capabilities:

Anomaly detection constructs a probability model based on data patterns, and you can run the job continuously to identify abnormal events over time. With it you can identify anomalies and generate alerts, so you can resolve issues early or avoid them altogether.

Examples of what you can do include, but aren't limited to:

  • Identify anomalies in categorized log entries based on pattern values
  • Estimate the probability of a time series value occurring at a future date
  • Identify fields that influence or contribute to anomalies
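
To make the first and third examples more concrete, here is a minimal sketch of a request body for the create anomaly detection job API (`PUT _ml/anomaly_detectors/<job_id>`). The helper name, the `message` and `host.name` fields, and the 15-minute bucket span are assumptions picked for illustration; adapt them to your own data.

```python
import json


def build_log_category_job(message_field="message", time_field="@timestamp",
                           influencers=("host.name",), bucket_span="15m"):
    """Return a job body that counts log events per learned log category.

    Hypothetical helper: it only builds the JSON body, it does not call
    Elasticsearch.
    """
    return {
        "description": "Unusual counts of categorized log messages",
        "analysis_config": {
            "bucket_span": bucket_span,
            # Log messages in this field are grouped into categories.
            "categorization_field_name": message_field,
            "detectors": [
                # "mlcategory" refers to the categories learned from the field above.
                {"function": "count", "by_field_name": "mlcategory"}
            ],
            # Fields that may be reported as contributing to an anomaly.
            "influencers": list(influencers),
        },
        "data_description": {"time_field": time_field},
    }


if __name__ == "__main__":
    print(json.dumps(build_log_category_job(), indent=2))
```

The printed JSON can be pasted into Dev Tools as the body of the request; the same job can also be created from the Machine Learning UI.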

Data frame analytics performs multivariate analysis, letting you analyze your data using classification, outlier detection, and regression algorithms. With it you can predict different classes or categories based on your fields, detect data points that are significantly different from other values, and estimate the relationships among different fields in your data.

Examples of what you can do include, but aren't limited to:

  • Predict error categories based on historical logs
  • Estimate the relationships among different metrics to understand how changes in one metric may influence others
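
As a rough sketch of the first example, the snippet below builds a body for the create data frame analytics job API (`PUT _ml/data_frame/analytics/<job_id>`) using classification. The index names and the `error.category` field are hypothetical, and the helper itself is only an illustration.

```python
import json


def build_classification_job(source_index, dest_index, dependent_variable,
                             training_percent=80):
    """Return a classification job body (hypothetical helper, no network calls)."""
    if not 1 <= training_percent <= 100:
        raise ValueError("training_percent must be between 1 and 100")
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "analysis": {
            "classification": {
                # The field whose class the model learns to predict.
                "dependent_variable": dependent_variable,
                # Share of eligible documents used for training.
                "training_percent": training_percent,
            }
        },
    }


if __name__ == "__main__":
    body = build_classification_job(
        source_index="historical-logs",        # assumed index name
        dest_index="historical-logs-predicted",  # assumed index name
        dependent_variable="error.category",   # assumed field name
    )
    print(json.dumps(body, indent=2))
```

For the second example, swapping `classification` for `regression` with a numeric `dependent_variable` would follow the same shape.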

Natural Language Processing allows you to interpret and manipulate human language text. You can import and deploy trained models into Elasticsearch; the models will be available under ‘Trained Models’, and you can use them to enrich your data.

Examples of what you can do include, but aren't limited to:

  • Enrich incident-related data with sentiment analysis values
  • Identify the language and translate support tickets
  • Analyze and correlate unstructured text data, identifying keywords, and considering ambiguities and context
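
One way to do the enrichment from the first example is an ingest pipeline with an `inference` processor. The sketch below only builds the pipeline body (for `PUT _ingest/pipeline/<name>`); the model ID `my-sentiment-model`, the `message` source field, and the `ml.sentiment` target field are assumptions, and the model must already be deployed under ‘Trained Models’.

```python
import json


def build_sentiment_pipeline(model_id, source_field="message",
                             target_field="ml.sentiment"):
    """Return an ingest pipeline body with one inference processor.

    Hypothetical helper: builds JSON only, does not contact Elasticsearch.
    """
    return {
        "description": "Enrich documents with a sentiment prediction",
        "processors": [
            {
                "inference": {
                    "model_id": model_id,
                    "target_field": target_field,
                    # Map the document field to the input name the NLP model expects.
                    "field_map": {source_field: "text_field"},
                }
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_sentiment_pipeline("my-sentiment-model"), indent=2))
```

Documents indexed through this pipeline would carry the model output under the target field, next to the original text.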

AIOps Labs provides statistical methods to help you interpret your data and its behavior. With log rate analysis you can identify reasons for increases or decreases in log rates, with log pattern analysis you can find patterns in log messages, and with change point detection you can detect change points in a metric of your time series data.
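
Outside the UI, Elasticsearch also offers a `change_point` aggregation that can be used in a search request. The sketch below builds such a request body; the metric field, the one-hour interval, and the aggregation names are assumptions for illustration, and this is not necessarily the exact query the AIOps Labs page runs.

```python
import json


def build_change_point_query(metric_field, interval="1h",
                             time_field="@timestamp"):
    """Return a search body that looks for a change point in an averaged metric.

    Hypothetical helper: builds JSON only.
    """
    return {
        "size": 0,  # only aggregation results are needed
        "aggs": {
            "over_time": {
                "date_histogram": {"field": time_field, "fixed_interval": interval},
                "aggs": {"metric_avg": {"avg": {"field": metric_field}}},
            },
            # Sibling pipeline aggregation reading the per-bucket averages.
            "change_points": {
                "change_point": {"buckets_path": "over_time>metric_avg"}
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_change_point_query("system.cpu.total.pct"), indent=2))
```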

The AI Assistant, powered by a connector for OpenAI, can also contribute to your AIOps strategy. It uses OpenAI gpt-4+ to explain error messages and suggest remediation, and to request, analyze, and visualize your data.

Examples of what you can do include, but aren't limited to:

  • Get contextual information. Upon identifying statistically significant contributors to log spikes through log rate analysis, the AI Assistant explains potential causes and suggests effective remediations.

  • Have conversations with the AI Assistant. Add external information to the assistant's knowledge base and get additional real-time information and insights.

The AI Assistant executes the summarize function, which is designed to summarize content from the conversation, and stores the result.

{
  "name": "summarize",
  "args": {
    "id": "log_rate_spike_test",
    "text": "The log rate spike in the PostgreSQL database running in a Kubernetes environment was a test and has been resolved. It should not be considered as an issue in the future.",
    "is_correction": false,
    "confidence": "high",
    "public": true
  }
}

Next, it executes the 'recall' function, which is designed to retrieve previous learnings. The AI Assistant uses ELSER (Elastic Learned Sparse EncodeR), Elastic’s semantic search model, to recall data from its internal knowledge base index and create retrieval augmented generation (RAG) responses.

Note that the text is now different and includes the updated information.

You can also ingest external data (GitHub issues, Markdown files, Jira tickets, text files, etc.) into Elasticsearch and reindex your data into the AI Assistant’s knowledge base.

POST _reindex
{
    "source": {
        "index": "<InternalDocsIndex>", //name of the index where your internal documents are stored
        "_source": [
            "<text_field>", //name of the field containing your internal documents' text.
            "<timestamp>", //name of the timestamp field in your internal documents.
            "namespace",
            "is_correction",
            "public", //true or false. If true, the document is available to users in the space defined in the following space field or in all spaces if no space is defined. If false, the document is restricted to the user indicated in the following user.name field.
            "confidence"
        ]
    },
    "dest": {
        "index": ".kibana-observability-ai-assistant-kb-000001",
        "pipeline": ".kibana-observability-ai-assistant-kb-ingest-pipeline" //this pipeline contains the Elastic Learned Sparse EncodeR model.
    },
    "script": {
        "inline": "ctx._source.text = ctx._source.remove(\"<text_field>\");ctx._source.namespace=\"<space>\";ctx._source.is_correction=false;ctx._source.public=<public>;ctx._source.confidence=\"high\";ctx._source['@timestamp'] = ctx._source.remove(\"<timestamp>\");ctx._source['user.name'] = \"<user.name>\""
    }
}
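
If you prefer to generate this request rather than edit the placeholders by hand, the sketch below builds an equivalent body in Python. It is a variation on the request above, not the same script: values are passed through script `params` instead of being written into the script text, and the `source` key is used for the script. The helper name and the example argument values are hypothetical.

```python
import json

KB_INDEX = ".kibana-observability-ai-assistant-kb-000001"
KB_PIPELINE = ".kibana-observability-ai-assistant-kb-ingest-pipeline"

# Painless script mirroring the one above, reading values from params.
SCRIPT = (
    "ctx._source.text = ctx._source.remove(params.text_field);"
    "ctx._source.namespace = params.space;"
    "ctx._source.is_correction = false;"
    "ctx._source.public = params.is_public;"
    "ctx._source.confidence = 'high';"
    "ctx._source['@timestamp'] = ctx._source.remove(params.timestamp_field);"
    "ctx._source['user.name'] = params.user_name;"
)


def build_kb_reindex(source_index, text_field, timestamp_field,
                     space, user_name, is_public):
    """Return a _reindex body that copies internal docs into the knowledge base."""
    return {
        "source": {
            "index": source_index,
            "_source": [text_field, timestamp_field, "namespace",
                        "is_correction", "public", "confidence"],
        },
        "dest": {"index": KB_INDEX, "pipeline": KB_PIPELINE},
        "script": {
            "source": SCRIPT,
            "params": {
                "text_field": text_field,
                "timestamp_field": timestamp_field,
                "space": space,
                "user_name": user_name,
                "is_public": bool(is_public),
            },
        },
    }


if __name__ == "__main__":
    body = build_kb_reindex("internal-docs", "body", "created_at",
                            "default", "elastic", True)  # example values only
    print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of `POST _reindex`; check the knowledge base index and pipeline names against your own deployment first, since they may differ between versions.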

As you can see, you can include AIOps alongside DevOps in your observability strategy with Elasticsearch, and implement strategies like these to enhance efficiency, address issues proactively, and keep improving your system's performance and reliability.

In the Symphony of Efficiency, AIOps orchestrates operational excellence so you can enjoy a harmonious holiday season.

Happy Holidays!
