Dec 12th, 2022: [EN] Building Machine Learning models using the Elastic Stack

When someone says Elastic, most people think of data storage, analysis, and visualization, not a one-stop shop for production-level machine learning (ML). While those are important parts of the ML lifecycle, the Elastic Stack can also help with data transformation, feature engineering, model building, and monitoring, covering the process end to end.

Here are some Stack features that can enable end-to-end Machine Learning at scale:

  • Aggregations: Aggregations are a great way to answer important questions about your data and can help decide what ML models would best fit your data. Elastic has a large variety of aggregations to calculate standard metrics and statistics as well as scripted metric aggregations to allow for more complicated custom aggregations.

  • Dashboards and Visualizations: The best way to analyze your data is to visualize it. Visualizing your data can not only help surface interesting patterns and behaviors in the data but also help communicate findings with your stakeholders in an effective manner.
    Kibana has a wide selection of visualizations to choose from: Lens, where you can drag and drop data fields to build visualizations; TSVB, for sophisticated visualizations on time series data; Maps, for geographical data; and Vega, a JSON interface for creating complex custom visualizations.

  • Transforms: Transforms aggregate data across multiple data sources and write the output to an index. This not only makes the aggregated data searchable but also opens up opportunities to build new models and analytics on top of it.
    For example, a transform on the web log sample dataset available in Kibana can create an entity-centric index that summarizes network activity for each client IP: the sum of bytes, the number of distinct URLs and agents, incoming requests by location, and the number of distinct geographic destinations.
    If you now wanted to identify suspicious client IPs, you could use the transformed data as input to an outlier detection model.

  • Ingest Processors and Ingest Pipelines: Ingest pipelines in Elastic are a way to modify and enrich your data during ingest. They are essentially a collection of ingest processors, which sequentially perform various operations on your data.
    The Stack has several processors for common data processing tasks such as dropping values, setting default values, and normalizing data using regular expressions. In addition, Elastic has inference processors that enrich your data with predictions from pre-trained ML models. For everything else, you can use script processors.

  • Data Frame Analytics: If you have a Platinum license, the data frame analytics feature allows you to train classification, regression, and outlier detection models in the Elastic Stack. You can kick off multiple model training runs at once, and view important metrics such as accuracy, confusion matrices, and feature importances for the trained models. Once you have a model that you are happy with, you can further use it in an inference processor to make predictions on new data.

  • Anomaly Detection: If you have a Platinum license, the anomaly detection feature allows you to model the normal behavior of your time series data, learn periodicity and trends in it, and identify anomalies. The feature has several analysis functions that let you define what should be flagged as an anomaly in your dataset, for example, spikes or dips in values, or rare values. You can run multiple anomaly detection "jobs" to identify anomalies across multiple data sources, which, when viewed together, can help surface larger observability issues or security threats in your environment.

  • Eland: Eland is a Python client for Elasticsearch that lets you import external ML models into Elastic for inference. It currently supports models trained using scikit-learn, XGBoost, and LightGBM libraries, as well as BERT models trained in PyTorch.

  • Model Monitoring and Alerting: In a production setup, it is important to monitor ML models for drift, performance changes, and failures. In the Elastic Stack, you can create dashboards in Kibana to visualize model performance over time, and use the alerting features, namely Kibana Alerting and Watcher, to trigger actions such as sending emails or Slack alerts to stakeholders when certain conditions are met.
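
To make the ingest pipeline idea above a bit more concrete, here is a minimal sketch in Python that only builds the request body you might send to PUT _ingest/pipeline/<id>. The pipeline ID, the model ID, and the field names (response, url, ml.prediction) are assumptions made up for illustration, not values from this post. The processors run in the order listed: set a default, normalize with a regular expression, then add a prediction with an inference processor.

```python
import json

# Hypothetical names, chosen only for illustration.
PIPELINE_ID = "weblogs-enrich"
MODEL_ID = "my-trained-model"  # placeholder for a model you trained or imported


def build_pipeline(model_id: str) -> dict:
    """Return a request body for PUT _ingest/pipeline/<id>."""
    return {
        "description": "Clean up web log documents, then add a model prediction",
        "processors": [
            # Set a default value; override=False leaves existing non-null values alone.
            {"set": {"field": "response", "value": "unknown", "override": False}},
            # Normalize with a regular expression: strip a trailing slash from the URL.
            {
                "gsub": {
                    "field": "url",
                    "pattern": "/$",
                    "replacement": "",
                    "ignore_missing": True,
                }
            },
            # Enrich each document with a prediction from the given model.
            {"inference": {"model_id": model_id, "target_field": "ml.prediction"}},
        ],
    }


pipeline = build_pipeline(MODEL_ID)
print(f"PUT _ingest/pipeline/{PIPELINE_ID}")
print(json.dumps(pipeline, indent=2))
```

The printed JSON can be pasted into Kibana Dev Tools or sent with any Elasticsearch client once the model ID points at a real trained model.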

If you would like to try the above features, get a free 14-day trial of Elastic Cloud and some sample data to get started.
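
As a starting point with the sample web logs, the transform and outlier detection idea described earlier could be sketched roughly as follows. The code only builds request bodies for PUT _transform/<id> and PUT _ml/data_frame/analytics/<id>. The destination index names and aggregation names are hypothetical, and the field names (clientip, bytes, url.keyword, agent.keyword, geo.dest) are assumed to match the Kibana sample data, so check them against your own mappings. It covers a simplified subset of the summary described above.

```python
import json

SOURCE_INDEX = "kibana_sample_data_logs"  # assumed name of the Kibana sample web logs index
ENTITY_INDEX = "weblogs-by-clientip"      # hypothetical transform destination
OUTLIER_INDEX = "weblogs-outliers"        # hypothetical analytics destination


def build_transform(source: str, dest: str) -> dict:
    """Request body for a pivot transform that summarizes activity per client IP."""
    return {
        "source": {"index": source},
        "dest": {"index": dest},
        "pivot": {
            # One output document per distinct client IP.
            "group_by": {"clientip": {"terms": {"field": "clientip"}}},
            "aggregations": {
                "bytes_sum": {"sum": {"field": "bytes"}},
                "url_count": {"cardinality": {"field": "url.keyword"}},
                "agent_count": {"cardinality": {"field": "agent.keyword"}},
                "dest_count": {"cardinality": {"field": "geo.dest"}},
            },
        },
    }


def build_outlier_job(source: str, dest: str) -> dict:
    """Request body for an outlier detection job using default settings."""
    return {
        "source": {"index": source},
        "dest": {"index": dest},
        "analysis": {"outlier_detection": {}},
    }


transform = build_transform(SOURCE_INDEX, ENTITY_INDEX)
# The outlier detection job reads the entity-centric index the transform writes.
outlier_job = build_outlier_job(ENTITY_INDEX, OUTLIER_INDEX)

print(json.dumps(transform, indent=2))
print(json.dumps(outlier_job, indent=2))
```

Both the transform and the analytics job need to be started after they are created, and the outlier detection job requires the license level mentioned above.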

