Dec 21st, 2022: [EN] How to import an ML model with Eland, if you're not a Python developer

Why do I need an ML model?

An ML model in Elasticsearch can enrich your data during indexing. Some examples of what an ML model can do:

  • extract entities from text (NER)
  • predict classes (classification)
  • identify language (classification)
  • generate text embeddings (for subsequent vector search).

For more information, see Machine learning in the Elastic stack.

How do I get an ML model?

To manage trained models in Elasticsearch, you need at least a Platinum license, but if you just want to experiment with the feature, you can start a trial.
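
If you prefer an API call to the Kibana UI, a trial can also be started through the license API. The sketch below assumes a local cluster at http://localhost:9200 and the elastic / changeme credentials used elsewhere in this post; ES_URL, ES_USER and ES_PASS are illustrative variable names, not something Elasticsearch defines. The function only prints the curl command, so you can review it before running it.

```shell
# Assumed connection details; adjust them for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_USER="${ES_USER:-elastic}"
ES_PASS="${ES_PASS:-changeme}"

# Print the curl command that starts a trial license.
# acknowledge=true confirms that you accept the trial terms.
trial_command() {
  printf 'curl -s -u %s:%s -X POST "%s/_license/start_trial?acknowledge=true"\n' \
    "$ES_USER" "$ES_PASS" "${ES_URL%/}"
}

trial_command
```

Copy the printed command into your terminal to actually start the trial.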

After starting a trial, you should see "Analytics" -> "Machine learning" in the Kibana menu. Click it, and you'll find the "Trained models" tab under the "Model management" section.

The Hugging Face website is a great resource for trained models.

What is Eland?

Eland is an open-source tool for data analysis, specifically for data stored in Elasticsearch. Currently, it's also the recommended tool for importing an ML model into Elasticsearch. This is the use case we're talking about here.

How do I run it?

You'll need to either install Eland, or run it in Docker.

With Docker

First, you have to install Docker.

Then, you need to clone the Eland repository. Assuming you have git:

git clone https://github.com/elastic/eland.git

Then, you have to build the elastic/eland image with Docker:

cd /path-to-repository/eland
docker build -t elastic/eland .

After that, run a command similar to this in your terminal:

docker run -it --rm --network host \
    elastic/eland \
    eland_import_hub_model \
      --url http://host.docker.internal:9200/ \
      --hub-model-id philschmid/distilroberta-base-ner-conll2003 \
      --task-type ner \
      --es-username elastic \
      --es-password changeme \
      --start

The command above:

  • runs a Docker container with Eland installed in it
  • in the container, calls eland_import_hub_model to import the model philschmid/distilroberta-base-ner-conll2003 from the Hugging Face website
  • assumes that Elasticsearch is running and accessible on localhost, port 9200
  • provides the username and password of the elastic user
  • starts the model deployment after the upload (--start).

host.docker.internal is a special DNS name, available in Docker Desktop, that resolves to the internal IP address used by the host. Using this name, services inside the container can access services running on the host machine.
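
On a plain Docker Engine install on Linux, host.docker.internal may not resolve unless you add it yourself, but with --network host the container shares the host's network, so localhost works directly. The helper below is a rough heuristic for choosing the value of --url; the function name es_url_for_container is made up for this example, and port 9200 is assumed as in the command above.

```shell
# Pick the Elasticsearch URL to pass to --url inside the container.
# $1 is the host OS name as printed by `uname -s` (defaults to the current OS).
es_url_for_container() {
  os="${1:-$(uname -s)}"
  case "$os" in
    # With --network host on Linux, the container uses the host's network.
    Linux) echo "http://localhost:9200/" ;;
    # Docker Desktop (macOS, Windows) provides host.docker.internal.
    *)     echo "http://host.docker.internal:9200/" ;;
  esac
}

es_url_for_container
```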

Without Docker

Eland is a Python package, which means it has to be installed with one of the Python package installers, either pip or conda. Many operating systems ship with Python pre-installed, and a Python installation usually includes pip.

Prerequisites

There are a few prerequisites (OS packages) that should be installed before you can install and run Eland. The following command is for Debian-based systems:

sudo apt-get install -y \
    build-essential pkg-config cmake \
    python3-dev libzip-dev libjpeg-dev

Other Linux distributions such as CentOS, Red Hat, Arch, etc. may require a different package manager and different package names.

If you're using Windows, it's likely that you don't need to install any of those.

If you're using macOS, you might need to install Xcode command-line tools:

xcode-select --install

Compatibility

Eland has a few compatibility requirements:

  • Python 3.7+ and Pandas 1.3.
  • Elasticsearch 7.11+, recommended 8.3 or later.
  • PyTorch 1.11.0 or earlier.

First of all, open your preferred terminal application and check your Python version:

python --version

On macOS, Python 3 can be installed with Homebrew:

brew install python

As of November 2022, you can download and install Python versions from 3.7.x to 3.11.x. Usually you want the latest version unless there's a specific reason to use an older one, and here there is: we need to install PyTorch 1.11.x, which requires a Python version less than 3.10.

Once Python is installed, check its --version as specified above.
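
To avoid comparing version numbers by eye, the check can be scripted. This is a minimal sketch that encodes the range stated above (at least 3.7, below 3.10); python_version_ok is a hypothetical helper name, and the check only runs if a python executable is found on your PATH.

```shell
# Succeed when a "major.minor[.patch]" version is at least 3.7 and below 3.10.
python_version_ok() {
  major="${1%%.*}"      # text before the first dot
  rest="${1#*.}"        # text after the first dot
  minor="${rest%%.*}"   # text before the next dot, if any
  [ "$major" -eq 3 ] && [ "$minor" -ge 7 ] && [ "$minor" -lt 10 ]
}

# Check the interpreter on PATH, if there is one.
if command -v python >/dev/null 2>&1; then
  v="$(python --version 2>&1 | awk '{print $2}')"
  if python_version_ok "$v"; then
    echo "Python $v is in the supported range"
  else
    echo "Python $v is outside the 3.7 to 3.9 range"
  fi
fi
```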

It's usually a good idea to upgrade Python packages related to the package installer. Even if you just installed Python, they might not be at their latest version:

python -m pip install --upgrade pip setuptools wheel

Installing Eland and PyTorch

After that, you're ready to install the eland package:

python -m pip install eland

Eland lists pandas as one of its default requirements, so at this point it is installed as well, with no extra action needed. By default, Eland doesn't install PyTorch, so we need to install it explicitly:

python -m pip install 'eland[pytorch]'

This might take a while. Once the command completes successfully, you should be able to run eland_import_hub_model, the Eland command for importing a model.

If you're getting an error during PyTorch installation that says "No matching distribution found", this means that PyTorch doesn't provide binaries (wheels) for your combination of Python version, OS and architecture yet. In this case, your best bet is to open an issue in PyTorch. See a similar issue here.
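
When you search for or report such an error, it helps to state the exact combination you are running. The snippet below is a small sketch that collects it in one line; describe_platform is a made-up helper name, and the OS and architecture are whatever uname reports on your machine.

```shell
# Print the Python version (if any), OS name and CPU architecture on one line.
describe_platform() {
  if command -v python >/dev/null 2>&1; then
    py="$(python --version 2>&1)"
  else
    py="Python not found"
  fi
  printf '%s | os=%s | arch=%s\n' "$py" "$(uname -s)" "$(uname -m)"
}

describe_platform
```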

Importing the model

The command eland_import_hub_model has the following options:

> eland_import_hub_model --help
usage: eland_import_hub_model [-h] (--url URL | --cloud-id CLOUD_ID) --hub-model-id HUB_MODEL_ID [--es-model-id ES_MODEL_ID] [-u ES_USERNAME] [-p ES_PASSWORD] [--es-api-key ES_API_KEY]
                              [--task-type {fill_mask,question_answering,zero_shot_classification,text_embedding,text_classification,ner}] [--quantize] [--start] [--clear-previous] [--insecure]
                              [--ca-certs CA_CERTS]

optional arguments:
  -h, --help            show this help message and exit
  --url URL             An Elasticsearch connection URL, e.g. http://localhost:9200
  --cloud-id CLOUD_ID   Cloud ID as found in the 'Manage Deployment' page of an Elastic Cloud deployment
  --hub-model-id HUB_MODEL_ID
                        The model ID in the Hugging Face model hub, e.g. dbmdz/bert-large-cased-finetuned-conll03-english
  --es-model-id ES_MODEL_ID
                        The model ID to use in Elasticsearch, e.g. bert-large-cased-finetuned-conll03-english.When left unspecified, this will be auto-created from the `hub-id`
  -u ES_USERNAME, --es-username ES_USERNAME
                        Username for Elasticsearch
  -p ES_PASSWORD, --es-password ES_PASSWORD
                        Password for the Elasticsearch user specified with -u/--username
  --es-api-key ES_API_KEY
                        API key for Elasticsearch
  --task-type {fill_mask,question_answering,zero_shot_classification,text_embedding,text_classification,ner}
                        The task type for the model usage. Will attempt to auto-detect task type for the model if not provided. Default: auto
  --quantize            Quantize the model before uploading. Default: False
  --start               Start the model deployment after uploading. Default: False
  --clear-previous      Should the model previously stored with `es-model-id` be deleted
  --insecure            Do not verify SSL certificates
  --ca-certs CA_CERTS   Path to CA bundle

Example command:

eland_import_hub_model \
  --url http://localhost:9200/ \
  --hub-model-id philschmid/distilroberta-base-ner-conll2003 \
  --task-type ner \
  --es-username elastic \
  --es-password changeme \
  --start

Note that if you're importing the model into an Elastic Cloud deployment, you can provide --cloud-id instead of the URL of the cluster.
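
As an illustration of how the connection options combine, the sketch below prints an import command that uses --cloud-id when a cloud ID is set and --url otherwise, and --es-api-key when an API key is set and username / password otherwise. Only flags from the help output above are used. The environment variable names (CLOUD_ID, ES_API_KEY, ES_URL, ES_USER, ES_PASS) and the function name import_command are assumptions of this example, and values containing spaces are not handled.

```shell
# Print an eland_import_hub_model command line.
# $1: Hugging Face model ID, $2: task type.
import_command() {
  if [ -n "${CLOUD_ID:-}" ]; then
    target="--cloud-id $CLOUD_ID"
  else
    target="--url ${ES_URL:-http://localhost:9200/}"
  fi
  if [ -n "${ES_API_KEY:-}" ]; then
    auth="--es-api-key $ES_API_KEY"
  else
    auth="--es-username ${ES_USER:-elastic} --es-password ${ES_PASS:-changeme}"
  fi
  echo "eland_import_hub_model $target $auth --hub-model-id $1 --task-type $2 --start"
}

import_command philschmid/distilroberta-base-ner-conll2003 ner
```

Review the printed command, then paste it into your terminal to run the import.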

The model is imported and started. Now what?

When you log in to Kibana and go to "Machine learning" -> "Trained models", you'll see a message informing you that you have new objects to synchronize. After you click "Synchronize", the model becomes accessible, and you can test it by clicking the "..." button in the "Actions" column and choosing "Test model".
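
The model can also be tried from the command line. The sketch below rests on two assumptions: that Eland, when --es-model-id is not given, derives the Elasticsearch model ID by replacing "/" with "__" and lowercasing the name, and that your cluster is a recent 8.x release exposing the _infer endpoint for trained models. The helper names and the sample sentence are made up, and infer_body does not escape quotes in its input.

```shell
# Assumed rule for the auto-generated model ID: "/" becomes "__", lowercased.
es_model_id() {
  printf '%s\n' "$1" | sed 's|/|__|g' | tr '[:upper:]' '[:lower:]'
}

# Request path and JSON body for a single test document.
infer_path() { printf '/_ml/trained_models/%s/_infer\n' "$1"; }
infer_body() { printf '{"docs":[{"text_field":"%s"}]}\n' "$1"; }

MODEL_ID="$(es_model_id philschmid/distilroberta-base-ner-conll2003)"

# Show what would be sent.
echo "POST $(infer_path "$MODEL_ID")"
infer_body 'Anna works at Acme in Berlin.'

# To send it (assumes a local cluster and the credentials used above):
# curl -s -u elastic:changeme -H 'Content-Type: application/json' \
#   -X POST "http://localhost:9200$(infer_path "$MODEL_ID")" \
#   -d "$(infer_body 'Anna works at Acme in Berlin.')"
```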

The next step would be configuring an ingest pipeline with an inference processor, and using the pipeline when indexing your data.
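
A pipeline of that kind might look like the sketch below. It prints a pipeline definition with a single inference processor. The model ID is what Eland is assumed to generate for the model imported above, message is a hypothetical name for the document field that holds your text, and ner-pipeline is a made-up pipeline name.

```shell
# Assumed auto-generated ID of the imported model.
MODEL_ID="philschmid__distilroberta-base-ner-conll2003"

# Print a pipeline definition with one inference processor.
# $1: model ID, $2: document field that contains the text to analyze.
pipeline_body() {
  cat <<EOF
{
  "description": "Enrich documents with the imported NER model",
  "processors": [
    {
      "inference": {
        "model_id": "$1",
        "field_map": { "$2": "text_field" }
      }
    }
  ]
}
EOF
}

pipeline_body "$MODEL_ID" message

# To create the pipeline (assumes a local cluster and the credentials used above):
# curl -s -u elastic:changeme -H 'Content-Type: application/json' \
#   -X PUT "http://localhost:9200/_ingest/pipeline/ner-pipeline" \
#   -d "$(pipeline_body "$MODEL_ID" message)"
#
# Documents indexed with ?pipeline=ner-pipeline then pass through the model.
```

The field_map entry maps your document field to text_field, the input name the imported model is expected to read from.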

Troubleshooting

If something didn't work with Eland, I would suggest the following sequence of actions:

  1. Google for the error.
  2. Search the existing issues in the Eland repository for the error.
  3. Open a new issue in the Eland repository.

The first two steps would often be enough to resolve the problem, but if not, don't hesitate to open an issue and ask for help. You'd be helping others who may be struggling with the same problem.
