Dec 17th, 2021: [EN] Getting Started with the Elasticsearch Python Client

An Introduction To Elasticsearch With Python

Elastic provides its own client for most languages, including, of course, Python! There are two official Elasticsearch clients for Python:

  • elasticsearch-dsl: A very user-friendly, high-level Python client. It is built on top of elasticsearch-py and abstracts away some of the lower-level functionality.

  • elasticsearch-py: The lower-level client, which gives the user more fine-grained control over their API calls and query management.

We’re going to take a deeper look at elasticsearch-py in this post. Both clients are open source and very extensible, so if you’re a keen Python developer and see a feature you need that’s missing, don’t hesitate to get involved!

Installing The Elasticsearch Python Client

The Elasticsearch client package can be installed with the pip package manager:

$ python -m pip install elasticsearch

The Elasticsearch Python client also supports asynchronous coding patterns, which you can make use of by installing the client with the async extra:

$ python -m pip install "elasticsearch[async]"
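
As a rough illustration of what the async extra enables, here is a minimal sketch using the AsyncElasticsearch class that ships with the client. The URL, credentials, and index name are placeholders, and count_documents and main are hypothetical helper names chosen for this example, not part of the library. The connection arguments mirror the ones used later in this post and may need adjusting for your client version.

```python
import asyncio


async def count_documents(es, index):
    """Return the number of documents in `index` using an async client."""
    response = await es.count(index=index)
    return response["count"]


async def main():
    # Imported here so this module still loads if the async extra is missing.
    from elasticsearch import AsyncElasticsearch

    # Placeholder URL and credentials: replace with your own cluster details.
    es = AsyncElasticsearch(
        "https://localhost:9200",
        http_auth=("elastic", "yourpassword"),
    )
    try:
        print(await count_documents(es, "test_index"))
    finally:
        # Always close the client so the underlying HTTP session is released.
        await es.close()


# To run against a real cluster:
# asyncio.run(main())
```

The client is passed into count_documents rather than created inside it, which keeps the coroutine easy to exercise with a stand-in object in tests.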

Getting A Connection To Elasticsearch From Python

The Elasticsearch client class takes many configurable parameters, ranging from http_auth credentials to TLS certificates. There are several ways to authenticate to Elasticsearch, depending on whether you are connecting to an Elastic Cloud instance or a self-hosted one. By far the simplest is the http_auth argument, which takes a tuple of (username, password).

For more on authenticating to Elasticsearch using the elasticsearch-py client, see the documentation here.

An example of an Elasticsearch connection object in Python:


from elasticsearch import Elasticsearch

def setup_es_conn() -> Elasticsearch:
    """Sets up a connection to the Elasticsearch cluster."""
    return Elasticsearch(
        'https://elasticsearch.url:port',  # placeholder: use your cluster's host and port
        http_auth=('elastic', 'yourpassword'),
        timeout=60,  # seconds to wait for each request
        max_retries=2,
        retry_on_timeout=True,
    )
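
Hardcoding a password as in the example above is fine for a demo, but in practice you would probably want to read it from somewhere else. The following is a minimal sketch, assuming the credentials live in environment variables. The variable names ES_URL, ES_USERNAME, and ES_PASSWORD, and the helper names es_kwargs_from_env and setup_es_conn_from_env, are made up for this example. The keyword arguments are the same ones used in setup_es_conn.

```python
import os


def es_kwargs_from_env(env=None):
    """Build keyword arguments for Elasticsearch() from environment variables."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names; fall back to a local cluster URL.
        "hosts": [env.get("ES_URL", "https://localhost:9200")],
        # ES_PASSWORD has no default, so a missing value raises KeyError.
        "http_auth": (env.get("ES_USERNAME", "elastic"), env["ES_PASSWORD"]),
        "timeout": 60,
        "max_retries": 2,
        "retry_on_timeout": True,
    }


def setup_es_conn_from_env():
    """Create a client using settings taken from the environment."""
    # Imported here so the settings helper can be used without the client installed.
    from elasticsearch import Elasticsearch

    return Elasticsearch(**es_kwargs_from_env())
```

Failing fast on a missing password is a deliberate choice here: a KeyError at startup is usually easier to diagnose than an authentication error on the first request.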

Querying Lots Of Data Using helpers.scan

The helpers module of the elasticsearch-py client provides some simple and helpful (as the name might suggest) abstractions over the raw Elasticsearch APIs, such as the bulk and scroll APIs. The helpers.scan function is a thin wrapper around the scroll API: an iterator that yields every hit returned by the underlying scroll requests. Only the connection object is strictly required, but a typical call to helpers.scan passes the following:

  • An Elasticsearch connection object.
  • An index name to query against as a string.
  • A query body as a dict.

An example of querying all data in an index using helpers.scan:


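from elasticsearch import helpers

def scan_es():
    """Iterates over every document in test_index."""
    elasticsearch_conn = setup_es_conn()

    # doc_type is omitted: mapping types are deprecated in Elasticsearch 7.x.
    results = helpers.scan(
        elasticsearch_conn,
        index="test_index",
        preserve_order=True,
        query={"query": {"match_all": {}}},
    )

    for item in results:
        # Each item is a raw hit dict; the document body is under "_source".
        print(item["_source"])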

For more on the Scan helper, see the documentation here.
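
Each hit yielded by helpers.scan is a plain dict in the usual search-hit shape, with keys such as _index, _id, and _source. As a small sketch of the "parse" step, the generator below pulls out the ID and the document body from each hit. iter_documents is a hypothetical name chosen for this example, and the sample hits are made-up data.

```python
def iter_documents(hits):
    """Yield (document_id, source) pairs from an iterable of search hits."""
    for hit in hits:
        yield hit["_id"], hit["_source"]


# Stand-in data in the shape of search hits, so the sketch runs without a cluster.
sample_hits = [
    {"_index": "test_index", "_id": "1", "_source": {"file": {"path": "/tmp/a.txt"}}},
    {"_index": "test_index", "_id": "2", "_source": {"file": {"path": "/tmp/b.txt"}}},
]

for doc_id, source in iter_documents(sample_hits):
    print(doc_id, source["file"]["path"])
```

Because the function only needs an iterable, the iterator returned by helpers.scan can be passed in directly, and documents are still processed one at a time rather than loaded into memory all at once.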

Bulk Insertion To Elasticsearch Using helpers.bulk

The bulk API performs many index, create, update, and delete operations in a single request, which makes it extremely useful for handling lots of data efficiently. To insert data with the Python client, pass helpers.bulk an iterable of actions, which you can build with a simple list comprehension or generator over your data, like so:

from datetime import datetime
from elasticsearch import Elasticsearch, ElasticsearchException, helpers

def es_insert():
    """Builds a list of bulk actions and inserts them with helpers.bulk."""
    elasticsearch_conn = setup_es_conn()

    try:
        # One action per document; "_op_type" selects the bulk operation.
        documents = [
            {
                "_index": "example_data",
                "_op_type": "create",
                "_source": {
                    "@timestamp": datetime.now().isoformat(),
                    "file": {"path": file_obj["filename"]},
                },
            }
            for file_obj in [{"filename": "/tmp/example.txt"}]
        ]
        helpers.bulk(elasticsearch_conn, documents)
    except ElasticsearchException as err:
        ...
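
For larger datasets, a generator avoids building the whole list of actions in memory. The following is a minimal sketch of that variant. generate_actions and bulk_insert are hypothetical names for this example, and the index name is the same placeholder used above. It also uses the raise_on_error=False option of helpers.bulk so that failures are returned rather than raised.

```python
from datetime import datetime, timezone


def generate_actions(file_paths, index="example_data"):
    """Lazily yield one bulk 'create' action per file path."""
    for path in file_paths:
        yield {
            "_index": index,
            "_op_type": "create",
            "_source": {
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "file": {"path": path},
            },
        }


def bulk_insert(es, file_paths):
    """Send the generated actions to Elasticsearch and report the outcome."""
    # Imported here so generate_actions can be used without the client installed.
    from elasticsearch import helpers

    # With raise_on_error=False, failed actions are collected instead of raised.
    success_count, errors = helpers.bulk(
        es, generate_actions(file_paths), raise_on_error=False
    )
    return success_count, errors
```

helpers.bulk consumes the generator in chunks, so only one chunk of actions needs to be held in memory at a time.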

One handy tip to note: Elasticsearch data streams support only the create action in the _op_type field. The _op_type must be one of index, create, delete, or update. For a regular index, the recommended _op_type is index, which is also the default when the field is omitted.
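
To make the four action types concrete, here is a sketch of what each action dict might look like when passed to helpers.bulk. The function names are hypothetical helpers for this example only. The dict layouts follow the shapes shown in the client's helpers documentation, where update actions carry a partial document under the doc key.

```python
def index_action(index, doc_id, source):
    """Create the document, or replace it if the ID already exists."""
    return {"_op_type": "index", "_index": index, "_id": doc_id, "_source": source}


def create_action(index, source):
    """Create a new document; the only action a data stream accepts."""
    return {"_op_type": "create", "_index": index, "_source": source}


def update_action(index, doc_id, partial_doc):
    """Merge the given fields into an existing document."""
    return {"_op_type": "update", "_index": index, "_id": doc_id, "doc": partial_doc}


def delete_action(index, doc_id):
    """Remove a document by ID; no document body is needed."""
    return {"_op_type": "delete", "_index": index, "_id": doc_id}


# A mixed batch like this can be passed to helpers.bulk as a single iterable.
actions = [
    index_action("example_data", "1", {"file": {"path": "/tmp/example.txt"}}),
    update_action("example_data", "1", {"file": {"path": "/tmp/renamed.txt"}}),
    delete_action("example_data", "1"),
]
```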

For more information on the Bulk API, see Elastic’s documentation here.
For more information on Helpers, see the documentation here.

Happy data wrangling, everyone - thanks for reading!
