Dec 8th, 2018: [EN][Python/Elasticsearch] Getting started with Elasticsearch DSL and Python


(Honza Král) #1

Getting started with Python DSL

There are two official Elasticsearch clients for Python - elasticsearch-py and elasticsearch-dsl. The low-level elasticsearch-py is a no-opinions client that provides a convenient way to talk to Elasticsearch - handling all the complexities of talking to a distributed system while preserving the simplicity of the REST APIs. Because of this approach it often does not feel too pythonic - it requires you to create complex dicts that represent queries and doesn't offer any help (with the exception of the helpers for bulk and scan). For that there is elasticsearch-dsl.

Elasticsearch DSL

The purpose of elasticsearch-dsl is to provide an easier and more familiar way to work with Elasticsearch. It focuses on the most common operations, such as search and generally working with data. We should be able to design the structure of a document, including its mappings and any Python-only bits like custom methods, and perform even a complex search without having to write the query as JSON (or even as a dict), so that we can focus on the query we want to run instead of getting lost in curly braces.

What we still have to know, and what the library won't help us with, is the meaning behind the different query types in Elasticsearch: things like the difference between a match and a terms query, or which options a multi_match query accepts. These are decisions we still need to make ourselves. On the other hand, the library can help us create these queries by providing a more convenient syntax:

from elasticsearch_dsl import Search
s = Search(index="i")
# same as {"match": {"title": "python"}}
s = s.query('match', title='python')

Note that we had to assign back to s to keep our changes. That is because the methods on the Search objects always return a cloned copy of the original search with the requested modification. That way you don't have to be afraid of having global search objects or passing them around.

We can also combine multiple queries together using logical operations (| for OR, & for AND, and unary ~ for negation) so we don't have to construct a bool query manually:

# Q is a simple helper function to construct any query
from elasticsearch_dsl import Q

# same as {"bool": {"should": [{"match": {"title": "python"}}, {"term": {"category": "python"}}]}}
query = Q('match', title='python') | Q('term', category='python')

# use this as a filter = wrap in {"bool": {"filter": []}}
s = Search(index="i").filter(query)

You can call the to_dict() method on any object provided by the library to inspect the underlying JSON representation. That way you can make sure the library is doing what you expect:

print(query.to_dict())
print(Search().filter(query).to_dict())

Many of the objects can also be constructed from the original JSON (in the form of a Python dict), which is useful for converting examples and for migrating from the low-level client to the DSL:

s = Search.from_dict({
  "query": {
    "match": {"title": "python"}
  }
})

# now we can simply add a filter without worrying where in the dict it would go
s = s.filter('term', category='search')

Example application

So let's have a look and create an example application indexing our git history in Elasticsearch and performing some interesting queries.

To get started we first have to configure the library. This step is optional, as we can always create an instance of the low-level client and pass it around, but it is more convenient to define a connection globally so that all the APIs work without us having to specify a connection:

from elasticsearch_dsl import connections

# create a default connection here; any kwargs are passed to the low-level client
connections.create_connection(hosts=['localhost:9200'])

# if we ever need the low-level client we can retrieve it too
es = connections.get_connection()

We have to make sure this code always runs before any of the examples here, or before our application code in the real world. Typically the application's entry point is a good place for this.

For situations where we want to connect to multiple clusters, we can create multiple connections with different aliases and then use those aliases to refer to the different clusters in our code; all the options are described in the configuration chapter of the docs.

Create Document

Now that we can connect to Elasticsearch it is time to design our document mappings and structure:


from elasticsearch_dsl import InnerDoc, Document, Object, Keyword, Date, Text, MetaField


class User(InnerDoc):
    """
    User is not a document, just an inner object we will be using to represent
    the author and committer for each commit in our repo.
    """
    name = Text(fields={'keyword': Keyword()})
    email = Keyword()

class Commit(Document):
    """
    Document storing information about a commit in git, notably the metadata
    and statistics about lines touched.
    """
    committed_date = Date()
    authored_date = Date()

    # author and committer are inner objects mapped using the User class
    committer = Object(User)
    author = Object(User)

    # parent commits
    parent_shas = Keyword(multi=True)
    # list of files touched by the commit
    files = Keyword(multi=True)

    message = Text()

    def subject(self):
        """Return the first line of the git message."""
        return self.message.split('\n', 1)[0]

    class Meta:
        # we turn off dynamic mappings, ignoring any fields that are not
        # explicitly mapped
        dynamic = MetaField(False)

    class Index:
        # any configuration of the index
        name = 'git-v1'
        settings = {
          'number_of_shards': 1
        }

(Honza Král) #2

As we can see, we are again relying on knowledge of Elasticsearch to tell us where to use Keyword vs Text, including the use of multi-fields, where the syntax exactly copies the one in Elasticsearch. The only Python-only option is multi=True, which signifies that the field is always expected to hold a list. Like Elasticsearch, we place no limit on whether a field contains a single value or an array. Setting multi to True just makes those fields default to an empty Python list, so we can start appending items right away:

# create an empty commit object
c = Commit()
# accessing c.files gives us an empty list so we can start appending directly
c.files.append('__init__.py')

Just defining the Python class doesn't create the index or populate the mappings in Elasticsearch, so we need to do that first:

Commit.init()

Now we have an index created in Elasticsearch, including the proper mappings and settings that we asked for in our Commit class. This is an important step and needs to always be done before we start indexing any data into Elasticsearch.

Next time we will learn how to load data into our newly created index and then how to run queries and aggregations. Stay tuned!

Acknowledgements

I cannot talk or write about elasticsearch-dsl without acknowledging its origins. I borrowed a lot from other projects when designing this library and leaned heavily on the feedback from the excellent Python community. I would like to thank the Django project for inspiration, and especially Will Kahn-Greene and Rob Hudson, the maintainers of the now-deprecated elasticutils, for great feedback at the project's conception.

If you are interested in the details of the design process you can see the conference talk about the conception of elasticsearch-dsl here.