Getting started with Python DSL
There are two official Elasticsearch clients for Python: elasticsearch-py and elasticsearch-dsl. The low-level elasticsearch-py is a no-opinions client that provides a convenient way to talk to Elasticsearch, handling all the complexities of talking to a distributed system while preserving the simplicity of the REST APIs. Because of this approach it often does not feel very pythonic: it requires you to create complex dicts that represent queries and offers little help beyond that (with the exception of the bulk and scan helpers). For that there is elasticsearch-dsl.
Elasticsearch DSL
The purpose of elasticsearch-dsl is to provide an easier and more familiar way to work with Elasticsearch. It focuses on the most common operations, such as search and generally working with data: we should be able to design the structure of a document, including its mappings and any Python-only bits like custom methods, and perform even a complex search without having to write the query as JSON (or even as a dict), so that we can focus on the query we want to run instead of getting lost in curly braces.

What we still have to know, and what the library won't help us with, is the meaning behind the different query types in Elasticsearch: things like the difference between a match and a terms query, or the options of a multi_match query. Those decisions remain ours. On the other hand, the library can help us create these queries by providing a more convenient syntax:
```python
from elasticsearch_dsl import Search

s = Search(index="i")

# same as {"match": {"title": "python"}}
s = s.query('match', title='python')
```
Note that we had to assign back to s to keep our changes. That is because the methods on Search objects always return a cloned copy of the original search with the requested modification applied. That way you don't have to be afraid of having global search objects or of passing them around.
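This copy-on-modify pattern can be illustrated with a minimal sketch. The class below is purely hypothetical, not elasticsearch-dsl's actual implementation; it only shows why assigning back is needed:

```python
import copy

class FluentSearch:
    """Toy illustration of the copy-on-modify pattern used by Search.

    This is a sketch of the idea only, not elasticsearch-dsl's code.
    """
    def __init__(self):
        self.queries = []

    def query(self, name, **params):
        # clone first, then modify the clone; the original stays intact
        clone = copy.deepcopy(self)
        clone.queries.append({name: params})
        return clone

s1 = FluentSearch()
s2 = s1.query('match', title='python')
# s1 still has no queries; only the returned clone s2 was modified
```

Because each method returns a fresh clone, forgetting the assignment (`s.query(...)` instead of `s = s.query(...)`) silently discards the modification.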
We can also combine multiple queries using logical operators (| for OR, & for AND, and unary ~ for negation), so we don't have to construct a bool query manually:
```python
# Q is a simple helper function to construct any query
from elasticsearch_dsl import Q

# same as {"bool": {"should": [{"match": {"title": "python"}}, {"term": {"category": "python"}}]}}
query = Q('match', title='python') | Q('term', category='python')

# use this as a filter = wrap in {"bool": {"filter": []}}
Search(index="i").filter(query)
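The combination works through Python's operator overloading. The toy class below (hypothetical, much simpler than the library's Q objects) sketches how | can assemble a bool/should query:

```python
class SimpleQ:
    """Toy query object sketching how | builds a bool/should query.

    Hypothetical illustration; elasticsearch-dsl's Q objects are far richer
    (they also flatten nested bools, combine scores, etc.).
    """
    def __init__(self, body):
        self.body = body

    def __or__(self, other):
        # OR-ing two queries wraps them in a bool query's "should" clause
        return SimpleQ({'bool': {'should': [self.body, other.body]}})

q = SimpleQ({'match': {'title': 'python'}}) | SimpleQ({'term': {'category': 'python'}})
```

Here `q.body` ends up as the bool/should dict shown in the comment above.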
You can call the to_dict() method on any object provided by the library to inspect its underlying JSON representation; that way you can make sure the library is doing what you expect:

```python
print(query.to_dict())
print(Search().filter(query).to_dict())
```
Many of the objects can also be constructed from the original JSON (in the form of a Python dict), which is useful for converting examples and for migrating from the low-level client to the DSL:

```python
s = Search.from_dict({
    "query": {
        "match": {"title": "python"}
    }
})

# now we can simply add a filter without worrying where in the dict it would go
s = s.filter('term', category='search')
```
Example application
So let's create an example application that indexes our git history in Elasticsearch and performs some interesting queries.

To get started we first have to configure the library. This is an optional step, since we can always create an instance of the low-level client and pass it around, but it is more convenient to define a connection globally so that all the APIs work without us having to specify a connection each time:
```python
from elasticsearch_dsl import connections

# creating a default connection here; any kwargs will be passed to the low-level client
connections.create_connection(hosts=['localhost:9200'])

# if we ever need the low-level client we can retrieve it too
es = connections.get_connection()
```
We have to make sure we always execute this code before using any of the examples here, or in our real-world application. Typically the application's entry point is a good place for this.

For situations where you need to connect to multiple clusters, you can create multiple connections with different aliases and then use those aliases to refer to the different clusters in your code; you can see all the options in the configuration chapter of the docs.
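For example, aliases could be registered like this (the host names here are made up, and `create_connection` only registers a client under the alias; it does not immediately contact the cluster):

```python
from elasticsearch_dsl import connections

# register two clusters under different aliases (hypothetical hosts)
connections.create_connection('default', hosts=['localhost:9200'])
connections.create_connection('logs', hosts=['logs-cluster:9200'])

# APIs then accept a `using` argument to select an alias, e.g.:
# Search(using='logs', index='i')
```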
Create Document
Now that we can connect to Elasticsearch it is time to design our document mappings and structure:
```python
from elasticsearch_dsl import InnerDoc, Document, Object, Keyword, Date, Text, MetaField

class User(InnerDoc):
    """
    User is not a document, just an inner object we will be using to represent
    the author and committer for each commit in our repo.
    """
    name = Text(fields={'keyword': Keyword()})
    email = Keyword()

class Commit(Document):
    """
    Document storing information about a commit in git, notably the metadata
    and statistics about lines touched.
    """
    committed_date = Date()
    authored_date = Date()

    # author and committer are inner objects mapped using the User class
    committer = Object(User)
    author = Object(User)

    # parent commits
    parent_shas = Keyword(multi=True)

    # list of files touched by the commit
    files = Keyword(multi=True)
    message = Text()

    def subject(self):
        " Return first line of the git message. "
        return self.message.split('\n', 1)[0]

    class Meta:
        # we turn off dynamic mappings, ignoring any fields that are not
        # explicitly mapped
        dynamic = MetaField(False)

    class Index:
        # any configuration of the index
        name = 'git-v1'
        settings = {
            'number_of_shards': 1
        }
```
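The subject() method above is plain Python with no Elasticsearch involvement. For a commit message like the one below (the message text is made up for illustration), it returns everything before the first newline:

```python
# same logic as Commit.subject(): take the first line of the message
message = "Add User and Commit mappings\n\nLonger description of the change."
subject = message.split('\n', 1)[0]
# subject is now just the first line of the commit message
```

This is the kind of "Python-only bit" mentioned earlier: a custom method that lives alongside the mapped fields without affecting the mapping itself.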