DOs and DON'Ts when initializing data for integration tests with Elasticsearch
When building software that uses Elasticsearch for search, data aggregation or BM25/vector retrieval, it's vital to write at least a handful of integration tests. While “mocking indices” might seem tempting, because such tests run in a fraction of a second, what they actually test is not the interaction with real Elasticsearch, but our imagination of how Elasticsearch behaves. That imagination can get brutally verified in production, especially after a cluster upgrade.
To mitigate the most obvious drawback of integration tests (that they are slower), it’s crucial to initialize Elasticsearch with data in a way that's perhaps not optimal for daily production scenarios, but works efficiently for test setup.
Don’t recreate the container
Testing your Elasticsearch-backed functionality will likely take very little time, fractions of a second per test. Restarting Elasticsearch between tests is therefore not a wise idea, because you’ll spend dozens of extra seconds just waiting for ES to start.
Simply start Elasticsearch once before all the tests, clean up after each test and initialize data before each test.
Hint: if you’re using the Testcontainers module for Elasticsearch, e.g. in Java, make sure the container field is static and annotated with @Container, or at least that the container is started in @BeforeAll.
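To make this concrete, here is a minimal sketch of such a setup, assuming JUnit 5 with the Testcontainers Jupiter extension; the class and method names are illustrative, and ELASTICSEARCH_IMAGE is the same image constant used in the snippets below:
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.testcontainers.elasticsearch.ElasticsearchContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

@Testcontainers
class BookSearchIT {

    // static + @Container: the container starts once and is shared by all test methods
    @Container
    static ElasticsearchContainer elasticsearch = new ElasticsearchContainer(ELASTICSEARCH_IMAGE);

    @BeforeEach
    void initData() throws Exception {
        // index the documents the test needs, e.g. via the _bulk call shown below
    }

    @AfterEach
    void cleanUp() throws Exception {
        // drop whatever the test created, e.g. delete the books index via cURL
    }
}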
cURL is your friend before tests
Using client libraries in your production code (the code we’re testing) is a wise choice. However, when preparing the environment for tests, a more hacky approach might have benefits, because the needs of production use cases and of test data setup aren’t 100% the same. Using cURL to manage data in Elasticsearch is not rocket science, as we saw in a previous post: Dec 2nd: [EN] How to cURL Elasticsearch: Go forth to Shell.
One extra benefit is that cURL is programming-language-agnostic, so the tests can be easier to understand for people who come from different tech stacks.
Using cURL from Testcontainers isn’t much more difficult than from Bash; e.g. if you need to delete the books index, it can be done as:
elasticsearch.execInContainer(
    "curl", "https://localhost:9200/books", "-u", "elastic:changeme",
    "--cacert", "/usr/share/elasticsearch/config/certs/http_ca.crt",
    "-X", "DELETE"
);
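If you want the setup to fail fast when something unexpected happens, the ExecResult returned by execInContainer can be inspected. Below is a hedged sketch; the extra curl flags (-s, -o, -w) that make curl print only the HTTP status code are my addition, not part of the snippet above:
// ExecResult comes from org.testcontainers.containers.Container
Container.ExecResult result = elasticsearch.execInContainer(
    "curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
    "https://localhost:9200/books", "-u", "elastic:changeme",
    "--cacert", "/usr/share/elasticsearch/config/certs/http_ca.crt",
    "-X", "DELETE"
);
// 200 = index deleted, 404 = index didn't exist yet; both are fine before a test
String status = result.getStdout();
if (!status.equals("200") && !status.equals("404")) {
    throw new IllegalStateException("Unexpected status when deleting the books index: " + status);
}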
Batch as much as you can
Indexing single documents makes sense in a lot of cases, but loading test data isn’t one of them. Instead of making 1000 requests to index one document each, simply run a single request with 1000 documents against _bulk. Even with Testcontainers it’s not rocket science:
elasticsearch.execInContainer(
    "curl", "https://localhost:9200/_bulk?refresh=true", "-u", "elastic:changeme",
    "--cacert", "/usr/share/elasticsearch/config/certs/http_ca.crt",
    "-X", "POST",
    "-H", "Content-Type: application/x-ndjson",
    "--data-binary", "@/tmp/books.ndjson"
);
With this approach you can even add documents to many indices in one call!
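That works because the _bulk payload is newline-delimited JSON: each action line names the target index, the next line carries the document itself, and the file has to end with a newline. A hypothetical books.ndjson spanning two indices could look like this (index and field names are made up for illustration):
{ "index": { "_index": "books", "_id": "1" } }
{ "title": "Dune", "author": "Frank Herbert" }
{ "index": { "_index": "books", "_id": "2" } }
{ "title": "Hyperion", "author": "Dan Simmons" }
{ "index": { "_index": "authors", "_id": "herbert" } }
{ "name": "Frank Herbert", "born": 1920 }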
Be local
A CPU’s cache is much faster than RAM, and local storage is usually faster than the network. If you have ten use cases relying on the same data set, there’s really no point in sending the same data ten times to the same container (because we don’t create a new one for each test, right?).
For this reason, when creating your container, toss in .withCopyToContainer(...), so the file is copied to the container once, and then you can just use _bulk like in the step above. It looks like this:
static ElasticsearchContainer elasticsearch =
    new ElasticsearchContainer(ELASTICSEARCH_IMAGE)
        .withCopyToContainer(MountableFile.forHostPath("src/test/resources/books.ndjson"), "/tmp/books.ndjson");
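Since books.ndjson in src/test/resources also ends up on the test classpath, the classpath variant should work just as well; treat this as an equivalent sketch rather than a required change:
        .withCopyToContainer(MountableFile.forClasspathResource("books.ndjson"), "/tmp/books.ndjson");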
Copying the file to the container once matters especially in setups (like CI) where the container runtime isn’t local but runs on a different machine.
Recap
The ideas presented here are a reminder that the evergreen IT mantra, Don’t Repeat Yourself, also applies to initializing test data. Keep your data in bulk and keep it local, and you can shave a decent amount of time off the execution of your integration tests. For more insights, feel free to explore the GitHub repo with more examples and branches.