I'm using ElasticSearch 5.2.1 (4 GB of RAM, single node on a 16 GB machine with an SSD) and have the following problem:
I'm using the Java client to communicate with ElasticSearch. One of my Maven projects has a stress test that runs as part of the Maven build. The test performs the following steps:
0. Clean restart of ElasticSearch
1. Insert 10000 documents that represent "users". The users are stored in bulk.
2. Insert 10000 documents that represent "applications". These applications are linked to users, and the user id is stored as part of the application document. To get the id, a query is performed on the user documents by key. This means that to create 10000 application documents, 10000 user queries are performed. Once the application documents are prepared, they are stored in bulk.
After these actions the test is finished and the ElasticSearch connection is closed.
The weird thing is that ElasticSearch then starts to consume CPU constantly (between 8 and 30 %) and keeps doing so until I stop the ElasticSearch process and restart it. Then everything is back to normal. Is this a bug?
When I perform only the first step of the test (insert the 10000 user documents), even multiple times, I don't get this strange behavior. It seems as if the quick succession of queries causes the CPU hogging problem.
I know I can easily optimize the application with a small cache, but that is not the purpose of this test. The goal here is to stress ElasticSearch.
One more thing: when I wait long enough (around 30 minutes!), the CPU usage drops back to normal.
I have attached a picture of the CPU and heap usage taken from VisualVM. The CPU graph shows the constant CPU usage after the peak that represents the bulk store (between 06:30:00 and 06:30:30).
A couple of suggestions to help you find out what the problem is:
Can you correlate the actions of your test to these graphs (i.e. when does each step start and end)?
While Elasticsearch is consuming a lot of CPU, you can repeatedly call the hot_threads API (GET /_nodes/hot_threads) to find out what it is actually doing.
Is the following part of your test or was it just easier to implement?
"To get the id a query is performed on the user documents by key. This means that to create 10000 application documents 10000 user queries are performed."
I ask because you could do this much more cheaply with a single match_all query and a scroll.
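A minimal sketch of that idea against the REST search and scroll endpoints, using only the Python standard library. The index name `users`, the field name `key`, and the use of the document `_id` as the user id are assumptions for illustration, as are the helper names `post_json` and `load_user_ids`; adapt them to your mapping.

```python
import json
import urllib.request


def post_json(url, body, method="POST"):
    """Send a JSON body to Elasticsearch and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def load_user_ids(base_url, index="users", key_field="key",
                  page_size=1000, keep_alive="1m", send=post_json):
    """Return {user key: document _id} using one match_all query and a scroll.

    `index` and `key_field` are hypothetical names, not taken from the thread.
    """
    base_url = base_url.rstrip("/")
    # The first request opens the scroll context and returns the first page.
    page = send(
        f"{base_url}/{index}/_search?scroll={keep_alive}",
        {"size": page_size, "sort": ["_doc"], "query": {"match_all": {}}},
    )
    scroll_id = page.get("_scroll_id")
    ids_by_key = {}
    try:
        # Keep fetching pages until one comes back empty.
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                ids_by_key[hit["_source"][key_field]] = hit["_id"]
            page = send(
                f"{base_url}/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the scroll context instead of waiting for it to expire.
        if scroll_id is not None:
            send(f"{base_url}/_search/scroll", {"scroll_id": [scroll_id]}, "DELETE")
    return ids_by_key
```

With the resulting dictionary in memory, each application document can pick up its user id with a plain lookup, so the 10000 individual queries become roughly ten scroll requests at the page size shown.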
As I said this is a stress test. I know this can be implemented a lot
better but that is not the point. I want to stress ElasticSearch on purpose.
The graph shows a peak from 06:30:00 to 06:30:30, a timeframe of 30 seconds
that correlates with my test. Afterwards I don't do anything anymore, but
for half an hour ElasticSearch keeps consuming a lot of CPU (8-30 %).
I see no reason for this behavior. As you can also see, there is almost no GC
activity.
OK, I was not sure whether this was really part of your stress test. To be honest, I think it's good to do stress tests (this is what I do at Elastic almost day in and day out, after all), but shouldn't the stress test be realistic, i.e. stress Elasticsearch in a way that matters to you? What does it buy you to know that Elasticsearch acts strangely when you issue 10,000 queries if you don't use it that way in production? If you're interested in how we run benchmarks, you should check out Rally (of which I'm the main author).
To analyze what's going on, I still suggest you issue a few hot_threads calls while the CPU usage is elevated. They show what the busiest threads on the node are doing.
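One way to collect those samples is a small script that calls the nodes hot threads API a few times with a pause in between. This is only a sketch using the Python standard library; the function names, the number of samples, and the pause length are arbitrary choices for illustration, and the node address is whatever your setup uses.

```python
import time
import urllib.request
from urllib.parse import urlencode


def hot_threads_url(base_url, threads=3, interval="500ms"):
    """Build the URL of the nodes hot threads API for the given node address."""
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def sample_hot_threads(base_url, samples=5, pause_seconds=10.0, fetch=None):
    """Collect several plain-text hot threads reports, pausing between calls.

    `fetch` can be replaced with any callable that takes a URL and returns text.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read().decode("utf-8")
    reports = []
    for i in range(samples):
        reports.append(fetch(hot_threads_url(base_url)))
        # Do not wait after the last sample.
        if i < samples - 1:
            time.sleep(pause_seconds)
    return reports
```

Running `sample_hot_threads("http://localhost:9200")` during the half hour of elevated CPU and comparing the reports should show whether the same thread names and stack traces keep appearing.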