I have a custom native scoring script that is O(n) in complexity. It takes a single parameter, a hash, and calculates a score for each document based on the Euclidean distance between that hash and the hash stored in the document.
I have a two-node ES cluster with ~150 million documents in an index that has doc_values turned on.
When I run an exists query that targets about a third of these documents with the scoring script applied, the results take about 5-7 minutes to come back. Whilst it's running, I can see the disk utilisation going nuts in netdata (https://github.com/firehol/netdata).
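Roughly speaking, the request looks something like this (a sketch only - the index name, field name, script name and hash values are placeholders, and the exact script block depends on how the native script is registered):

curl -XPOST 'localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": { "exists": { "field": "hash" } },
      "script_score": {
        "script": {
          "lang": "native",
          "inline": "hash_distance",
          "params": { "hash": [17, 42, 63, 99] }
        }
      }
    }
  }
}'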
When I run it a second time with a different hash parameter on the scoring script, it takes approximately a third of the time, and the disk utilisation only goes nuts for the duration of the query.
When I run it a third time, again with a different hash parameter on the scoring script, it takes a few seconds and the disk utilisation spike is minimal.
I know from previous tests that after a while, the query will go back to taking a long time again.
I'm really curious to know what is going on here:
Is it taking less time because the fielddata is being loaded into memory?
Is it shard caching kicking in?
If that's true, why doesn't the second query take a few seconds?
What's causing the cool-down?
At what point will the cool-down process kick in?
I'd love to know what the internals are doing to make this all work the way it does. If anyone could provide insight, it'd be not only very useful but also fascinating.
Ha! No! The amount of time taken isn't due to the exists query, but more due to the O(n) nature of the scoring script. We're running v2.3.4 in this instance.
Ok, so are you suggesting that it's actually filesystem caching that speeds the query up on re-execution? And that cache eviction potentially causes it to slow back down again when those doc-values files aren't being accessed?
I'm still assuming that maybe shard caching is doing something. I'm just confused as to why it takes three executions to get the optimum speed for the query...
If you are indexing while searching, then background housekeeping tasks (segment merging) will be reorganizing the doc-values files, causing them to drop out of the filesystem cache.
That's interesting. I was indexing at the same time.
I have another cluster spun up from a snapshot that is not being indexed into. I'll run the query again against that and see how it performs.
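For what it's worth, I'll also keep an eye on merge activity while the query runs, with something along these lines (index name is a placeholder):

curl 'localhost:9200/my_index/_stats/merge?pretty'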
Let's try something else that helps rule your script out but still does the same disk accesses.
If you add a terms aggregation on the same field(s) your script accesses, we will touch the same data structures your script does, and for the same set of docs.
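Something along these lines should do it (using "hash" as a stand-in for whatever field your script actually reads):

curl -XPOST 'localhost:9200/my_index/_search?pretty' -d '{
  "size": 0,
  "query": { "exists": { "field": "hash" } },
  "aggs": {
    "hash_terms": {
      "terms": { "field": "hash", "size": 10 }
    }
  }
}'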
To be honest, I'm less concerned about the actual disk utilisation and more interested in the variation in the query speed itself. At first I thought that query speed was directly related to disk utilisation - hence the title of this post.
After some tests I did last night I expect there is a correlation, but they're not totally related. The query doesn't return for some time after work on the disk is done.
I'm just interested to know why the query takes a few executions to "warm up". I'm wondering if it's related to shard caching, and whether the second query is slower than the third because it hits different shards, which still have to load up their caches? That's total conjecture and a massively wild guess, but as I tune the query, configuration and underlying hardware I really want to understand the behaviour.
I'll still do the terms aggregation tests and see what happens there too.
If you have replicas then yes, that might explain some of the differences. The first query may hit the primary copy of a shard on node 1 and load a cache, while the second query may choose to hit the replica of that shard held on node 2.
To make sure you hit the same copy of a shard each time you test, you could try asking for the primary copy as a preference.
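For example (index name and query body are placeholders):

curl -XPOST 'localhost:9200/my_index/_search?preference=_primary_first&pretty' -d '{
  "query": { "exists": { "field": "hash" } }
}'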
[2017-02-06 12:10:08,485][WARN ][bootstrap ] Unable to lock JVM Memory: error=12,reason=Cannot allocate memory
[2017-02-06 12:10:08,486][WARN ][bootstrap ] This can result in part of the JVM being swapped out.
[2017-02-06 12:10:08,486][WARN ][bootstrap ] Increase RLIMIT_MEMLOCK, soft limit: 65536, hard limit: 65536
[2017-02-06 12:10:08,486][WARN ][bootstrap ] These can be adjusted by modifying /etc/security/limits.conf, for example:
# allow user 'elasticsearch' mlockall
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
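If I understand the docs correctly, for 2.x the memlock limits above only help if memory locking is also enabled, roughly:

# elasticsearch.yml
bootstrap.mlockall: true

# /etc/default/elasticsearch (or /etc/sysconfig/elasticsearch), when using the packaged init scripts
MAX_LOCKED_MEMORY=unlimited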
When I set vm.swappiness = 10 I get better utilisation results, comparing the before and after graphs.
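For reference, this is how I'm setting it (applies it immediately and persists it across reboots):

# apply now
sudo sysctl -w vm.swappiness=10
# keep it across reboots
echo 'vm.swappiness = 10' | sudo tee -a /etc/sysctl.conf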
If I specify the preference as _primary_first, then the utilisation graph for the second query looks much like the third did previously. And both of these are with indexing happening at the same time as well. I still need to test whether it would behave like this on the second query if I hadn't specified the preference but had swappiness turned down.
Either way, I feel like I'm heading in the right direction now. Thanks!