Disk utilisation behaviour when using a custom scoring script

I have a custom native scoring script with O(n) complexity. It takes a hash as a parameter and calculates a score for each document based on the Euclidean distance between the hash parameter and the hash stored in the document.

I have a 2-node Elasticsearch cluster with ~150 million documents in an index that has doc_values turned on.

When I run an exists query that targets about 1/3rd of these documents, with the scoring script applied, the results take about 5-7 minutes to come back. Whilst it's running, I can see the disk utilisation going nuts in netdata (https://github.com/firehol/netdata).
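For context, the request is roughly this shape (a simplified sketch: myindex, the hash field, the native script name hash_distance and the parameter value are placeholders rather than the real names):

curl -XPOST 'localhost:9200/myindex/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": { "exists": { "field": "hash" } },
      "script_score": {
        "script": {
          "inline": "hash_distance",
          "lang": "native",
          "params": { "hash": "<hash to compare against>" }
        }
      }
    }
  }
}'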

When I run it a second time with a different hash parameter on the scoring script, it takes approximately 1/3rd the time and the disk utilisation only goes nuts for the duration of the query.

When I run it a third time, again with a different hash parameter on the scoring script, it takes a few seconds and the disk utilisation spike is minimal.

I know from previous tests that after a while, the query will go back to taking a long time again.

I'm really curious to know what is going on here:

  1. Is it taking less time because the fielddata is being loaded into memory?
  2. Is it shard caching kicking in?
  3. If that's true, why doesn't the second query take a few seconds?
  4. What's causing the cool-down?
  5. At what point will the cool-down process kick in?

I'd love to know what process the internal parts are using to make this all work like it does. If anyone could provide insight, it'd not only be very useful, but also fascinating.

Many thanks for your time.

Whoah. You're not running an old version, are you? The exists query got a major performance boost in 1.3.

Ha! No! The amount of time taken isn't due to the exists query, but more due to the O(n) nature of the scoring script. We're running v2.3.4 in this instance.

Bear in mind that because we hit doc values hard now and they are on disk, we are relying on the operating system's file system cache for a lot of this.
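One rough way to see how much of that data is resident in the page cache between runs (just a sketch; it assumes the data path is /var/lib/elasticsearch and that the vmtouch utility is installed):

free -m                          # watch the cached memory figure move while the query runs
vmtouch /var/lib/elasticsearch   # reports how much of the data under that path is resident in memory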

Ok, so are you suggesting that it's actually filesystem caching that speeds the query up on re-execution? And that potentially cache eviction causes it to slow back down again when those doc values files aren't being accessed?

I'm still assuming that maybe shard caching is doing something. I'm just confused as to why it takes three executions to get the optimum speed for the query...
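One way to check whether fielddata or the shard request cache is actually involved would be to watch the cache stats between runs; something like this (assuming the default HTTP port):

curl 'localhost:9200/_cat/fielddata?v'                                   # per-node fielddata usage
curl 'localhost:9200/_stats/fielddata,query_cache,request_cache?pretty'  # per-index cache sizes, hits and evictions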

If you are indexing while searching, then background housekeeping tasks (segment merging) will be reorganizing the doc values files, causing them to drop out of the file system cache.
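You can watch whether merges are actually running during a search; for example (default port assumed):

curl 'localhost:9200/_nodes/stats/indices?pretty'   # the "merges" section shows "current" (in-flight merges) and totals
curl 'localhost:9200/_cat/segments?v'               # the segment list shifts around as merges complete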

That's interesting. I was indexing at the same time.
I have another cluster run up with a snapshot that is not being indexed into. I'll run the query again against that and see how it performs.

After trying again on a system that is not indexing, the disk utilisation behaviour doesn't appear to change.

The initial query:

The second and third executions (the second is the big lump between around 19:12:15 and 19:14:35, the third is the tiny spike at 19:17:00):

Let's try something else that helps rule your script out but still does the same disk accesses.

If you add a terms aggregation on the same field(s) your script accesses, we will touch the same data structures your script does, for the same set of docs.
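Something like this should exercise the same doc values for the same set of docs (a sketch; myindex and hash are placeholders for your index and the field(s) the script reads, and "size": 0 just suppresses the hits):

curl -XPOST 'localhost:9200/myindex/_search?pretty' -d '{
  "size": 0,
  "query": { "exists": { "field": "hash" } },
  "aggs": {
    "hash_values": {
      "terms": { "field": "hash" }
    }
  }
}'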

Sure, I can do that.

To be honest, I'm less concerned about the actual disk utilisation and more interested in the variation in the query speed itself. At first, I thought that query speed was directly related to disk utilisation - hence the title of this post.
After some tests I did last night, I expect there is a correlation, but they're not totally related. The query doesn't return for some time after work on the disk is done.

I'm just interested to know why the query takes a few executions to "warm up". I'm wondering if it's related to shard caching, and whether the second query is slower than the third because it queries different shards that still have to load up their caches? That's total conjecture and a massively wild guess, but as I tune the query, configuration and underlying hardware, I really want to understand the behaviour.

I'll still do the term query tests and see what happens there too.

If you have replicas then yes, that might explain some of the differences. The first query may hit the primary copy of a shard on node 1 and load a cache, while the second query may choose to hit the other replica of that shard held on node 2.

To make sure you hit the same copy of a shard each time you test, you could try asking for the primary copy as a preference.
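For example (myindex is a placeholder):

curl -XPOST 'localhost:9200/myindex/_search?preference=_primary_first&pretty' -d '{ ...your query... }'

A fixed custom string (e.g. preference=my_test_run) would also pin every request to the same set of shard copies.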

Thanks for that!

Interestingly, even though I've configured mlockall in limits.d/elasticsearch.conf like so:

elasticsearch   soft    memlock unlimited
elasticsearch   hard    memlock unlimited

it doesn't look like it's being applied, despite the process running as the elasticsearch user:

ubuntu@server:~$ ps aux | grep java
ubuntu    1312  0.0  0.0  12916  1032 pts/0    R+   11:12   0:00 grep --color=auto java
elastic+ 25873 33.5 50.4 202993832 15593928 ?  Ssl  Feb06 463:33 /usr/bin/java -Xms14g -Xmx14g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/elasticsearch-2.3.4.jar:/usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch start -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.conf=/etc/elasticsearch

The errors are still there in the logs:

[2017-02-06 12:10:08,485][WARN ][bootstrap                ] Unable to lock JVM Memory: error=12,reason=Cannot allocate memory
[2017-02-06 12:10:08,486][WARN ][bootstrap                ] This can result in part of the JVM being swapped out.
[2017-02-06 12:10:08,486][WARN ][bootstrap                ] Increase RLIMIT_MEMLOCK, soft limit: 65536, hard limit: 65536
[2017-02-06 12:10:08,486][WARN ][bootstrap                ] These can be adjusted by modifying /etc/security/limits.conf, for example: 
	# allow user 'elasticsearch' mlockall
	elasticsearch soft memlock unlimited
	elasticsearch hard memlock unlimited
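My working assumption is that the limits.d entry isn't applied because the node is started by the init system rather than through a login session, so pam_limits never sees it. On a Debian/Ubuntu package install, the usual places to raise the limit would be roughly the following (not yet verified here):

# /etc/default/elasticsearch
MAX_LOCKED_MEMORY=unlimited

# /etc/elasticsearch/elasticsearch.yml
bootstrap.mlockall: true

# only if the node is managed by systemd: run "systemctl edit elasticsearch" and add
[Service]
LimitMEMLOCK=infinity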

When I set vm.swappiness = 10, I get better utilisation results; compare (the sysctl change itself is sketched after the graphs):
Before:


After:

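For completeness, the swappiness change itself was just the usual sysctl; roughly:

sudo sysctl -w vm.swappiness=10                                        # apply immediately
echo 'vm.swappiness = 10' | sudo tee /etc/sysctl.d/60-swappiness.conf  # persist across reboots (file name is arbitrary)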
If I specify a preference of _primary_first, then the utilisation graph for the second query looks much like the third one did previously. Both of these runs were with indexing happening at the same time as well. I still need to test whether the second query would behave the same way without the preference set but with swappiness turned down.

Either way, I feel like I'm heading in the right direction now. Thanks!
