If you route the documents by URL when indexing, you can pass the same routing value on the query and only hit a single shard. That is fairly similar in concept to using the URL as the document's ID, but it doesn't have problems with very long URLs.
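A rough sketch of what the routed query could look like, in Python just to show the request shape. The index name `pages` and the field name `url` are made up for illustration, and the field is assumed to be indexed as an exact (not analyzed) value so a term query matches it. The documents would need to be indexed with the same `routing` value for this to find them.

```python
import json
from urllib.parse import urlencode

INDEX = "pages"    # hypothetical index name
URL_FIELD = "url"  # hypothetical field, assumed to hold the exact URL string


def routed_exists_search(url):
    """Build (method, path, body) for an existence check routed by the URL."""
    # The routing value is the URL itself, so it must be percent-encoded.
    params = urlencode({"routing": url, "size": 0, "terminate_after": 1})
    path = "/{}/_search?{}".format(INDEX, params)
    body = json.dumps({"query": {"term": {URL_FIELD: url}}})
    return "POST", path, body
```

The request is built as a POST because some HTTP clients have trouble sending a body with GET.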
If you are doing this for lots of URLs at once, you could send a multi search request and get all the results back at once. You'd want to set size=0&terminate_after=1, which is what the _search/exists API does anyway. Actually, _search/exists is deprecated with a note that says to use size=0&terminate_after=1 instead, so that is probably a good idea anyway.
I'd go with the multi search idea first because it is something you can do without changing your index. If that isn't good enough, then look at the routing.
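Here is a minimal sketch of the multi search variant, again in Python and only meant to show the request and response shapes. The index name `pages` and field name `url` are assumptions, not anything from your setup. The body for `_msearch` is newline-delimited JSON: one header line and one search line per URL, terminated by a newline.

```python
import json


def build_msearch_body(urls, index="pages", field="url"):
    """Build the newline-delimited body for a POST to /_msearch."""
    lines = []
    for url in urls:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps({
            "size": 0,              # no hits needed, only the count
            "terminate_after": 1,   # stop each shard after the first match
            "query": {"term": {field: url}},
        }))
    return "\n".join(lines) + "\n"  # the body has to end with a newline


def existing_urls(urls, msearch_response):
    """Return the URLs whose search reported at least one hit."""
    found = set()
    for url, resp in zip(urls, msearch_response["responses"]):
        total = resp.get("hits", {}).get("total", 0)
        if isinstance(total, dict):  # newer versions wrap the count in an object
            total = total.get("value", 0)
        if total > 0:
            found.add(url)
    return found
```

The responses come back in the same order as the searches, which is why `zip` is enough to pair them up. A search that failed has no `hits` entry and is treated as "not found" here, which may or may not be what you want.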
I tried the routing, but that does not make a difference. I'm also testing on a fairly tiny ES database with 200 documents.
I did not try multi search yet, but I have the feeling that the HTTP request handling (albeit to 127.0.0.1) adds most of the query time.
Using keep-alive might solve the problems here.
I debugged this issue a bit more: Keep-Alive was used automatically, so this was not the problem.
I am using PHP 5.5.9 with the HTTP_Request2 library and the socket adapter to send the _search/exists GET query with the body.
Switching to the curl adapter broke things, because curl apparently has an issue sending GET requests with a body. Switching to POST made it work with curl, too. The timing changed as well:
curl with POST: 2.8s
socket with POST: 18.8s
Then I optimized a bit and used the document's URL as its identifier, which means I can now fetch the document by its URL (GET /document/$url):
curl with HEAD on the URL: 1.8s
socket with HEAD on the URL: 1.3s
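For reference, a small sketch of the HEAD check in Python (my actual code is PHP). The `/document/` prefix is just the path from above, and the function names are made up for this example. The important detail is that the URL used as the ID has to be fully percent-encoded, slashes included, or it will be read as extra path segments.

```python
import http.client
from urllib.parse import quote


def document_path(url, prefix="/document/"):
    """Build the request path for a document whose ID is the given URL."""
    # safe="" also encodes "/" so the whole URL stays one path segment.
    return prefix + quote(url, safe="")


def document_exists(conn, url):
    """Send a HEAD request on an open connection; 200 means the document exists."""
    conn.request("HEAD", document_path(url))
    response = conn.getresponse()
    response.read()  # finish the response so the connection can be reused
    return response.status == 200
```

Usage would be something like `conn = http.client.HTTPConnection("127.0.0.1", 9200)` followed by repeated `document_exists(conn, url)` calls on the same connection, assuming Elasticsearch listens on its default port.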
Conclusion:
There is a performance problem in HTTP_Request2's socket adapter for requests with a body.
The curl adapter is roughly 7x faster for POST requests (2.8s vs. 18.8s).
The socket adapter is about 1.4x faster for HEAD requests (1.3s vs. 1.8s).
This all had nothing to do with Elasticsearch itself. Sorry.