Performance of "exists" query

cweiske · September 1, 2016, 6:44pm

I've built a website search engine on top of Elasticsearch and have problems with crawling performance.

After extracting all URLs from an HTML page, I add them to ES. But first I check whether a document with the given URL already exists:

GET /document/_search/exists
{"query":{"filtered":{"filter":{"term":{"url":"https:\/\/example.org\/"}}}}}

This takes 40ms, and since I do this for each of the URLs this adds up quite a bit.

My current mapping for the url field is:

    "url": {
        "type": "string",
        "index": "not_analyzed",
        "boost": 1.5
    },

Is there anything I can do to make this query faster?

nik9000 · September 1, 2016, 7:10pm

If you indexed the documents with routing on the URL, then you could use the same routing on the query and only hit a single shard. That is fairly similar in concept to using the URL as the document's ID, but doesn't have problems with very long URLs.
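A minimal sketch of the routing idea in Python, only building the request paths. The `routing` query parameter is the real Elasticsearch parameter; the helper name `with_routing` and the type name `page` are hypothetical, and the index name `document` is assumed from the query in the question.

```python
from urllib.parse import urlencode


def with_routing(path, url):
    """Append the page URL as the routing value to a request path."""
    # urlencode percent-encodes ":" and "/" so the URL is safe in a query string.
    return path + "?" + urlencode({"routing": url})


page_url = "https://example.org/"

# Use the same routing value when indexing and when searching, so the
# search only has to visit the shard that can contain the document.
index_path = with_routing("/document/page", page_url)      # "page" is a hypothetical type name
search_path = with_routing("/document/_search", page_url)
```

The point of the design is that both requests derive the routing value from the same string; if they differ, the search may look at the wrong shard and miss the document.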

If you are doing this for lots of URLs at once, then you could send a multi search request and get all the results back at once. You'd want to set size=0&terminate_after=1, which is what the _search/exists API does anyway. Actually, _search/exists is deprecated with a note that says to use size=0&terminate_after=1, so maybe that is a good idea anyway.

I'd go with the multi search idea first because it is something you can do without changing your index. If that isn't good enough then look at the routing.
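The multi search suggestion could look roughly like this in Python. This is a sketch under assumptions: the function names are hypothetical, the index name `document` comes from the question, and a plain `term` query stands in for the `filtered` wrapper used in the original request. The body would be sent to the `_msearch` endpoint, which expects one header line and one search line per query, newline-delimited.

```python
import json


def build_msearch_body(urls, index="document"):
    """Build a newline-delimited _msearch body with one existence check per URL."""
    lines = []
    for url in urls:
        lines.append(json.dumps({"index": index}))   # header line for this search
        lines.append(json.dumps({
            "size": 0,               # no hits needed, only the count
            "terminate_after": 1,    # stop after the first match on each shard
            "query": {"term": {"url": url}},
        }))
    return "\n".join(lines) + "\n"   # the body must end with a newline


def existing_urls(urls, msearch_response):
    """Map each URL to True/False; responses come back in request order."""
    result = {}
    for url, resp in zip(urls, msearch_response["responses"]):
        total = resp["hits"]["total"]
        if isinstance(total, dict):  # newer versions report {"value": n, ...}
            total = total["value"]
        result[url] = total > 0
    return result
```

With this shape, a crawler makes one HTTP round trip per page instead of one per extracted link, which matters when per-request overhead dominates.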

cweiske · September 2, 2016, 6:41am

I tried the routing, but that does not make a difference. I'm also testing on a fairly tiny ES database with 200 documents.

I did not try multi search yet, but have the feeling that the HTTP request handling (albeit to 127.0.0.1) may account for most of the query time.
Using keep-alive might solve the problem here.

cweiske · September 2, 2016, 8:58am

I debugged this issue a bit more: Keep-Alive was used automatically, so this was not the problem.

I am using PHP 5.5.9 with the HTTP_Request2 library and the socket adapter to send the _search/exists GET query with the body.

Switching to the curl adapter broke things, because curl apparently has an issue with sending a GET request with a body. Switching to POST made it work with curl, too. The timing changed as well:

curl with POST: 2.8s
socket with POST: 18.8s

Then I optimized a bit and used the document's URL as its identifier, which means I can now look the document up directly by URL (GET /document/$url):

curl with HEAD on the URL: 1.8s
socket with HEAD on the URL: 1.3s
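A small Python sketch of building the path for such a lookup. It assumes the abbreviated `/document/$url` form quoted above; the helper name `document_path` is hypothetical. Because the URL contains `:` and `/`, it has to be percent-encoded before being used as a path segment.

```python
from urllib.parse import quote


def document_path(url, prefix="/document"):
    """Return the request path for a document whose ID is the page URL."""
    # safe="" makes quote() encode "/" as well, so the whole URL stays
    # a single path segment instead of being split into several.
    return prefix + "/" + quote(url, safe="")


path = document_path("https://example.org/")
# A HEAD request to this path only reports whether the document exists;
# no document body is transferred.
```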

Conclusion:

There is a performance problem in HTTP_Request2's socket adapter for requests with a body.
The curl adapter is roughly 7x faster for POST requests (2.8s vs. 18.8s).
The socket adapter is roughly 1.4x faster for HEAD requests (1.3s vs. 1.8s).

None of this had anything to do with Elasticsearch itself. Sorry.

nik9000 · September 2, 2016, 12:17pm

Using the URL as the document id is a bit dangerous because URLs can get
long but document IDs will have a limit in 5.0:
