Performance of "exists" query

I've built a website search engine on top of Elasticsearch and have problems with crawling performance.

After extracting all URLs from an HTML page, I add them to ES. Before adding each one, I check whether a document with that URL already exists:

GET /document/_search/exists
{"query":{"filtered":{"filter":{"term":{"url":"https:\/\/example.org\/"}}}}}

This takes 40ms per query, and since I run it for every one of the URLs, it adds up quite a bit.

My current mapping for the url field is:

    "url": {
        "type": "string",
        "index": "not_analyzed",
        "boost": 1.5
    },

Is there anything I can do to make this query faster?

If you index the documents with routing based on the URL, you can pass the same routing value on the query and hit only a single shard. That is fairly similar in concept to using the URL as the document's ID, but it doesn't have problems with very long URLs.
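To make the routing idea concrete, here is a minimal Python sketch that only builds the request path, assuming the URL itself is used as the routing value. The helper names are made up for illustration; the same `routing` parameter would also have to be passed when indexing each document.

```python
from urllib.parse import urlencode


def routing_query_string(url):
    """Query string carrying the URL as the routing value (hypothetical helper)."""
    return urlencode({"routing": url})


def routed_exists_path(url, index="document"):
    """Search path that only hits the shard the URL was routed to.

    size=0 and terminate_after=1 ask for no hits and stop at the first match.
    """
    params = urlencode({"routing": url, "size": 0, "terminate_after": 1})
    return "/{}/_search?{}".format(index, params)
```

The term filter from the original query would still be sent as the request body; routing only narrows which shard is searched.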

If you are doing this for lots of URLs at once, you could send a single multi search request and get all the results back at once. You'd want to set size=0&terminate_after=1, which is what the _search/exists API does anyway. In fact, _search/exists is deprecated with a note that says to use size=0&terminate_after=1, so that is probably a good idea regardless.
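A rough sketch of what that could look like in Python, assuming the body is sent with POST to /_msearch and reusing the term filter from the question. The function names are hypothetical, and the response handling assumes each sub-response carries a `hits.total` count.

```python
import json


def build_msearch_body(urls, index="document"):
    """Build the newline-delimited body for a multi search request."""
    lines = []
    for url in urls:
        # Header line: which index this sub-search targets.
        lines.append(json.dumps({"index": index}))
        # Body line: same term filter as in the question, but returning
        # no hits and stopping after the first match on each shard.
        lines.append(json.dumps({
            "size": 0,
            "terminate_after": 1,
            "query": {"filtered": {"filter": {"term": {"url": url}}}},
        }))
    # The multi search body has to end with a newline.
    return "\n".join(lines) + "\n"


def existing_urls(urls, msearch_response):
    """Return the URLs whose sub-search reported at least one hit."""
    found = []
    for url, resp in zip(urls, msearch_response["responses"]):
        total = resp.get("hits", {}).get("total", 0)
        if isinstance(total, dict):
            # Some versions report the total as {"value": n, ...}.
            total = total["value"]
        if total > 0:
            found.append(url)
    return found
```

Responses come back in the same order as the sub-searches, which is why the URLs can simply be zipped with them.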

I'd go with the multi search idea first because it is something you can do without changing your index. If that isn't good enough, then look at the routing.

I tried the routing, but that did not make a difference. Note that I'm testing on a fairly tiny ES index with only 200 documents.

I have not tried multi search yet, but I have the feeling that the HTTP request handling (even though it goes to 127.0.0.1) accounts for most of the query time.
Using keep-alive might solve the problem here.

I debugged this issue a bit more: Keep-Alive was used automatically, so this was not the problem.

I am using PHP 5.5.9 with the HTTP_Request2 library and the socket adapter to send the _search/exists GET query with the body.

Switching to the curl adapter failed at first, because curl apparently has trouble sending a GET request with a body. Switching to POST made it work with curl, too. The timing changed as well:

  • curl with POST: 2.8s
  • socket with POST: 18.8s

Then I optimized a bit and used the document's URL as identifier, which means I can now fetch the document with the URL: GET /document/$url:

  • curl with HEAD on the URL: 1.8s
  • socket with HEAD on the URL: 1.3s
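Since the URL ends up inside the request path, it has to be percent-encoded first. A small Python sketch of building that path, mirroring the /document/$url layout above (the helper name is made up for illustration):

```python
from urllib.parse import quote


def document_path(url, index="document"):
    """Path for a document whose ID is the crawled URL itself."""
    # safe="" also encodes "/" so the ID stays a single path segment.
    return "/{}/{}".format(index, quote(url, safe=""))
```

A HEAD request on this path is answered with 200 if the document exists and 404 if it does not, so no response body has to be parsed.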

Conclusion:

  • There is a performance problem in HTTP_Request2's socket adapter for requests with a body.
  • The curl adapter is about 7x faster for POST requests (2.8s vs. 18.8s).
  • The socket adapter is about 1.4x faster for HEAD requests (1.3s vs. 1.8s).

This all had nothing to do with Elasticsearch itself. Sorry.

Using the URL as the document ID is a bit dangerous because URLs can get long, but document IDs will have a limit in 5.0: