[Node.js] Elasticsearch search latency degrades over time (within hours)

Problem

Elasticsearch search latency degrades over time, from ~2 ms to fluctuations between 50 and 100 ms, until we get circuit_breaking_exceptions.

Current Fix

When we restart our Node.js process (it runs within Kubernetes on AWS), latency drops to ~2 ms again until the problem shows its ugly face once more.

Stack

Node.js 14, latest elasticsearch-js client, latest Express, Alpine Linux, etc.

We've tried both CPU-optimized and high-I/O-optimized instances.

Scale

We have around 100-150 requests per second to the ES cluster. Every query is unique, with a unique latitude+longitude. I can share the mappings and queries on request, but I don't think they're the cause, since a Node.js restart "fixes" it; if our queries were non-performant, we'd expect to see the problem consistently, all the time.

Indexing

We reindex all of our documents every 20 minutes and then update aliases in a cronjob. We have three indexes: one with ~1k documents, one with ~60k documents, and one with around 60-70k documents.

Our client:

const {Client} = require('@elastic/elasticsearch');

const options = {
    maxRetries: 1,
    requestTimeout: 1500,
    agent: {
        timeout: 1500
    },
    cloud: {
        id: process.env.ELASTICSEARCH_CLOUD_ID
    },
    auth: {
        username: process.env.ELASTICSEARCH_USERNAME,
        password: process.env.ELASTICSEARCH_PASSWORD
    }
};

module.exports = new Client(options);

Index Settings

await es.indices.create({
    index: indexName,
    body: {
        settings: {
            'number_of_shards': 1,
            'number_of_replicas': 1,
            'refresh_interval': -1,
            'index.translog.sync_interval': '30s', // We've tried without this too
            'index.translog.durability': 'async', // We've tried without this too
            'analysis': analysis
        },
        mappings
    }
});

The reason for refresh_interval: -1 is that we refresh manually in the cronjob once reindexing is done.
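
A rough sketch of that cronjob step (index and alias names are placeholders; this is illustrative rather than our exact code):

const es = require('./client');

// Sketch only: refresh the freshly built index (refresh_interval is -1),
// then atomically swap the alias over to it. Called from the cronjob.
const swapIndex = async ({alias, oldIndex, newIndex}) => {
    await es.indices.refresh({index: newIndex});

    await es.indices.updateAliases({
        body: {
            actions: [
                {remove: {index: oldIndex, alias}},
                {add: {index: newIndex, alias}}
            ]
        }
    });
};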

Symptoms

The big drops in latency are when we redeploy Node.js.


What can cause this? Restarting Node.js over and over again doesn't really seem like a solution :sweat_smile:

Hello!
I'm investigating the issue you are facing, and as soon as I have updates I'll let you know.

A couple of questions:

  • Are you by any chance creating multiple instances of the client? In that case, connections will not be reused, and this might cause the high number of open requests.
  • Why are you setting the agent timeout to 1500ms?

Nice - thanks!

Nope, a single instance per server. We're currently running between 4 and 8 servers, depending on load.

The timeout setting is out of desperation. Initially, we didn't configure the agent at all; the timeout is an attempt to "fix" our issue. I suspected that the latency issue might be caused by sockets lying around.

Here's another screenshot from a redeploy I just did:

It's so weird how the latency drops just from redeploying Node.js.

It seems like I've fixed our issue by completely circumventing the @elastic/elasticsearch client and making this tiny search function:

const axios = require('axios');

const timeout = 2500;

module.exports = ({index, body}) => {
    return axios({
        method: 'post',
        url: `${process.env.ELASTICSEARCH_NODE}/${index}/_search`,
        timeout,
        params: {
            request_cache: true,
            allow_partial_search_results: false,
            timeout: `${timeout}ms`
        },
        data: body
    }).then((res) => {
        return {body: res.data};
    });
};

This completely removes our latency issues. It points to the problem being in the official Node.js client.

It seems like our new "client" didn't fix it. It just postponed the issue. I'm starting to think this might not even be related to Elasticsearch but something on our side that builds up over time.

I'm still investigating this issue, but it looks to me like there is a bug in the keep-alive agent the client is using (Node.js's core one).

To avoid creating a new connection for every request and putting Elasticsearch under too much pressure, we want to reuse connections, and to do that we use a keep-alive agent.
I think that your high request rate, combined with a socket timeout that is either not configured or set too high, will over time lead to the issue you are facing.
Looking at the keep-alive agent code, it uses a FIFO queue to handle the free sockets. If you are reusing them often enough (and the response time is high enough), the agent will never close them, and you will reach the maximum capacity of the agent (512 sockets by default).
Once you kill the process, all the sockets are closed as well, and as a consequence Elasticsearch is under lower pressure.
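
To put rough numbers on it (a back-of-envelope estimate, not measurements from your cluster): with up to 512 free sockets and 100-150 searches per second, a FIFO rotation picks each free socket up again roughly every 3-5 seconds, so unless the socket timeout is well below that, no socket ever sits idle long enough to be closed and the pool only grows. A LIFO queue would instead keep reusing the handful of most recently used sockets and let the rest time out.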

I'm experimenting with the agent, and it looks like a LIFO queue would fix the issue. Unfortunately, as far as I know, this is a bug that bites every agent implementation, even community ones.
Very likely, this issue needs to be fixed in Node core.

I think the reason your fix didn't solve the issue is that Axios does not use a keep-alive agent, so you were creating too many sockets over time.

I'll continue working on fixing this issue, but it will take some time.
Meanwhile, I think you have three ways of mitigating it:

  • Try to lower the request rate
  • Try to lower maxSockets and maxFreeSockets and set an aggressive timeout (I would try no more than 500ms)
  • Destroy and recreate the client Connection instance periodically, so you don't need to restart the whole process

@delvedor Thanks for looking into this. If it's using a FIFO queue, that would explain why things deteriorate over time, since we have a high request rate.

I'm deploying your suggestions now, with a promise queue (maxConcurrent=10, maxQueue=200) and these agent settings:

keepAlive: true,
keepAliveMsecs: 500,
timeout: 500,
maxFreeSockets: 10,
maxSockets: 150

This is our new client:

const os = require('os');
const axios = require('axios');
const axiosRetry = require('axios-retry');
const Agent = require('agentkeepalive').HttpsAgent;
const Queue = require('promise-queue');
const apiErrors = require('../api-errors');
const pkg = require('../../package');

const timeout = 2000;

const maxConcurrent = 10;

const maxQueue = 200;

const userAgent = `${pkg.name}[${process.env.NODE_ENV}]/${pkg.version} (${os.platform()} ${os.release()}-${os.arch()}; Node.js ${process.version})`;

const client = axios.create({
    method: 'post',
    timeout,
    maxContentLength: 50 * 1000 * 1000,
    maxRedirects: 10,
    headers: {
        'User-Agent': userAgent,
        'Accept-Encoding': 'gzip'
    },
    httpsAgent: new Agent({
        timeout,
        maxFreeSockets: 10,
        maxSockets: 100,
        freeSocketTimeout: 1000 * 30,
        socketActiveTTL: 1000 * 60 * 10
    })
});

const queue = new Queue(maxConcurrent, maxQueue);

axiosRetry(client, {
    retries: 3,
    retryDelay: axiosRetry.exponentialDelay
});

module.exports = async ({index, body}) => {
    return queue.add(() => {
        return client({
            url: `${process.env.ELASTICSEARCH_NODE}/${index}/_search`,
            params: {
                request_cache: false,
                track_total_hits: false,
                allow_partial_search_results: false,
                timeout: `${timeout}ms`
            },
            data: body
        }).then((res) => {
            return {body: res.data};
        }).catch((error) => {
            if (!error.response) {
                throw apiErrors.elasticsearchTimeout();
            } else {
                throw apiErrors.elasticsearchFailed({
                    message: error.response.data.error.type,
                    details: error.response.data.error.reason,
                    status: error.response.data.status
                });
            }
        });
    });
};

Let's see what happens. Using agentkeepalive might help us here because it will kill sockets.

I would recommend using the official client instead of Axios. You can configure a custom keep-alive agent easily:

const { Client } = require('@elastic/elasticsearch')
const Agent = require('agentkeepalive').HttpsAgent
const client = new Client({
  cloud: { id: 'id' },
  auth: { username: 'username', password: 'password' },
  agent () {
    return new Agent({
      timeout: 500,
      maxFreeSockets: 10,
      maxSockets: 100,
      freeSocketTimeout: 500,
      socketActiveTTL: 1000 * 60 * 10
    })
  }
})

Regarding the queue system, I would recommend using p-queue instead.

I'll make sure to revert to the official client. Why p-queue instead?

As far as I know, p-queue is safer against unhandled rejections, is actively maintained, and should be more performant.
Take a look at this article to learn more! :slight_smile:

Gotcha! Thanks for letting me know.

Deploying my Axios client from before didn't help. Deploying your code now:

client.js:

const Agent = require('agentkeepalive').HttpsAgent;
const {Client} = require('@elastic/elasticsearch');

module.exports = new Client({
    cloud: {
        id: process.env.ELASTICSEARCH_CLOUD_ID
    },
    auth: {
        username: process.env.ELASTICSEARCH_USERNAME,
        password: process.env.ELASTICSEARCH_PASSWORD
    },
    maxRetries: 2,
    requestTimeout: 2000,
    agent: () => {
        return new Agent({
            timeout: 500,
            maxFreeSockets: 10,
            maxSockets: 100,
            freeSocketTimeout: 500,
            socketActiveTTL: 1000 * 60 * 10
        });
    }
});

search.js:

const es = require('./client');
const {default: PQueue} = require('p-queue');
const apiErrors = require('../api-errors');

const timeout = 2000;

const queue = new PQueue({
    concurrency: 10,
    throwOnTimeout: true,
    timeout
});

module.exports = async ({index, body}) => {
    return queue.add(() => {
        return es.search({
            index,
            body,
            request_cache: false,
            track_total_hits: false,
            allow_partial_search_results: false,
            timeout: `${timeout}ms`
        }).catch((error) => {
            if (!error.meta) {
                throw apiErrors.elasticsearchTimeout();
            } else {
                throw apiErrors.elasticsearchFailed({
                    message: error.meta.body.error.type,
                    details: error.meta.body.error.reason,
                    status: error.meta.statusCode
                });
            }
        });
    });
};

Noooo! That didn't help either. Now trying this...

const https = require('https');
const {Client} = require('@elastic/elasticsearch');
const Queue = require('promise-queue');
const apiErrors = require('../api-errors');

const queue = new Queue(10, 200);

const ttl = 1000 * 60 * 10;

let lastTime = Date.now();

let client;

const createClient = () => {
    return new Client({
        cloud: {
            id: process.env.ELASTICSEARCH_CLOUD_ID
        },
        auth: {
            username: process.env.ELASTICSEARCH_USERNAME,
            password: process.env.ELASTICSEARCH_PASSWORD
        },
        maxRetries: 2,
        requestTimeout: 2000,
        agent: () => {
            return new https.Agent({
                keepAlive: true,
                timeout: 500,
                maxFreeSockets: 10,
                maxSockets: 50
            });
        }
    });
};

const getClient = () => {
    const now = Date.now();
    const currentClient = client;

    if (!client) {
        client = createClient();
    } else if (currentClient && (now - lastTime) > ttl) {
        lastTime = now;
        client = createClient();

        currentClient.close();
    }

    return client;
};

module.exports = async ({index, body}) => {
    return queue.add(() => {
        return getClient().search({
            index,
            body,
            request_cache: false,
            track_total_hits: false,
            allow_partial_search_results: false,
            timeout: '2000ms'
        }).catch((error) => {
            if (!error.meta) {
                throw apiErrors.elasticsearchTimeout();
            } else {
                throw apiErrors.elasticsearchFailed({
                    message: error.meta.body.error.type,
                    details: error.meta.body.error.reason,
                    status: error.meta.statusCode
                });
            }
        });
    });
};

Reverted to promise-queue for now. Looks like p-queue has some memory leaks/differences I don't have time to look into.

Damn. It still seems to be an issue:

Getting timeouts from time to time now too.

I fear it didn't work because of the FIFO scheduling I mentioned above; you could try an even lower timeout, maybe 50ms?

I've opened a PR in Node.js core with a proposal for introducing a different scheduling strategy; you can find it here.
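
For reference, on Node.js versions whose http/https Agent exposes a scheduling option, opting into LIFO would look roughly like this (a sketch, only applicable once/where that option is available):

const https = require('https');
const {Client} = require('@elastic/elasticsearch');

// Sketch: prefer the most recently used free socket (LIFO) so idle
// sockets can actually time out and close instead of being rotated forever.
module.exports = new Client({
    cloud: {id: process.env.ELASTICSEARCH_CLOUD_ID},
    auth: {
        username: process.env.ELASTICSEARCH_USERNAME,
        password: process.env.ELASTICSEARCH_PASSWORD
    },
    agent: () => new https.Agent({
        keepAlive: true,
        maxSockets: 100,
        maxFreeSockets: 10,
        scheduling: 'lifo'
    })
});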

Meanwhile, you can try using a patched agent to see if it mitigates the problem you are seeing; you can find it in this gist.

Given the high number of searches you are doing, I would recommend using the msearch API, which lets you send multiple search requests in a single call.
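
As a rough sketch of how that could look with the client (how you collect the pending searches is up to you; names here are placeholders):

const es = require('./client');

// Sketch: batch several pending searches into a single msearch call.
// `searches` is an array of {index, body} pairs collected elsewhere.
const msearch = async (searches) => {
    const lines = [];
    for (const {index, body} of searches) {
        lines.push({index});   // header line for this search
        lines.push(body);      // the search body itself
    }

    const {body: result} = await es.msearch({body: lines});
    return result.responses;   // one response per search, in order
};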

Furthermore, if you are indexing big amounts of data with Node, I would also recommend installing the 7.7.0-rc2 version of the client with npm i @elastic/elasticsearch@next, which provides a set of helpers. The bulk helper could help you improve performance and lower the number of sockets. You can find the documentation here.
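
For instance, a minimal sketch of the bulk helper (document source and index name are placeholders, and this assumes the 7.7 client mentioned above):

const es = require('./client');

// Sketch: index an array of documents through the client's bulk helper,
// which batches them into bulk requests over a limited number of sockets.
const reindex = async (documents, indexName) => {
    const stats = await es.helpers.bulk({
        datasource: documents,
        onDocument () {
            return {index: {_index: indexName}};
        }
    });

    console.log('indexed:', stats.successful, 'failed:', stats.failed);
};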

Anyhow, I'm puzzled that after destroying and recreating the client you are still seeing this issue. I'm wondering what is happening in the cluster during the spikes.

Thanks!

I've tried your patched agent with no effect:

(screenshot from 2020-05-07)

I'm going to deploy msearch to production now and cross my fingers.

I have an update on our issue.

First off, I'd like to thank you for your support and help. This issue has affected us for five weeks and it was very stressful. We felt like we had tried everything. We tried all of your suggestions and nothing worked.

By accident, I stumbled upon the issue though. We had a memory leak that in practice meant the payload we sent to our Elasticsearch cluster increased by a few bits with every request. That resulted in the increased latency, which makes sense. It was, however, hard for us to see what the issue was. We had just replaced our entire backend, so we couldn't point fingers at a specific line since every line was new. We didn't get any error messages saying the payload was too big, and we didn't have any instrumentation in place to catch this particular issue.

I guess this is an edge case. One possible solution is to chart the average payload size/content length sent to the Elasticsearch cluster on your metrics page. Another solution is to not create the leak in the first place :sweat_smile: A third could be to warn when identical sort objects are sent to ES (that was the leak).
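
To illustrate the kind of bug it was, here's a simplified, hypothetical reconstruction (not our actual code; names are made up):

const es = require('./client');

// Hypothetical: a module-level query template that is mutated on every
// request instead of being rebuilt.
const baseBody = {
    query: {match_all: {}},   // placeholder query
    sort: []
};

module.exports = ({index}) => {
    // Bug: pushing onto the shared template adds one more identical sort
    // clause per request, so the payload grows a little every time.
    baseBody.sort.push({createdAt: 'desc'});

    return es.search({index, body: baseBody});
};

// Fix: build a fresh body (or at least a fresh sort array) per request.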

I think your patch to Node.js still makes sense re LIFO/FIFO, and we'll use it once it becomes available, because we might still have the issue you were trying to solve. Right now, our metrics look fine though.

Sorry for "wasting" your time chasing a bug that ultimately was due to bad code on our side. I wish I had known.

Hello! No need to apologize, I'm very glad you were able to sort it out!
If it helps for the future, the client is observable and allows you to understand what's going on behind the scenes. Obviously, the hardest part is figuring out what you want to observe, which changes based on the problem you are facing.
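
For example, a minimal sketch of hooking into the client's events (field names here are from memory of the client's observability docs, so double-check them against your client version):

const client = require('./client');

// Sketch: log every response (or transport error) the client sees,
// so you can chart latency and spot anomalies without restarting anything.
client.on('response', (err, result) => {
    if (err) {
        console.warn('es error:', err.name, err.message);
    } else {
        console.log('es response status:', result.statusCode);
    }
});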