Search timeouts during indexing

I have an index of around 90 million documents in Elasticsearch. I manage it via django-elasticsearch-dsl. The settings are the following:

  • number_of_shards: 1
  • number_of_replicas: 0

The mapping is:

{
  "series" : {
    "mappings" : {
      "properties" : {
        "description" : {
          "type" : "text"
        },
        "all_actors" : {
          "type" : "text"
        },
        "episode_title" : {
          "type" : "text"
        },
        "actors_keyword" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "series_title" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "language" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "number_of_actors" : {
          "type" : "short"
        },
        "translated_title" : {
          "type" : "text"
        },
        "tags" : {
          "type" : "text"
        },
        "tags_keyword" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "url" : {
          "type" : "text"
        },
        "year" : {
          "type" : "short",
          "null_value" : 0
        }
      }
    }
  }
}

Now, the problem is: when I start adding new objects to PostgreSQL and signals are being sent to Elasticsearch, the search becomes extremely slow and starts throwing timeouts. The rate of adding new objects is moderate: around 300 objects per minute.

The same occurs when I delete even 100 objects from PostgreSQL. I am confident that this is due to the slow processing of signals. After some 30 seconds (I believe once the processing of the signals is over), everything works fine again, but if I have to process some 2 million objects over 2 weeks, Elasticsearch will be unresponsive for those 2 weeks too (at least).

What I have tried to remedy the issue:

  • setting refresh_interval to "60s";
  • increasing the JVM heap size twice (to 4 GB out of the total 16 GB available);
  • profiling the data from the slow logs: the results show that the queries are really quick, so the problem is not in the queries;
  • checking the shard health: no issues identified;
  • doing the same operations with other – smaller – indices. This shows that there are no speed problems with the smaller indices (or they are hardly noticeable).
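For reference, the refresh_interval change mentioned above is a plain index-settings update. A minimal sketch of the request body, assuming the official elasticsearch Python client and the index name "series" from the mapping (the host URL is a placeholder):

```python
# Sketch: request body for adjusting the refresh rate of the "series" index.
# Applying it needs a live cluster, e.g.:
#   from elasticsearch import Elasticsearch
#   Elasticsearch("http://localhost:9200").indices.put_settings(
#       index="series", body=build_refresh_settings("60s"))

def build_refresh_settings(interval: str) -> dict:
    """Body for PUT /series/_settings; "-1" disables automatic refreshes."""
    return {"index": {"refresh_interval": interval}}

print(build_refresh_settings("60s"))
```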

Could you help me understand what might be causing the timeouts in this case?

What is the output from the _cluster/stats?pretty&human API?

Here it is in the gist: https://gist.github.com/Dterb/6812ffee42ebd8e714eeefd5a7d70033


Thanks.

It's surprising that this is happening. What sort of storage does the node have?
Is there anything in your Elasticsearch logs?
What about your Elasticsearch slow logs and hot threads?
Are you using the Monitoring functionality?

What kind of hardware is the node deployed on? What type of storage are you using? Local SSD?

The last records in /var/log/elasticsearch/elasticsearch.log are an hour old: they just show that I restarted Elasticsearch. The records before those are 12 hours old, so no, there is nothing strange in there.

Slow logs: multiple queries consuming 40+ seconds. When I profile them, the results show that the queries themselves take from 1 to 4 seconds (as expected when no signals are being sent to the database).

Hot threads: here are two dumps in gists: one, two.

Monitoring functionality. No, I have not been using it.

I am using DigitalOcean: Shared CPU, 8 vCPUs, 16 GB RAM, 160 GB SSD, 6 TB Transfer.

The indices are stored on the droplet's SSD. The tablespaces for the underlying database objects are stored on a separate block storage volume (their size is far above the 160 GB of the SSD).

So you have slow storage fronted by an SSD cache? In that case it is possible that indexing and merging may affect the cache, causing slower searches until it has again loaded the most queried data. I would recommend monitoring iowait during indexing and initial querying to see if this spikes. If it does you may need to switch to faster storage if you are to eliminate these performance issues.
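To put numbers on the iowait suggestion above, one can sample the aggregate `cpu` line of `/proc/stat` on Linux, where the fifth value after the label is the cumulative iowait time in jiffies. A small parser sketch (the sample line below is made up for illustration):

```python
# Sketch: compute the iowait share from a /proc/stat "cpu" line.
# Fields after "cpu": user, nice, system, idle, iowait, irq, softirq, ...
def iowait_share(cpu_line: str) -> float:
    fields = [int(x) for x in cpu_line.split()[1:]]
    iowait = fields[4]
    total = sum(fields)
    return 100.0 * iowait / total

# Illustrative (made-up) sample: 50 of 1000 jiffies spent waiting on I/O.
sample = "cpu 400 0 300 250 50 0 0 0 0 0"
print(f"{iowait_share(sample):.1f}% iowait")
```

Sampling this line twice and diffing the counters gives the iowait share over an interval, which is what top's wa column reports.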

I've tried to measure the I/O wait with the top command and the wa value it shows.

In general, wa fluctuates between 0.3 and 5.5. When I am deleting hundreds of objects and searching in Elasticsearch at the same time, it rises to between 7.5 and 10.0. The greatest spike I saw was 12.7 %Cpu(s).

Does this seem to be the problem, or are these figures rather normal?

UPD. Actually, even several minutes later, when the indexing should have been over, wa still varies from 0.3 to 9.8.

UPD 2. The read metric for the vda device surges during these periods.

Then it sounds like I/O performance (or rather the lack of it) is the cause. Make sure you are using bulk requests when indexing/updating, as this may help some. To solve the problem you probably need better and faster storage, though. Indexing and subsequent merging is often quite I/O intensive.
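For the bulk-request suggestion, here is a sketch of building actions for the `elasticsearch` Python client's bulk helper instead of issuing one request per object. The document fields mirror the mapping above; the object shape and the sending step are assumptions:

```python
# Sketch: one bulk "index" action per object instead of one request each.
# Actually sending them requires a live cluster, e.g.:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch("http://localhost:9200"), build_actions(objects))

def build_actions(objects, index="series"):
    """Yield bulk actions; everything except "id" becomes the document source."""
    for obj in objects:
        yield {
            "_op_type": "index",
            "_index": index,
            "_id": obj["id"],
            "_source": {k: v for k, v in obj.items() if k != "id"},
        }

objects = [
    {"id": 1, "series_title": "Example", "year": 2020},
    {"id": 2, "series_title": "Another", "year": 2021},
]
actions = list(build_actions(objects))
print(len(actions), actions[0]["_index"])
```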


Thanks, your answer really helped.

The problem was in the frequency of refreshes (or, more precisely, in the I/O they caused) and in the fact that django_elasticsearch_dsl does not use the Bulk API by default.

Even '60s' was too frequent for a large index: the refreshes took all of the 60 seconds or even more, which made search unresponsive.

I had to:

  • disable all signals in django_elasticsearch_dsl (as they were triggering an individual action for every object);
  • set refresh_interval to '-1' (i.e. disable automatic refreshes).

Now I am using bulk updates, specifying manually which objects need to be indexed or deleted, and refreshing the indices once per week. That resolved the issue.
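For anyone landing here later, the resolution can be sketched as a settings fragment. `ELASTICSEARCH_DSL_AUTOSYNC` is the documented django-elasticsearch-dsl switch that disables its save/delete signal receivers globally; the document class and queryset names below are illustrative, not from the original post:

```python
# settings.py -- stop per-object signal handling in django-elasticsearch-dsl:
ELASTICSEARCH_DSL_AUTOSYNC = False

# Then, in a periodic task or management command (names are illustrative),
# push changed rows in bulk and refresh explicitly:
#
#   from myapp.documents import SeriesDocument   # hypothetical Document class
#   SeriesDocument().update(changed_queryset)    # uses the Bulk API internally
#   SeriesDocument._index.refresh()              # manual refresh, e.g. weekly
```

Combined with refresh_interval set to '-1' on the index, this keeps search responsive while indexing happens in controlled batches.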