Search timeouts during indexing

I have an index of around 90 million documents in Elasticsearch. I manage it via django-elasticsearch-dsl. The settings are the following:

  • number_of_shards: 1
  • number_of_replicas: 0

The mapping is:

{
  "series" : {
    "mappings" : {
      "properties" : {
        "description" : {
          "type" : "text"
        },
        "all_actors" : {
          "type" : "text"
        },
        "episode_title" : {
          "type" : "text"
        },
        "actors_keyword" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "series_title" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "language" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "number_of_actors" : {
          "type" : "short"
        },
        "translated_title" : {
          "type" : "text"
        },
        "tags" : {
          "type" : "text"
        },
        "tags_keyword" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "url" : {
          "type" : "text"
        },
        "year" : {
          "type" : "short",
          "null_value" : 0
        }
      }
    }
  }
}

Now, the problem is: when I start adding new objects to PostgreSQL and signals are being sent to Elasticsearch, the search becomes extremely slow and starts throwing timeouts. The rate of adding new objects is moderate: around 300 objects per minute.

The same occurs when I delete even 100 objects from PostgreSQL. I am confident that this is due to the slow processing of signals. After some 30 seconds (I believe once the processing of the signals is over), everything works fine again, but if I have to process some 2 million objects over 2 weeks, Elasticsearch will be unresponsive for those 2 weeks too (at least).

What I have tried to remedy the issue:

  • setting refresh_interval to "60s";
  • increasing the JVM heap size twice (to 4 GB out of the total 16 GB available);
  • profiling the data from the slow logs: the results show that the queries are really quick, so the problem is not in the queries;
  • checking the shard health: no issues identified;
  • doing the same operations with other – smaller – indices. This shows that there are no speed problems with the smaller indices (or they are hardly noticeable).
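For reference, the refresh_interval change mentioned above is a plain index-settings update. A minimal sketch of the request body, assuming the official elasticsearch Python client and the index name "series" from the mapping (the host URL is a placeholder):

```python
# Sketch: request body for adjusting the refresh rate of the "series" index.
# Applying it needs a live cluster, e.g.:
#   from elasticsearch import Elasticsearch
#   Elasticsearch("http://localhost:9200").indices.put_settings(
#       index="series", body=build_refresh_settings("60s"))

def build_refresh_settings(interval: str) -> dict:
    """Body for PUT /series/_settings; "-1" disables automatic refreshes."""
    return {"index": {"refresh_interval": interval}}

print(build_refresh_settings("60s"))
```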

Could you help me understand what might be causing the timeouts in this case?

What is the output from the _cluster/stats?pretty&human API?

Here it is in the gist: https://gist.github.com/Dterb/6812ffee42ebd8e714eeefd5a7d70033


Thanks.

It's surprising that this is happening. What sort of storage does the node have?
Is there anything in your Elasticsearch logs?
What about your Elasticsearch slow logs and hot threads?
Are you using the Monitoring functionality?

What kind of hardware is the node deployed on? What type of storage are you using? Local SSD?

The last records in /var/log/elasticsearch/elasticsearch.log are an hour old: they just show that I restarted Elasticsearch. The records before those are 12 hours old, so no, there is nothing strange in there.

Slow logs: multiple queries consuming 40+ seconds. When I profile them, the results show that the queries themselves take from 1 to 4 seconds (as expected when no signals are being sent to the database).

Hot threads: here are two dumps in gists: one, two.

Monitoring functionality. No, I have not been using it.

I am using DigitalOcean: Shared CPU, 8 vCPUs, 16 GB RAM, 160 GB SSD, 6 TB Transfer.

The indices are stored on the droplet's SSD. The tablespaces for the underlying database objects are stored on a separate block storage volume (their size is far above the 160 GB of the SSD).

So you have slow storage fronted by an SSD cache? In that case it is possible that indexing and merging may affect the cache, causing slower searches until it has again loaded the most queried data. I would recommend monitoring iowait during indexing and initial querying to see if this spikes. If it does you may need to switch to faster storage if you are to eliminate these performance issues.
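To put numbers on the iowait suggestion above, one can sample the aggregate `cpu` line of `/proc/stat` on Linux, where the fifth value after the label is the cumulative iowait time in jiffies. A small parser sketch (the sample line below is made up for illustration):

```python
# Sketch: compute the iowait share from a /proc/stat "cpu" line.
# Fields after "cpu": user, nice, system, idle, iowait, irq, softirq, ...
def iowait_share(cpu_line: str) -> float:
    fields = [int(x) for x in cpu_line.split()[1:]]
    iowait = fields[4]
    total = sum(fields)
    return 100.0 * iowait / total

# Illustrative (made-up) sample: 50 of 1000 jiffies spent waiting on I/O.
sample = "cpu 400 0 300 250 50 0 0 0 0 0"
print(f"{iowait_share(sample):.1f}% iowait")
```

Sampling this line twice and diffing the counters gives the iowait share over an interval, which is what top's wa column reports.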

I've tried to measure the I/O wait with the top command and the wa value it shows.

In general, wa fluctuates between 0.3 and 5.5. When I am deleting hundreds of objects and searching in Elasticsearch at the same time, it rises to between 7.5 and 10.0. The greatest spike I saw was 12.7 %Cpu(s).

Does this seem to be the problem, or are these figures rather normal?

UPD. Actually, even several minutes later, when the indexing should have been over, wa still varies from 0.3 to 9.8.

UPD 2. The read metric for the vda device surges during these periods.

Then it sounds like I/O performance (or rather the lack of it) is the cause. Make sure you are using bulk requests when indexing/updating, as this may help some. To solve the problem you probably need better and faster storage, though. Indexing and subsequent merging is often quite I/O intensive.
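For the bulk-request suggestion, here is a sketch of building actions for the `elasticsearch` Python client's bulk helper instead of issuing one request per object. The document fields mirror the mapping above; the object shape and the sending step are assumptions:

```python
# Sketch: one bulk "index" action per object instead of one request each.
# Actually sending them requires a live cluster, e.g.:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch("http://localhost:9200"), build_actions(objects))

def build_actions(objects, index="series"):
    """Yield bulk actions; everything except "id" becomes the document source."""
    for obj in objects:
        yield {
            "_op_type": "index",
            "_index": index,
            "_id": obj["id"],
            "_source": {k: v for k, v in obj.items() if k != "id"},
        }

objects = [
    {"id": 1, "series_title": "Example", "year": 2020},
    {"id": 2, "series_title": "Another", "year": 2021},
]
actions = list(build_actions(objects))
print(len(actions), actions[0]["_index"])
```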


Thanks, your answer really helped.

The problem was in the frequency of refreshes (or, more precisely, in the I/O they caused) and in the fact that django_elasticsearch_dsl does not use the Bulk API by default.

Even '60s' was too frequent for a large index: the refreshes took all of the 60 seconds or even more, which made search unresponsive.

I had to:

  • disable all signals in django_elasticsearch_dsl (as they were triggering an individual action for every object);
  • set refresh_interval to '-1' (i.e. disable automatic refreshes).

Now I am using bulk updates, specifying manually which objects need to be indexed or deleted, and refreshing the indices once per week. That resolved the issue.
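For anyone landing here later, the resolution can be sketched as a settings fragment. `ELASTICSEARCH_DSL_AUTOSYNC` is the documented django-elasticsearch-dsl switch that disables its save/delete signal receivers globally; the document class and queryset names below are illustrative, not from the original post:

```python
# settings.py -- stop per-object signal handling in django-elasticsearch-dsl:
ELASTICSEARCH_DSL_AUTOSYNC = False

# Then, in a periodic task or management command (names are illustrative),
# push changed rows in bulk and refresh explicitly:
#
#   from myapp.documents import SeriesDocument   # hypothetical Document class
#   SeriesDocument().update(changed_queryset)    # uses the Bulk API internally
#   SeriesDocument._index.refresh()              # manual refresh, e.g. weekly
```

Combined with refresh_interval set to '-1' on the index, this keeps search responsive while indexing happens in controlled batches.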