Timeout error during indexing

I have installed Elasticsearch using the instructions from here. I run it using systemd. I have also made the following changes to the configuration:

  • Disabled swap
  • Changed the heap size to 20 GB
  • Set the refresh interval to -1 (see the sketch below)
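
For reference, the refresh interval is an index-level setting rather than a node-level one, so I apply it through the settings API instead of elasticsearch.yml. A minimal sketch using the Python client (elasticsearch-py 7.x style; the index name "corpora" is the one used further down):

from elasticsearch import Elasticsearch

client = Elasticsearch("http://myserver:9200")

# Disable automatic refresh on the target index while bulk indexing.
client.indices.put_settings(index="corpora", body={"index": {"refresh_interval": "-1"}})

# Re-enable refresh once indexing is finished.
client.indices.put_settings(index="corpora", body={"index": {"refresh_interval": "1s"}})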

Here's the config file:

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: search
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: myserver
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /extract/elasticsearch/data
#path.data: /var/data/elasticsearch
#
# Path to log files:
#
path.logs: /extract/elasticsearch/log
#path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#
bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 0
#
# Set a custom port for HTTP:
#
#http.port: 9201
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["0.0.0.0","myserver"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["myserver"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

I have already indexed 2.0 billion documents. However, during indexing, and sometimes during searching, I receive a timeout error. I index files using multiprocessing with Python. I use the Elasticsearch Python package and index files using:

streaming_bulk(client=client, index="corpora", actions=generator(), request_timeout=60)
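
For context, streaming_bulk returns a lazy generator, so nothing is sent until it is iterated. A minimal sketch of how I drive it in each worker process (the document fields in generator() are placeholders, not my real schema):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

client = Elasticsearch("http://myserver:9200")

def generator():
    # Placeholder documents; the real generator reads them from files.
    for i in range(1000):
        yield {"doc_id": i, "text": "..."}

# streaming_bulk is lazy, so the loop below is what actually sends the bulk requests.
failed = 0
for ok, result in streaming_bulk(client=client, index="corpora",
                                 actions=generator(), request_timeout=60):
    if not ok:
        failed += 1
print("failed actions:", failed)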

(Inserting documents one by one has the same issue, and searching sometimes times out as well.) The server I am using has 180 GB of memory and 88 CPUs. Elasticsearch is not the only process on this server, but it is the main one.

Here's the output of the index status from curl -X GET "myserver:9200/_cat/indices":

yellow open   corpora-split     tMdjFHfMR1OYfM_WptI0Kw   5   1 2005293875            0    672.2gb        672.2gb

Is there any suggestion on how I can avoid these timeout errors?

I would suggest using more than one shard for this; there is a hard limit of roughly 2^31 (about 2.1 billion) documents per shard anyway.

Yes, I just updated the question. I'm actually using an index with 5 shards, and it has already indexed 2.0 billion documents.

Ok. What do the Elasticsearch logs show?

672 GB across 5 shards means each shard is quite large. As you seem to be using a single node, this may be fine, since no relocations are likely. Although the ideal shard size depends on the use case, data, and queries, the recommended size is often around 50 GB or so.
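
If you do want to bring shards closer to that size, one option is to reindex into a new index created with more primary shards. A minimal sketch with the Python client (the target name "corpora-v2" and the shard count of 14 are assumptions, not a recommendation for your data):

from elasticsearch import Elasticsearch

client = Elasticsearch("http://myserver:9200")

# Create a new index with more primaries (name and counts are just examples).
client.indices.create(index="corpora-v2",
                      body={"settings": {"number_of_shards": 14, "number_of_replicas": 1}})

# Kick off a server-side reindex as a background task.
client.reindex(body={"source": {"index": "corpora-split"},
                     "dest": {"index": "corpora-v2"}},
               wait_for_completion=False)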


After increasing request_timeout in the following command from 60 to 600, the issue was resolved and I never got a timeout error again.

streaming_bulk(client=client, index="corpora", actions=generator(), request_timeout=600)
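
An alternative that works the same way is to set the timeout once on the client instead of per call, optionally with retries on timeout. A minimal sketch, assuming the 7.x Python client (the host name and retry count are just examples):

from elasticsearch import Elasticsearch

# Client-wide timeout plus retries, instead of passing request_timeout on every call.
client = Elasticsearch("http://myserver:9200",
                       timeout=600,
                       max_retries=3,
                       retry_on_timeout=True)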
