I have installed Elasticsearch using the instructions from here. I run it using systemd. I have also made the following changes to the configuration:
- Disabled swap
- Changed the heap size to 20 GB
- Set the refresh interval to -1 (see the sketch after this list)
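For context, the refresh-interval change can be applied dynamically through the Elasticsearch Python client; here is a minimal sketch (the host and index name are illustrative, matching the setup described below):

from elasticsearch import Elasticsearch

client = Elasticsearch("http://myserver:9200")

# Disable automatic refreshes while bulk indexing; set this back to a
# positive value (e.g. "1s") once indexing is done
client.indices.put_settings(
    index="corpora",
    body={"index": {"refresh_interval": "-1"}},
)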
Here's the config file:
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what you are trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: search
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: myserver
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /extract/elasticsearch/data
#path.data: /var/data/elasticsearch
#
# Path to log files:
#
path.logs: /extract/elasticsearch/log
#path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#
bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 0
#
# Set a custom port for HTTP:
#
#http.port: 9201
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["0.0.0.0","myserver"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["myserver"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
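To confirm that memory locking and the heap size actually took effect, the nodes info API can be queried, roughly like this (a sketch using the same illustrative host as above):

from elasticsearch import Elasticsearch

client = Elasticsearch("http://myserver:9200")

# mlockall reports whether bootstrap.memory_lock succeeded;
# heap_max_in_bytes reflects the configured JVM heap
info = client.nodes.info()
for node_id, node in info["nodes"].items():
    print(node["name"], node["process"]["mlockall"],
          node["jvm"]["mem"]["heap_max_in_bytes"])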
I have already indexed 2.0 billion documents. However, during indexing, and sometimes during searching, I receive timeout errors. I index the documents with Python using multiprocessing and the official Elasticsearch Python client:
streaming_bulk(client=client, index="corpora", actions=generator(), request_timeout=60)
(inserting documents one by one has the same issue, and searches sometimes time out as well). The server I am using has 180 GB of memory and 88 CPUs. Elasticsearch is not the only process on this server, but it is the main one.
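For completeness, the indexing loop looks roughly like this; note that streaming_bulk returns a lazy generator, so nothing is sent until it is iterated (the generator body and host below are illustrative stand-ins):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

client = Elasticsearch("http://myserver:9200")

def generator():
    # Illustrative stand-in for the real document source
    for i in range(1000):
        yield {"_id": i, "text": "document %d" % i}

# Each iteration sends/acknowledges one action; with the default
# raise_on_error=True, a failed document raises an exception
for ok, result in streaming_bulk(client=client, index="corpora",
                                 actions=generator(), request_timeout=60):
    pass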
Here's the index status from curl -X GET "myserver:9200/_cat/indices":
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open corpora-split tMdjFHfMR1OYfM_WptI0Kw 5 1 2005293875 0 672.2gb 672.2gb
Are there any suggestions on how I can avoid these timeout errors?