Elasticsearch bulk indexing is not scaling

Hi All,

I have set up a single-node Elasticsearch instance on a 16-core, 32 GB machine. I have given 16 GB as the ES_HEAP_SIZE. I am using ES 1.7.1.

I have another application which sends metric information to Elasticsearch using the BulkProcessor. By trial and error, I found that the best indexing performance is achieved by sending 3000 documents in a single bulk request (all the requests in the bulk request are index requests).

With 3000 documents per bulk request, I am able to index 32,000 documents per second.
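
For reference, the BulkProcessor is set up roughly like this (a simplified sketch, not my exact code; index/type names and the concurrency and flush values are placeholders, and "client" is the transport client described below):

    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;

    public class MetricsBulkIndexer {

        // Builds a BulkProcessor that flushes every 3000 index requests.
        public static BulkProcessor build(Client client) {
            return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                public void beforeBulk(long executionId, BulkRequest request) { }

                public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                    if (response.hasFailures()) {
                        System.err.println(response.buildFailureMessage());
                    }
                }

                public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                    failure.printStackTrace();
                }
            })
            .setBulkActions(3000)                             // 3000 documents per bulk request
            .setConcurrentRequests(1)                         // in-flight bulk requests (placeholder)
            .setFlushInterval(TimeValue.timeValueSeconds(5))  // flush partially filled bulks (placeholder)
            .build();
        }

        // Documents are added without an ID, so Elasticsearch auto-generates one.
        public static void add(BulkProcessor processor, String jsonDocument) {
            processor.add(new IndexRequest("metrics", "metric").source(jsonDocument));
        }
    }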

This indexing performance is not enough for me, so I added one more Elasticsearch node to the cluster (with the same hardware and settings). I verified the cluster by checking http://IP:9200/_nodes and can see that there are 2 nodes in the cluster now. Is there any other way to make sure that the cluster is working?

In my application, I am using the transport client to send the bulk requests. I configured the transport client with the addresses of both Elasticsearch nodes.
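
The client is created roughly like this (a sketch; host names and the cluster name are placeholders):

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class EsClientFactory {

        public static TransportClient create() {
            Settings settings = ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "metrics-cluster")   // placeholder cluster name
                    .build();
            TransportClient client = new TransportClient(settings)
                    .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300))
                    .addTransportAddress(new InetSocketTransportAddress("es-node-2", 9300));
            // Besides checking /_nodes, the cluster can be verified from the client:
            int dataNodes = client.admin().cluster().prepareHealth()
                    .execute().actionGet().getNumberOfDataNodes();
            System.out.println("data nodes in cluster: " + dataNodes);
            return client;
        }
    }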

When I run my tests, I see only a negligible improvement in throughput. Instead of 32,000 documents per second, with the cluster I am only able to index 33,000 documents per second, an increase of just 1,000 documents per second.

For testing, I have set "index.number_of_replicas" to "0" in the elasticsearch.yml of both Elasticsearch nodes.

Since I have 2 nodes in the Elasticsearch cluster, I have set "index.number_of_shards" to "6", i.e. 3 shards per node.

What am I missing here? Why is bulk indexing not scaling when I add more nodes to the cluster?

Thanks,
Paul

Some more information on the ES setup:

Both nodes are installed on RHEL 7 machines. Both machines have SSDs.

I have increased the "max_file_descriptors" to 65536 on both machines.
I have set "sysctl -w vm.max_map_count=262144" on both machines.
I have run "sudo swapoff -a" on both machines.

The following settings are added to the elasticsearch.yml of both nodes.

bootstrap.mlockall: true

indices.store.throttle.max_bytes_per_sec: 200mb

threadpool.bulk.type: fixed
threadpool.bulk.size: 64
threadpool.bulk.queue_size: 300

The refresh_interval for the index is set in the index template as:

"settings" : {
"index.refresh_interval" : "120s"
}

In the template, I have also specified the mappings. All the fields in my documents are "not_analyzed", "doc_values" is set to true for all of them, and none of the fields are included in the "_all" field.
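
Conceptually, the template corresponds to something like the following (a sketch only, not my actual template; the template name, index pattern and field names are placeholders):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;

    public class MetricsTemplate {

        // Installs an index template with the refresh interval, shard/replica counts
        // and a mapping whose fields are not_analyzed, use doc_values and are
        // excluded from _all. Names and fields are placeholders.
        public static void put(Client client) {
            String mapping =
                  "{ \"metric\": {"
                + "    \"_all\": { \"enabled\": false },"
                + "    \"properties\": {"
                + "      \"name\":      { \"type\": \"string\", \"index\": \"not_analyzed\", \"doc_values\": true },"
                + "      \"value\":     { \"type\": \"double\", \"doc_values\": true },"
                + "      \"timestamp\": { \"type\": \"date\",   \"doc_values\": true }"
                + "    }"
                + "} }";

            client.admin().indices().preparePutTemplate("metrics-template")
                    .setTemplate("metrics-*")
                    .setSettings(ImmutableSettings.settingsBuilder()
                            .put("index.refresh_interval", "120s")
                            .put("index.number_of_shards", 6)
                            .put("index.number_of_replicas", 0)
                            .build())
                    .addMapping("metric", mapping)
                    .execute().actionGet();
        }
    }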

The document IDs are auto-generated by Elasticsearch.

Since Elasticsearch was not scaling, I did some initial profiling on one of the Elasticsearch nodes.

From the profiling report, I found that one method is taking 40% of the CPU time.

org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(Term)

To avoid this method call, I also added another setting to elasticsearch.yml:

index.optimize_auto_generated_id: true

I do not see any documentation about this setting anywhere. I could only find the Javadoc for it in the source.

The Javadoc says: "iff documents with auto-generated IDs are optimized if possible. This mainly means that they are simply appended to the index if no update call is necessary."

In my case, all the documents I am indexing are final; I will not be updating or deleting them. So is it OK to set this setting to true?

I re-profiled the ES node after adding the setting. From the profile report, I can see that the 40% is no longer there. I also stepped through the code to make sure the new code path is taken.

But even after the change, I am not seeing any improvement.

I am also linking both the profile reports generated by the YJP 2014 build.

Can anyone please tell me if there is any setting I can change to improve the bulk indexing throughput of ES 1.7.1?

Thanks,
Paul

How large are your documents? How many bulk indexing threads did your application use during the benchmark? Did you verify that the index you are indexing into does not have any replicas configured and that your "index.number_of_replicas" took effect? What does CPU and IO load look like during indexing?
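
For example, the effective per-index settings can be checked from your client, roughly like this (a sketch; the index name is a placeholder):

    import org.elasticsearch.action.admin.indices.settings.get.GetSettingsResponse;
    import org.elasticsearch.client.Client;

    public class CheckIndexSettings {

        // Prints the replica and shard counts that are actually in effect for the index.
        public static void print(Client client) {
            GetSettingsResponse response = client.admin().indices()
                    .prepareGetSettings("metrics")
                    .execute().actionGet();
            System.out.println("replicas: " + response.getSetting("metrics", "index.number_of_replicas"));
            System.out.println("shards:   " + response.getSetting("metrics", "index.number_of_shards"));
        }
    }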

Some notes:

  • a general rule of thumb for ES bulk indexing: tune the slowest parts first, like I/O, and others like memory or CPU later; make quick changes first, complex ones later (or never)
  • the indexing rate can be tuned on two sides: first at the ES server nodes, second at the ES client nodes. Also, two nodes are not very much to form a reliable distributed system.
  • watch the system load on both sides. If you have one client process, it must push all the docs over the wire alone, and this might be your bottleneck. Use more clients to drive the cluster indexing if a single client shows high load (this can be a bit challenging if you have a single, sequentially organized data source)
  • 32k docs per second is not bad. What is the average size of a document? And the size of a bulk request? Each request uses one or more TCP/IP packets back and forth in a round-trip manner, so you should compute the network transport volume per second in bytes. Often the network is slow or saturated. Network compression may help at the cost of some extra CPU cycles (set transport.tcp.compress to true on both server and client side; see the sketch after this list)
  • you use SSDs, so you should check whether indexing is still throttled on the server side. Disable throttling to let ES write faster by setting indices.store.throttle.type to none (also shown in the sketch after this list). Also, ES 1.x may perform suboptimally when large segments get merged; examine this area if you see CPU load spikes after minutes/hours of bulk indexing. Note that many optimizations in this area are adjusted automatically in ES 2.x.
  • I recommend ES 2.x
  • the overhead of random doc ID generation can be neglected compared to the CPU/network load of bulk indexing
  • another area of interest is tuning at the operating system level, e.g. filesystem settings, for faster writes. An example: on Linux, you might want to use elevator=noop for SSDs; see also https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
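
Here is a rough sketch of the two settings mentioned above, using the Java API: transport compression on the client side (the same transport.tcp.compress: true also has to go into elasticsearch.yml on the server nodes) and disabling store throttling through a transient cluster setting. Cluster name and host are placeholders.

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class TuningSketch {

        // Transport client with TCP compression enabled on the client side.
        public static TransportClient compressedClient() {
            return new TransportClient(ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "metrics-cluster")   // placeholder
                    .put("transport.tcp.compress", true)      // compress client <-> server traffic
                    .build())
                    .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300));
        }

        // Disables store throttling cluster-wide (transient, so it reverts on full cluster restart).
        public static void disableStoreThrottling(TransportClient client) {
            client.admin().cluster().prepareUpdateSettings()
                    .setTransientSettings(ImmutableSettings.settingsBuilder()
                            .put("indices.store.throttle.type", "none")
                            .build())
                    .execute().actionGet();
        }
    }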

Hi,

Thank you for the reply.

Let me explain a little bit more about my set up.

I have an application which sends metric information to ES.

When a single node of ES is used with a single node of my application, I am able to handle ~27k requests/sec.

When a single node of ES is used with two nodes of my application, it is able to handle ~32k requests/sec.
In this case, each node of my application has a transport client instance configured to send the data to the same ES instance.

When a two-node ES cluster is used with two nodes of my application, it is only able to handle ~33k requests/sec.
In this scenario (2 nodes of my application), if I just turn off the metric collection, it is able to handle ~40k requests/sec. The throughput drops when metric collection is enabled, so I believe it has something to do with the metric collection (the bulk indexing to ES).
As you said, the issue could be in the ES client, the ES server, or the network, and I was trying to figure out where it is.

So I made a small change in the setup. Previously, both nodes of my application were sending bulk requests to both nodes of the ES cluster (I believe the transport client sends the bulk requests in a round-robin fashion to both nodes). I changed it so that node 1 of my application only sends bulk requests to node 1 of the ES cluster, and node 2 of my application only sends to node 2 of the ES cluster. This scenario is very much like the first one, where there is one node of my application and one node of ES; the only difference is that the ES nodes form a cluster. In this case, there was not much difference in throughput. It was still ~16k per node (~32k combined). So I thought the drop was because the ES nodes in the back end are clustered.

I then made another change to the same setup: I made the ES nodes non-clustered by removing the "discovery.zen.ping.unicast.hosts" property from both ES nodes so that they do not form a cluster. In this case too, there was not much difference; the throughput was still ~16k.

So now I think the bottleneck is neither the ES client nor the ES server nodes; it might be the network itself. I will surely try setting the "transport.tcp.compress" property. But is there any way to confirm whether the network is the bottleneck?

I will also try using the node client instead of the transport client. I think it will also help reduce the network usage, since there is one less hop for the indexing data.
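
For the node client, I am planning to try something roughly like this (a sketch; the cluster name is a placeholder):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.node.Node;

    import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

    public class NodeClientFactory {

        // A client-only node: it joins the cluster but holds no data, so requests
        // can be routed directly to the shards without an extra hop.
        public static Client create() {
            Node node = nodeBuilder()
                    .clusterName("metrics-cluster")   // placeholder
                    .client(true)                     // node.client=true, holds no data
                    .node();                          // builds and starts the node
            return node.client();
        }
    }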

I am afraid I will not be able to upgrade to ES 2.x at this point in time, but I will be upgrading in the near future.

Thanks for all the suggestions; I will try them all and get back.

Thanks,
Paul

On RHEL 7 you can install iptraf-ng from the CentOS mirror and watch the network traffic and statistics to see whether your changes make a difference.