Reducing Memory Consumption while Bulk Indexing

It seems bulk indexing is using a lot of memory. The spike does not last long, but the rest of the day sees much more indexing demand than what is shown here. Is there a way I can reduce the memory required during indexing while still keeping the indexing rate high?

Version 1.4.2

Maybe my _routing is causing this?

"mappings": { "test_result": { "_routing": { "path": "build.revision12","required": true}, ....

Here is my yml file:

cluster.name: active-data
node.zone: primary
node.name: primary
node.master: true
node.data: true

cluster.routing.allocation.awareness.force.zone.values: primary,spot
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.cluster_concurrent_rebalance: 1
cluster.routing.allocation.balance.shard: 0.70
cluster.routing.allocation.balance.primary: 0.05

bootstrap.mlockall: true
path.data: /data1, /data2, /data3
path.logs: /data1/logs
cloud:
  aws:
    region: us-west-2
    protocol: https
    ec2:
      protocol: https
discovery.type: ec2
discovery.zen.ping.multicast.enabled: false
discovery.zen.minimum_master_nodes: 1

index.number_of_shards: 1
index.number_of_replicas: 1
index.cache.field.type: soft
index.translog.interval: 60s
index.translog.flush_threshold_size: 1gb

indices.memory.index_buffer_size: 20%
indices.fielddata.cache.expire: 20m
indices.recovery.concurrent_streams: 1
indices.recovery.max_bytes_per_sec: 1000mb

http.compression: true
http.cors.allow-origin: "/.*/"
http.cors.enabled: true
http.max_content_length: 1000mb
http.timeout: 600

threadpool.bulk.queue_size: 3000
threadpool.index.queue_size: 1000

Thanks

Increasing your bulk threadpool queue will only increase that memory use.

How big are your bulk requests? How many nodes are in the cluster?

That aside, I don't see a major problem; you aren't really approaching your heap maximum, and resource usage looks pretty reasonable.

There are 10 nodes in the cluster, but this node is the "primary", containing all the shards. The 9 other nodes are the recipients of the bulk index requests (and contain some replica shards). Each of the 9 has a daemon inserting 5000 documents per bulk request. Each document is about 1K. It works out to a rate of about 400*9 documents per second. Faster would be nicer, but this primary node is the bottleneck.
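Rough back-of-envelope on those numbers, using the ~1K per document figure:

5,000 docs × ~1 KB ≈ 5 MB per bulk request
400 × 9 ≈ 3,600 docs/sec, or roughly 3.5 MB/sec of source data
threadpool.bulk.queue_size of 3000 × ~5 MB ≈ 15 GB if the bulk queue were ever completely full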

Here is an example of what it looks like just before ES dies:

Why do you have all the primaries on a single node then?

That is the only node in that zone right now, as per the *.yml file.

If I understood why ES is consuming so much memory during bulk indexing, I may be able to mitigate the problem.

It'll use what it needs.
But not spreading the primaries, and thus the resource consumption, around is going to be part of your problem.

If you are using 1 replica and two zones, where one of the zones has only a single node, that node will be indexing all records, which will lead to much higher load there than on the other nodes. Elasticsearch is usually deployed in clusters where all nodes are equal, so that load can be evenly distributed. What is it you are trying to achieve with this setup?
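A quick way to confirm where the primaries actually live is the cat shards API, for example:

curl 'localhost:9200/_cat/shards?v'

The prirep column shows whether each copy is a primary (p) or a replica (r), and the node column shows which node holds it; from what you describe, every p should be sitting on that single node.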

I am aware that a node with more shards will have more load. I am concerned with what I can do to reduce the indexing load, or move the work to the other nodes. Maybe it is because this one node is the master? Should I assign the master role to a node in the other zone? Maybe a non-master ("slave") node holding all the shards will use less memory?

My setup is designed to minimize cost on AWS. The benefit of having one machine in a zone is that it's 1/2 the cost of two machines and 1/3 the cost of three machines.

I added a "coordinator node" in the same zone as the 9 nodes. This coordinator is the master (node.master: true), and has no shards (node.data: false). Here is the coordinator under heavy indexing load:

It is doing nothing, as expected.
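For reference, the coordinator's yml boils down to something like this (the node name is illustrative, and the zone is my reading of the awareness values above; the discovery settings are the same as on the primary node):

cluster.name: active-data
node.zone: spot
node.name: coordinator
node.master: true
node.data: false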

The primary node is the same: it still has all the shards, but it was restarted with node.master: false. Here it is under higher load:

What's important to see is that the rate of memory consumption is much lower. Hopefully this leads to fewer OOM events crashing this node. I believe the growth in memory is proportional to the number of primary shards on a node. By moving the master to the other zone, it appears my "primary" node is assigned fewer primary shards, and so it consumes less memory.

This is not perfect: eventually this node's shard copies are designated primary again, and the memory consumption rate goes up accordingly. I hope I can control the algorithm that determines which copy is declared primary, preferring that primaries be assigned to the zone with the most nodes.
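I have not found a setting in 1.4 that expresses that preference directly. One manual workaround I may try is the cluster reroute API: cancelling a started primary with allow_primary set should promote its replica in the other zone to primary. A sketch only; the index name and shard number are placeholders, and the node name is the one from my config:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    { "cancel": { "index": "unittest", "shard": 0, "node": "primary", "allow_primary": true } }
  ]
}'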