Slow ingestion problem (v 6.2.3)


(Kyle Lahnakoski) #1

I have a 12 node cluster, with 28 indices and 1646 shards (I am targeting 20gigs per shard)

I have spent a month, on and off, trying to figure out why ingestion is too slow; it can not keep up with the rate that the documents are produced. I have a program that uses the bulk api to push json documents to ES, and it is reporting back the total time. Here is an example:

Timer end : Add 500.0 documents to coverage20180610_000000 (took 5.884 seconds)

This time is about 200ms greater than the {"took":... time reported in the ES response, which I do not have a this time. Each record is about 3.5kilobytes.

10 of the 12 nodes are ingesting, and are reporting

  • low CPU (15% or less)
  • low disk (write 100 to 1000 megs per minute), and
  • low network (send 80 megs per minute).

2 nodes are "backup nodes", and are reporting

  • low CPU (7.5% or less)
  • low disk (500megs per minute), and
  • low network (receive 100megs per minute).

The 2 "backup nodes" just happen to have all the master shards because they were started first.

I have tried increasing the number of processes that are pushing to ES, but that does not seem to impact the over all rate of documents ingested. I can reduce the number of replicas, and the ingestion is faster, but that is not a good long term solution because this ingestion rate must be maintained in the long run. I can increase the number of ingestion nodes, and go faster, but resource utilization is still low. I can increase the number of backup nodes, and ingestion increases a little, but not worth the cost of a whole node. I have an old v1.7 cluster with similar configuration that is ingesting 10x faster, and it uses EBS drives!

Maybe there is a global coordination lock being used for ingestion? Could all masters on 2 nodes be the problem?

Thank you


(Christian Dahlqvist) #2

What portion of your indexing requests are updates to existing documents vs creation of new documents? Are you using nested documents? Are you by any chance frequently updating the same documents?


(Kyle Lahnakoski) #3

Thank you!

  • Mostly, they are new documents.
  • There are nested documents, some indices have more than others, but I am not sure how to get a count of them.

A call to http://localhost:9200/_stats gives me (after removing all but this one index):

{
	"_shards":{"total":1526,"successful":1526,"failed":0},
	"_all":{"indices":{"coverage20180610_000000":{
		"primaries":{
			"docs":{"count":233530873,"deleted":13641},
			"store":{"size_in_bytes":821566321973},
			"indexing":{
				"index_total":169939627,
				"index_time_in_millis":177056168,
				"index_current":0,
				"index_failed":0,
				"delete_total":0,
				"delete_time_in_millis":0,
				"delete_current":0,
				"noop_update_total":0,
				"is_throttled":false,
				"throttle_time_in_millis":0
			},
			"get":{
				"total":0,
				"time_in_millis":0,
				"exists_total":0,
				"exists_time_in_millis":0,
				"missing_total":0,
				"missing_time_in_millis":0,
				"current":0
			},
			"search":{
				"open_contexts":0,
				"query_total":3122,
				"query_time_in_millis":903853,
				"query_current":0,
				"fetch_total":0,
				"fetch_time_in_millis":0,
				"fetch_current":0,
				"scroll_total":0,
				"scroll_time_in_millis":0,
				"scroll_current":0,
				"suggest_total":0,
				"suggest_time_in_millis":0,
				"suggest_current":0
			},
			"merges":{
				"current":5,
				"current_docs":4071504,
				"current_size_in_bytes":14989103140,
				"total":3841,
				"total_time_in_millis":649895100,
				"total_docs":502596629,
				"total_size_in_bytes":2024259928641,
				"total_stopped_time_in_millis":0,
				"total_throttled_time_in_millis":499444784,
				"total_auto_throttle_in_bytes":104857600
			},
			"refresh":{
				"total":17085,
				"total_time_in_millis":17412701,
				"listeners":0
			},
			"flush":{"total":1610,"total_time_in_millis":10184947},
			"warmer":{"current":0,"total":15386,"total_time_in_millis":46575},
			"query_cache":{
				"memory_size_in_bytes":21893790,
				"total_count":32883,
				"hit_count":310,
				"miss_count":32573,
				"cache_size":148,
				"cache_count":378,
				"evictions":230
			},
			"fielddata":{"memory_size_in_bytes":0,"evictions":0},
			"completion":{"size_in_bytes":0},
			"segments":{
				"count":790,
				"memory_in_bytes":494785243,
				"terms_memory_in_bytes":57827435,
				"stored_fields_memory_in_bytes":231630944,
				"term_vectors_memory_in_bytes":0,
				"norms_memory_in_bytes":0,
				"points_memory_in_bytes":199946864,
				"doc_values_memory_in_bytes":5380000,
				"index_writer_memory_in_bytes":1417407368,
				"version_map_memory_in_bytes":30275869,
				"fixed_bit_set_memory_in_bytes":29227920,
				"max_unsafe_auto_id_timestamp":-1,
				"file_sizes":{}
			},
			"translog":{
				"operations":2373800,
				"size_in_bytes":11995419090,
				"uncommitted_operations":682890,
				"uncommitted_size_in_bytes":3473191346
			},
			"request_cache":{
				"memory_size_in_bytes":0,
				"evictions":0,
				"hit_count":345,
				"miss_count":2777
			},
			"recovery":{
				"current_as_source":0,
				"current_as_target":0,
				"throttle_time_in_millis":179
			}
		},
		"total":{
			"docs":{"count":466992124,"deleted":20835},
			"store":{"size_in_bytes":1652473804423},
			"indexing":{
				"index_total":339883222,
				"index_time_in_millis":330045679,
				"index_current":0,
				"index_failed":0,
				"delete_total":0,
				"delete_time_in_millis":0,
				"delete_current":0,
				"noop_update_total":0,
				"is_throttled":false,
				"throttle_time_in_millis":0
			},
			"get":{
				"total":0,
				"time_in_millis":0,
				"exists_total":0,
				"exists_time_in_millis":0,
				"missing_total":0,
				"missing_time_in_millis":0,
				"current":0
			},
			"search":{
				"open_contexts":0,
				"query_total":6314,
				"query_time_in_millis":1839745,
				"query_current":0,
				"fetch_total":0,
				"fetch_time_in_millis":0,
				"fetch_current":0,
				"scroll_total":0,
				"scroll_time_in_millis":0,
				"scroll_current":0,
				"suggest_total":0,
				"suggest_time_in_millis":0,
				"suggest_current":0
			},
			"merges":{
				"current":10,
				"current_docs":6557257,
				"current_size_in_bytes":24233445762,
				"total":7762,
				"total_time_in_millis":1289595604,
				"total_docs":1006499560,
				"total_size_in_bytes":4055087380021,
				"total_stopped_time_in_millis":0,
				"total_throttled_time_in_millis":997427813,
				"total_auto_throttle_in_bytes":209715200
			},
			"refresh":{
				"total":34231,
				"total_time_in_millis":35094649,
				"listeners":0
			},
			"flush":{"total":3250,"total_time_in_millis":19049071},
			"warmer":{"current":0,"total":30806,"total_time_in_millis":90150},
			"query_cache":{
				"memory_size_in_bytes":53433276,
				"total_count":66463,
				"hit_count":791,
				"miss_count":65672,
				"cache_size":365,
				"cache_count":903,
				"evictions":538
			},
			"fielddata":{"memory_size_in_bytes":0,"evictions":0},
			"completion":{"size_in_bytes":0},
			"segments":{
				"count":1598,
				"memory_in_bytes":990239960,
				"terms_memory_in_bytes":116017625,
				"stored_fields_memory_in_bytes":463278160,
				"term_vectors_memory_in_bytes":0,
				"norms_memory_in_bytes":0,
				"points_memory_in_bytes":400199119,
				"doc_values_memory_in_bytes":10745056,
				"index_writer_memory_in_bytes":3087651748,
				"version_map_memory_in_bytes":66396183,
				"fixed_bit_set_memory_in_bytes":58447128,
				"max_unsafe_auto_id_timestamp":-1,
				"file_sizes":{}
			},
			"translog":{
				"operations":4726086,
				"size_in_bytes":23879371208,
				"uncommitted_operations":2128203,
				"uncommitted_size_in_bytes":10794223647
			},
			"request_cache":{
				"memory_size_in_bytes":0,
				"evictions":0,
				"hit_count":715,
				"miss_count":5599
			},
			"recovery":{
				"current_as_source":0,
				"current_as_target":0,
				"throttle_time_in_millis":179
			}
		}
	}}}
}

(Robert Cowart) #4

I suspect your indices are too large. 20-30GB is good goal for an index, not a shard. I suspect that you have indexes with multiple hundreds of GB.

Are you writing daily indices? You mention most data is new documents, so if you are not using daily indices this could help a lot.

How many replicas are you using? According to the numbers you have provided you have nearly 60 shards per index, which is excessive, especially when you only have 12 nodes.

You mention storage volume in bytes. That is largely irrelevant. Write IOPS will be the limiting factor. What values are you seeing for write IOPS (install Metricbeat to help you track these number over time). What is the storage setup that you are using? NAS is bad for high ingest rates... RAID-5 or 6 is bad for high ingest rates... Spinning disk can work in some configurations, but can also be bad for high ingest rates.

I suspect a combination of excessive number of shards in indices that are too large and storage that is write IOPS saturated.


(Kyle Lahnakoski) #5

rcowart,

This slow ingestion rate is odd because we have a v1.7 cluster, with similar specs, and worse drives (EBS drives) that ingests about 10x faster.

One index with 60 shards over 10 nodes has been chosen to keep the shard size down to 20gigs so they can be moved around the cluster in a reasonable time. A new index is created each week, and we keep multiple weeks for querying. We are storing about 6terabytes; 20gig indexes would require 300 indexes; or about one new index every 3 hours. That is excessive.

Disk write operations averages about 10K/minute for ingestion machines, and 15K for backup machines; reads are 1/2 that; these numbers spike to 10x, or more, when shards move. I do not think IOPS is a limiting factor; all drives are local, and nowhere near their limit in bytes or operations.


(Christian Dahlqvist) #6

I do not necessarily think the shard size is the problem as it sounds quite reasonable to me. If you want to hold data for longer periods you generally need to have quite large shards as lots of small ones are quite inefficient.

You are however indexing into a large number of shards, which means that relatively few records from each bulk request will hit each shard. One change in Elasticsearch 6.x compared to Elasticsearch 1.x is that the transaction log is synced to disk much more frequently in order to improve resiliency.

Instead of having a weekly index with 60 primary shards, I would recommend switching to daily indices with 12 primary shards, which should result in a similar shard size. You can also use the rollover API to cut new indices based on size and/or time, which will allow you reach a specific target size irrespective of the number of primary shards you use.


(Kyle Lahnakoski) #7

Yes, we noticed the high write IOPS compared to v1.7 we turned off journaling ...

            sudo('tune2fs -o journal_data_writeback '+k.device)
            sudo('tune2fs -O ^has_journal '+k.device)

... but it made no noticeable difference to the ingestion speed; we are well within our IOPS limits for these local drives


(Christian Dahlqvist) #8

Are you allowing Elasticsearch to assign document IDs or are you setting them at the application layer?

Also, what do you mean by "backup nodes"?


(Kyle Lahnakoski) #9

The application layer is setting the document IDs; they do not conform to the optimum specs found in the docs, but that can not be changed because there are (rare) cases when documents must be replaced.

The "backup nodes" are in a separate zone, and have a copy of all shards. There is only 2 right now, and they are not meant to accept queries. More here: Moving shards is slow


(Kyle Lahnakoski) #10

I am currently testing

curl -XPUT localhost:9200/_settings -d '{"index.translog.durability": "async"}'  -H "Content-Type: application/json"

Initial numbers look good. I will get back to you tomorrow after I see the long term effects on ingestion.


(Kyle Lahnakoski) #11

The good news is changing to "index.translog.durability": "async" increased the ingestion rate significantly; the bad news it only lasted 6 hours before it dropped to normal, or worse.

Here is a chart counting the number of batches being inserted. Each batch has a variable number of records, so it is a terrible instantaneous metric, but it is good enough over long timeseries to get an idea of ingestion speed. I marked the time when I set durability to async:

As we can see increased ingestion rate, then a precipitous fall, a gentle rise, and then another big fall; now the ingestion is worse.

There was a suggestion of having more indices (each with less shards), may increase the ingestion rate. I am not sure this is true; this morning (sometime after 4amGMT) the ingestion rolled over to new weekly indices, and I see no improvement. If anything more indices seems slower.

Although ingestion is slower, the machines are busy as if they are ingesting faster:


(Kyle Lahnakoski) #12

Ingestion continues to get worse. I am attempting another experiment:

For the biggest indexes I set the sync_interval

curl -XPOST localhost:9200/coverage20180617_000000/_close
curl -XPUT  localhost:9200/coverage20180617_000000/_settings -d '{"index.translog.sync_interval": "60s"}'  -H "Content-Type: application/json"
curl -XPOST localhost:9200/coverage20180617_000000/_open

I decided to keep the async setting; it can not possibly be wrong:

curl -XPUT localhost:9200/_settings -d '{"index.translog.durability": "async"}'  -H "Content-Type: application/json"
curl -XPUT localhost:9200/_settings -d '{"index.translog.flush_threshold_size": "1000mb"}'  -H "Content-Type: application/json"

I will let you know tomorrow if this improved anything.


(Kyle Lahnakoski) #13

This is the painful part. The blue is a typical CPU of a 1.7 node, while orange is a 6.2 node:

The old ES version is able to ingest much faster because it is using CPU; the spikes correspond to when there is a backlog, and plenty to ingest. The new cluster can not keep up, CPU generally stays low despite there always being a backlog.


(Kyle Lahnakoski) #14

I am now attempting making a daily index with 24shards each. The data volume is currently about 2.5T with 800M records per week. The new index is given the following settings:

"settings": {
    "index": {
        "refresh_interval": "300s",
        "number_of_shards": "24",
        "translog": {
            "flush_threshold_size": "1000mb",
            "sync_interval": "60s",
            "durability": "async"
        },
        "provided_name": "coverage20180622_000000",
        "creation_date": "1529854723598",
        "number_of_replicas": "1",
        "uuid": "iYsl_ObIR8yEZN9lLovvfg",
        "version": {
            "created": "6010299"
        }
    }
}

(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.