Error 413 and Disk Watermark on ECE but not on vanilla Elasticsearch

I am doing bulk indexing on an ECE deployment (running Elasticsearch version 7) and have encountered two errors.

The first is error 413 (Request Entity Too Large), which happens when indexing a series of documents of around 1-2 MB each. The error message is:

ResponseException[method [POST], host [http://832fc92e50ae46ec9bcd8ca23545258f.ece.dev.gov.sg:9200], URI [/_bulk?timeout=2000s], status line [HTTP/1.1 413 Request Entity Too Large]\n<html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.11.13</center>\r\n</body>\r\n</html>\r\n]; nested: ResponseException[method [POST], host [http://832fc92e50ae46ec9bcd8ca23545258f.ece.dev.gov

The second is that the cluster suddenly switches into read-only mode partway through the bulk indexing. Before this happened, a number of timeout errors had already appeared. The timeout message is:

ResponseException[method [POST], host [http://832fc92e50ae46ec9bcd8ca23545258f.ece.dev.gov.sg:9200], URI [/_bulk?timeout=2000s], status line [HTTP/1.1 504 Gateway Timeout]\n{\"ok\":false,\"message\":\"Timed out waiting for server to produce a response.\"}]; nested

When bulk indexing the same set of documents on a vanilla Elasticsearch cluster, there were no such errors. The Elasticsearch cluster on ECE is running version 7, while the vanilla cluster is running version 6.5.

What configuration changes can be made on ECE to overcome these errors?

Do you have an nginx instance (or something similar) load balancing requests to the proxy? It looks like that error is coming from there. You'll want to add something like client_max_body_size 100M to your nginx config.
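
A minimal sketch of what that could look like, assuming a plain nginx reverse proxy sitting in front of the ECE proxy (the listen port, upstream name, and the 100M limit are all placeholders to adapt to your setup):

server {
    listen 9200;

    # nginx rejects request bodies over 1 MB by default with a 413,
    # so raise the limit to something that fits your bulk batches
    client_max_body_size 100M;

    location / {
        # Placeholder upstream; point this at your ECE proxy endpoint
        proxy_pass http://ece-proxy.example.internal:9200;
    }
}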

The root cause for this, I think, is overloading the cluster. That 504 is just the proxy complaining that it hasn't had a reply from the ES server for 60 s; the ES server is probably stuck garbage collecting, or its task list (_cat/tasks) is backed up.
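
If you want to check that while the bulk load is running, these standard Elasticsearch APIs are a reasonable starting point (a sketch; for bulk indexing on 7.x the relevant thread pool is write):

GET _cat/tasks?v
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected
GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent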

I am unsure why it works fine on a 6.x cluster. It could be differences in performance across the versions, but it's more likely to be environmental factors, e.g. how much memory and CPU do the two clusters have? (In ECE, by default the CPU is hard-limited to approximately ncores * 1.2 * cluster_mem / allocator_mem. You can turn hard limiting off via the advanced cluster editor, in the Data section, by setting hard_limit to false.)

Could you provide more details on this? I think that normally, when a cluster goes into read-only mode, it logs a "cluster block" that explains why.
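
For reference, you can see whether any such block is currently set on the indices with a standard settings lookup (here _all just means every index):

GET /_all/_settings/index.blocks.*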

Alex

Regarding the cluster turning into read-only mode, this is the message shown on the admin console UI.

The strange thing is that only one deployment has data written to it, but this "read-only state" message occurs for a couple of other deployments too, even though they contain no data and nothing is being indexed into them.

I noticed that the root file system of the OS is quite full (34 GB of 38 GB used). However, the ECE containers are on another partition, which still has ample disk space. Does a full root partition cause problems for ECE deployments?

Interesting, I can't think of a mechanism by which the root FS's capacity would be used. The cluster logs should declare when indices are marked read-only and why; do you still have those?

Alex

Yes, I have the logs, and there are a lot of "high disk watermark exceeded on one or more nodes" messages. The logs show lines like these:

[es/i-0/es.log] [2019-05-18T00:00:15,365][WARN ][org.elasticsearch.cluster.routing.allocation.DiskThresholdMonitor] [instance-0000000000] high disk watermark [90%] exceeded on [ThXN4Cr1RmWRz0QYYDsqRg][instance-0000000000][/app/data/nodes/0] free: 2.4gb[7%], shards will be relocated away from this node
[es/i-0/es.log] [2019-05-18T00:00:45,372][INFO ][org.elasticsearch.cluster.routing.allocation.DiskThresholdMonitor] [instance-0000000000] rerouting shards: [high disk watermark exceeded on one or more nodes]
...
[es/i-0/es.log] [2019-05-18T05:46:16,078][INFO ][org.elasticsearch.cluster.routing.allocation.DiskThresholdMonitor] [instance-0000000000] low disk watermark [85%] exceeded on [ThXN4Cr1RmWRz0QYYDsqRg][instance-0000000000][/app/data/nodes/0] free: 3.5gb[10.2%], replicas will not be assigned to this node

Is this an issue with the disk storage space or some Docker storage issue?

The deployments in the ECE admin UI showed that whatever data they contained did not fully occupy the allocated storage for the deployment.


The watermark thresholds are not calculated against the total allocator disk space; they are calculated against the disk allocation of each cluster instance. By default this is 32x the RAM of each instance (it can be changed by editing instance configurations).

So, for example, the third log message you posted implies that your instance's disk capacity was approximately 32 GB (which, at the default 32:1 ratio, would suggest it was a 1 GB RAM instance) and that it contained about 29 GB of data, which is more than the configured/default maximum of 85%. Does that sound possible?
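
You can check the numbers Elasticsearch itself uses for the watermark calculation with the standard cat API (same on ECE and on a vanilla cluster):

GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.total,disk.percent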

Btw, the fix is to run the following as a superuser (e.g. elastic):

PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}

The block will not clear automatically once you have reached the flood stage, so it has to be removed manually.

Yes, I did ingest some data into one of the clusters. However, the disk usage statistic was only around 9% for that cluster.

The other two clusters had no data ingested into them (0% disk usage) but still showed the read-only message.

[Screenshot: disk usage for deployment 832fc9 in the admin console]

Hence, it seems strange that the disk watermark was being hit when the disk usage metric was nowhere near full.

One thing to note is that the disk usage displayed in the UI is (for historical reasons; we're fixing it) not the same as the disk usage that ES sees. It's only an approximation based on a) the expected disk capacity and b) the disk usage of the open indices. So if the disk capacity is actually different (e.g. XFS quotas not configured), or there are large amounts of logs and/or closed indices, then the two numbers can diverge.

If you run docker exec -it <container-id> bash for, e.g., one of the ES instances that hit the watermark errors, it would be interesting to see what df reports, and also the contents of /xfs_quota.txt.

I'd also run docker inspect <container-id> and double-check that /app/data is mounted onto the large volume and not the root one...
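
Roughly along these lines (a sketch; the container name/ID will be whatever docker ps shows for the affected instance):

# find the container for the affected ES instance
docker ps | grep instance-0000000000

# capacity and usage as seen from inside the container
docker exec <container-id> df -h /app/data
docker exec <container-id> cat /xfs_quota.txt

# confirm /app/data comes from the large data volume, not the root FS
docker inspect --format '{{json .Mounts}}' <container-id>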

...One thing I noticed is that the UI screenshots both show 16 GB disk capacity, but the error message indicated that the disk capacity was ~32 GB, which is suspiciously similar to the root disk you mentioned :slight_smile:

