Elasticsearch 7 has 4x higher AZ transfer costs in AWS compared to 2.4

Since upgrading from Elasticsearch 2.4 to 7.4.1 (I know :stuck_out_tongue: ), we are seeing our AWS availability-zone transfer charges skyrocket. Both the old and new clusters use the cluster.routing.allocation.awareness.attributes: aws_availability_zone setting and are deployed across 3 AZs within an AWS region.

What I'm wondering is: did we miss some "compress internal communication" setting or similar for shard shuffling?

Does ES7 move more shards around than 2.4 did? We have cluster.routing.allocation.node_concurrent_recoveries set to 2 on both, but didn't notice much difference when we raised it to 8.
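For reference, that's a dynamic setting, so it can be changed on the fly via the cluster settings API without a restart. A sketch of the request body (2 is also the default value):

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}
```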

Found this https://www.theguardian.com/info/2020/feb/04/taming-data-transfer-costs-with-elasticsearch but they seemed to be having problems with queries costing more.

Our costs are directly linked to the data nodes, and appear to be driven by them talking to each other.

Have a look at the docs, especially the transport.compress setting.
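If I remember right it's a static setting, so it goes in elasticsearch.yml and needs a node restart to take effect; something like:

```yaml
# elasticsearch.yml — compress all inter-node transport traffic
# (static setting in 7.x, so apply via a rolling restart)
transport.compress: true
```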


So we set up a little lab:
ES2 - 2 data nodes in different AZs
ES7 - 2 data nodes in different AZs
ES7 with transport.compress: true - 2 data nodes in different AZs

Ingested some data, then deleted it, ready for our tests. We're using Packetbeat to monitor port 9300 on the data nodes.
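For anyone reproducing this, the Packetbeat side was roughly the following (a sketch, not our exact config; the bpf_filter line restricts capture to the transport port, and the flows options may differ slightly between Packetbeat versions):

```yaml
# packetbeat.yml — record traffic flows on the ES transport port only
packetbeat.interfaces.device: any
packetbeat.interfaces.bpf_filter: "port 9300"
packetbeat.flows:
  timeout: 30s
  period: 10s
```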

ES2, idle before we started re-ingestion, transferred 156 MB between nodes
ES7 idle transferred 60 GB!!
ES7 idle with compression transferred 25 GB!!

Re-checking our numbers now, but that's crazy. What is it sending, with no indices, that adds up to 60 GB?

Going to check again with self-monitoring off on ES7, plus some other things that may differ between 2 and 7.

So this is very telling ... starting from a blank slate, we ingested the same set of data (Shakespeare's works) into both ES2 and ES7 (with compression on).
The initial ingestion is similar, but then the data nodes on ES7 continuously transfer data around ... while ES2 is done.

Yes, this seems unexpected indeed. Can you disable compression and grab a full packet capture of the network traffic in this experiment using tcpdump? I'm very curious why we're transferring MBs of data so frequently.


For sure. What's the best way to get that to you?

I sent you a private message with an upload link.

Used the following on the 2 data nodes:

 # timeout 30 tcpdump -vv -i any -w data2.cap port 9300
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
1779 packets captured
1788 packets received by filter
0 packets dropped by kernel

Hope that works for you?

I'm seeing quite a few of these: https://github.com/elastic/elasticsearch/pull/38262

I don't see any retention lease sync actions. The only actions I see are these, along with their counts:

     17 cluster:monitor/nodes/info[n]
      6 cluster:monitor/nodes/stats[n]
      3 cluster:monitor/stats[n]
     14 indices:data/read/get[s]
      4 indices:data/write/bulk[s][r]
      3 indices:monitor/recovery[n]
      7 indices:monitor/stats[n]

There look to be several hundred indices in this cluster, so the index stats and recovery stats responses are ~200 kB each; since something is requesting those stats periodically, they account for more than half of the 4 MB file you shared. There's some write traffic to monitoring-beats-7-2020.04.08 and some read traffic from Kibana. I don't see anything particularly unexpected here; Elasticsearch is apparently just doing quite a bit of work to serve client requests.


Ah, sorry, I had a bug in my filter. I do now see retention lease syncs:

     17 cluster:monitor/nodes/info[n]
      6 cluster:monitor/nodes/stats[n]
      3 cluster:monitor/stats[n]
      3 cluster:monitor/xpack/analytics/stats[n]
      3 cluster:monitor/xpack/sql/stats/dist[n]
     14 indices:data/read/get[s]
      4 indices:data/write/bulk[s][r]
      3 indices:monitor/recovery[n]
      7 indices:monitor/stats[n]
    264 indices:admin/seq_no/retention_lease_background_sync[r]
      4 indices:admin/seq_no/global_checkpoint_sync[r]
     18 indices:data/read/search[phase/query]

264 is not so many given how many indices there are, and each one is only a few hundred bytes, so I don't think that's particularly significant.

Based on your results there (is that tool available somewhere? :smile: ), we raised xpack.monitoring.collection.interval from the default (10s) to 60s. I think you can see where that happened! That should help for sure. We may even go to 120s ...

You probably have it already. I just used tcpflow to split the pcap file up, then cat * | strings | grep '\(cluster\|indices\):' | sort | uniq -c to find all the things that looked like action names. Then a few spot checks with Wireshark (Statistics -> TCP flow graphs) to measure some approximate message sizes.
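To make that pipeline concrete, here's a runnable sketch of the counting step. The printf stands in for the real input, which would be something like tcpflow -r data2.cap -o flows/ followed by cat flows/* | strings; the grep -o variant extracts just the action names rather than whole lines:

```shell
# Count transport action names in a stream of strings.
# The printf below is stand-in demo data for: cat flows/* | strings
printf '%s\n' \
  'indices:monitor/stats[n]' \
  'cluster:monitor/nodes/info[n]' \
  'indices:monitor/stats[n]' |
grep -o '\(cluster\|indices\):[a-z_/]*' |
sort | uniq -c | sort -rn
# prints counts like:  2 indices:monitor/stats
```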

I'm not sure there's much value in optimising monitoring traffic as you suggest: a few MB per minute is pennies per day in cross-AZ transfer costs, and it's normally completely swamped by actual production traffic. I don't think this is the cause of the 60 GB of traffic you mentioned above.
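To put rough numbers on that claim (the rate here is an assumption: AWS typically bills $0.01/GB in each direction across AZs, i.e. ~$0.02/GB combined, but check your own bill):

```python
# Back-of-the-envelope cross-AZ cost arithmetic.
COST_PER_GB = 0.02  # USD, both directions combined (assumed AWS rate)

mb_per_minute = 5  # ballpark monitoring traffic
gb_per_day = mb_per_minute * 60 * 24 / 1024
print(f"monitoring: ~{gb_per_day:.1f} GB/day -> ${gb_per_day * COST_PER_GB:.2f}/day")

# The 60 GB observed while idle in the lab, for comparison:
print(f"idle lab traffic: 60 GB -> ${60 * COST_PER_GB:.2f}")
```

So even several MB per minute of monitoring traffic works out to cents per day, while the idle-lab 60 GB is a different order of magnitude.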

Last update: with transport.compress on, we are seeing the cost benefits we hoped for, with only a slight increase in CPU. Still not sure what was happening in the lab tests, but in production we seem to be at an acceptable data rate.
