This is a small Elastic Cloud deployment with 3 hot nodes (2GB RAM / 30GB disk each) and a single frozen node (4GB RAM, 320GB disk cache), plus a small (single hot node) monitoring deployment. The May bill (by which point the configuration was mostly settled) came in with inter-node transfer at 25% of the total. Because we've seen that monitoring can be quite expensive, and because there seems to be no way to break down costs within a deployment (you can see cost by "product" or by deployment, but apparently not by product for a particular deployment), I turned off monitoring metrics on June 4th to see whether that would impact the daily costs.
Turning off monitoring metrics did cut total inter-node transfer by more than 50%, so that much is good. (Apparently sending the monitoring metrics from the production deployment to the monitoring deployment counts as inter-node transfer, which is at least better than it counting as data transfer out and in.)
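For anyone wanting to try the same experiment: I did it through the Cloud console, but a rough sketch of the API route would look like the below. The endpoint and credentials are placeholders, and `xpack.monitoring.collection.enabled` only controls the legacy self-managed collection path (on Elastic Cloud the usual route is the deployment's "Logs and metrics" settings), so treat it as illustrative only.

```python
import requests

# Placeholder endpoint/credentials -- substitute your deployment's values.
ES_URL = "https://my-deployment.es.us-east-1.aws.found.io:9243"
AUTH = ("elastic", "<password>")

# Legacy self-managed monitoring collection is controlled by this dynamic
# cluster setting; on Elastic Cloud the equivalent is normally the
# "Logs and metrics" configuration in the console instead.
resp = requests.put(
    f"{ES_URL}/_cluster/settings",
    json={"persistent": {"xpack.monitoring.collection.enabled": False}},
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())
```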
But it's curious to me that over the last 3 days, the billing console shows data transfer in at 14-15GB/day and inter-node transfer at 135-145GB/day: more than 9x the data transfer in. Is it normal for inter-node transfer to be 9x the data transfer in? I would have thought it reasonable for it to roughly equal transfer in, to account for the replica copy, but 9x seems to imply a whole lot of chatter between the nodes. What causes this? Is there anything to be done about it?
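To make the expectation concrete, here's the back-of-envelope math I had in mind (round, illustrative numbers rather than exact billing figures):

```python
# Rough numbers from the billing console (rounded for illustration).
data_in_gb_per_day = 15        # "Data Transfer In" per day
replicas_per_index = 1         # one replica copy per index

# Naive expectation: each ingested GB gets forwarded once per replica copy,
# so replication alone should produce roughly this much inter-node transfer.
expected_gb_per_day = data_in_gb_per_day * replicas_per_index   # ~15 GB/day

observed_gb_per_day = 140      # what the console actually reports
print(f"expected ~{expected_gb_per_day} GB/day, observed ~{observed_gb_per_day} GB/day")
print(f"ratio vs data in: {observed_gb_per_day / data_in_gb_per_day:.1f}x")   # ~9.3x
```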
I did have a 2-way production deployment, but the hosting cost of 3x 30GB was cheaper than 2x 60GB while providing sufficient hot storage. At the moment it looks like I might be able to go back to 2x 30GB, though. Would that reduce the inter-node transfers? Either way, there's only a single replica per index.
Also, I have seen cases where users reconfigured their clusters, perhaps even several times, which can cause a lot of shard migrations that add up in inter-node / inter-zone transfer charges.
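If you want to check whether shard movements are in flight right now, something like this sketch will list active recoveries (placeholder endpoint/credentials); peer recoveries between nodes are the kind of movement that shows up as inter-node transfer:

```python
import requests

# Placeholder endpoint/credentials -- substitute your deployment's values.
ES_URL = "https://my-deployment.es.us-east-1.aws.found.io:9243"
AUTH = ("elastic", "<password>")

# List only recoveries currently in flight, with the columns that matter
# for spotting shard movement between nodes.
resp = requests.get(
    f"{ES_URL}/_cat/recovery",
    params={
        "active_only": "true",
        "v": "true",
        "h": "index,shard,type,stage,source_node,target_node,bytes_percent",
    },
    auth=AUTH,
)
print(resp.text)
```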
A normal steady-state observability cluster can average around 5-15% in DTS charges, but it depends a lot on the cluster / index / ILM design and implementation.
No help from support. They apparently consider that consultative support, and they only provide break/fix support, even at the "enterprise" level. They wouldn't even comment on whether inter-node transfer of 5-10x the ingest data is to be expected or not. Very frustrating, but consistent with my experience with support: if you have nodes down, they can help get them back up, but anything beyond that is pretty limited.
I work with observability clusters every day, and that is unusual in my experience.
As an example, here are my observability clusters: 3 x 8GB RAM hot nodes with a frozen tier, as a matter of fact. Mine are on GCP... (this includes another cluster as well).
Data in vs. inter-node is about 1.5x, which is about what I expect... This is for my last month, May.
By the way, this includes my monitoring setup: metrics plus logs shipped to my small monitoring cluster.
Our May numbers are below. Compared to your example, we had less than half the transfer in but almost 40% more inter-node than you did. That also includes the monitoring cluster for that month. The absolute numbers in June seem better, presumably since I turned off monitoring on the 4th, but the ratio is still unexpected.
May:
Data transfer and storage:

| Charge | Quantity | Rate | Cost |
|---|---|---|---|
| AWS Data Transfer In (per GB) | 3054.7 GB | 0.0000 per GB | $0.00 |
| AWS Data Transfer Inter-Node (per GB) | 13837.0 GB | 0.0160 per GB | $221.39 |
| AWS Data Transfer Out (per GB) | 158.4 GB | 0.0320 per GB | $5.07 |
| AWS Snapshot Storage API (1K Requests) | 21264688 requests | 0.0018 per 1k requests | $38.28 |
| AWS Snapshot Storage (per GB-month) | 176.7 GB per month | 0.0330 per GB | $5.83 |
| **Total** | | | **$270.57** |
Last week (June 10-16):
Data transfer and storage:

| Charge | Quantity | Rate | Cost |
|---|---|---|---|
| AWS Data Transfer In (per GB) | 166.3 GB | 0.0000 per GB | $0.00 |
| AWS Data Transfer Inter-Node (per GB) | 1239.9 GB | 0.0160 per GB | $19.84 |
| AWS Data Transfer Out (per GB) | 14.4 GB | 0.0320 per GB | $0.46 |
| AWS Snapshot Storage API (1K Requests) | 3039894 requests | 0.0018 per 1k requests | $5.47 |
| AWS Snapshot Storage (per GB-month) | 19.9 GB per month | 0.0330 per GB | $0.66 |
| **Total** | | | **$26.43** |
That's inter-node transfer at about 7.5x the data transfer in. I'll PM you the account id (assuming I can figure out how to do that).
FWIW, support has still been unable to answer the question. I got billing support involved, since it seems that if you're billing for something, you should be able to identify what that thing is. So far, no luck there. They were able to identify that most of the inter-node traffic is coming from the hot nodes, but that tells us relatively little. I've asked for the other side of that (where the traffic is going to), but so far haven't gotten a response.
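In the meantime, the closest I've found to answering that myself is the per-node transport counters. A sketch (placeholder endpoint/credentials); the counters are cumulative since node start, so sampling them twice and diffing gives a rough per-node send/receive rate:

```python
import requests

# Placeholder endpoint/credentials -- substitute your deployment's values.
ES_URL = "https://my-deployment.es.us-east-1.aws.found.io:9243"
AUTH = ("elastic", "<password>")

# Transport-layer bytes sent/received by each node (cumulative since node start).
stats = requests.get(f"{ES_URL}/_nodes/stats/transport", auth=AUTH).json()
for node in stats["nodes"].values():
    t = node["transport"]
    tx_gib = t["tx_size_in_bytes"] / 2**30
    rx_gib = t["rx_size_in_bytes"] / 2**30
    print(f'{node["name"]:<30} tx={tx_gib:8.2f} GiB  rx={rx_gib:8.2f} GiB')
```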
Feedback from "the devs" suggested reducing the snapshot frequency from every 30 minutes, so I changed it to every 2 hours. That may have reduced the inter-node transfer by 10%, but it's hard to say for sure: there were several days at about 210 GB/day (which was itself recently up from about 150 GB/day), and after the snapshot frequency change it dropped to about 190 GB/day. So maybe that's good, but what raised it from 150 to 210 is a mystery. (And why it was even 150 GB is a mystery.)
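For reference, the schedule change itself is just the SLM policy's cron expression. A sketch of doing it through the API (placeholder endpoint/credentials; I'm assuming the built-in Cloud policy name `cloud-snapshot-policy`, so adjust if yours differs):

```python
import requests

# Placeholder endpoint/credentials -- substitute your deployment's values.
ES_URL = "https://my-deployment.es.us-east-1.aws.found.io:9243"
AUTH = ("elastic", "<password>")
POLICY_ID = "cloud-snapshot-policy"   # assumed name of the built-in Cloud policy

# Fetch the existing policy definition, change only the schedule, and PUT the
# whole definition back (SLM updates replace the policy rather than patch it).
existing = requests.get(f"{ES_URL}/_slm/policy/{POLICY_ID}", auth=AUTH).json()
policy = existing[POLICY_ID]["policy"]
policy["schedule"] = "0 0 0/2 * * ?"  # Elasticsearch cron: every 2 hours

resp = requests.put(f"{ES_URL}/_slm/policy/{POLICY_ID}", json=policy, auth=AUTH)
resp.raise_for_status()
print(resp.json())
```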
Back at the beginning of the month I had disabled the metrics logging for the environment, and that dropped it from over 300GB/day to around 150GB/day. So that much is good. (Not having metrics is not ideal, but...)
What I can say for sure is that the Inter-node data transfer is not directly tied to the Data Transfer in.
The June bill was better than the May bill, presumably because I turned off those metrics: Inter-node transfer was only a bit over 18% of the bill. That we still can't identify what drives that cost is frustrating though.
The short answer is that apparently nobody knows, or nobody can or will answer. The primary recommendation seems to be to engage a consulting contract, but that's (a) relatively expensive and (b) presumably no guarantee of getting an actual answer.
Interestingly, the daily data transfer did drop significantly after upgrading to 8.3.1. That version apparently contained a bug that disabled a number of "rules". While I think I followed the appropriate remediation for those, data transfer is still lower than it was, now running at around 110GB/day, which is certainly far better than the ~300GB/day it was when I started on this quest. It leads me to believe that some rule (or some set of rules) somewhere is driving a good chunk of it. Which rules, and why they're configured the way they are, remain a mystery.
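For anyone wanting to poke at the same thing, listing the rules and their enabled state via the Kibana API is at least a starting point for correlating rule changes with the daily transfer numbers (a sketch; the Kibana URL and credentials are placeholders):

```python
import requests

# Placeholder Kibana endpoint/credentials -- substitute your deployment's values.
KIBANA_URL = "https://my-deployment.kb.us-east-1.aws.found.io:9243"
AUTH = ("elastic", "<password>")

# List alerting rules with their type and enabled state.
resp = requests.get(
    f"{KIBANA_URL}/api/alerting/rules/_find",
    params={"per_page": 100},
    auth=AUTH,
)
for rule in resp.json()["data"]:
    state = "enabled" if rule["enabled"] else "disabled"
    print(f'{rule["rule_type_id"]:<40} {rule["name"]:<40} {state}')
```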