Indexing performance - Choppy graphs - FreeBSD ZFS

LogBabel · October 31, 2019, 6:12pm

Hello all.

Seeking some ideas to help diagnose our cluster performance. I'll start with the specs:

Data nodes: 5x 64-core 64GB RAM. Oracle Java 8. FreeBSD. ZFS RAID10 36TB.
Search/ingest nodes: 2x with same specs as above, minus the large RAID10 array.
Kibana nodes: Run as VMs on the search/ingest nodes.

The problem I'm having is poor indexing performance. The indexing graph looks like this: /////// whereas it used to look like this: ---------^---_---^-----. I suspected this may be caused by Java garbage collection.

What changed: The systems previously ran Debian Linux with Oracle Java 11 and a giant 72TB RAID0. I'm not sure the performance gains from Java 8 to Java 11 are that profound, although maybe this is proof. The GC messages in the ES logs are too infrequent to correlate with the graph, though.

I've used disk benchmarking tools to verify the ZFS volumes can sustain at least 300MB/s (I was seeing 800MB/s previously with Debian/RAID0). In other words, the problem seems completely isolated to Elasticsearch and/or Java performance.

Unfortunately, for the time being it doesn't seem Java 11 or Java 13 will run on FreeBSD, at least not without some effort. This is still a work in progress. Switching from OpenJDK 8 to Oracle Java 8 did show some improvement.

In the meantime I'm interested to know if anyone has ES- or Java-specific tuning ideas to resolve this. This cluster used to ingest 35k EPS and now it seems to barely handle 20k EPS. The only correlation I've found so far is that the ES index memory fluctuates at the same interval, and I'm not certain why at the moment.

Thanks !

Christian_Dahlqvist · October 31, 2019, 6:25pm

Disk throughput numbers are not always very useful as they often assume large consecutive writes while Elasticsearch load primarily results in a lot of random reads and small writes. What do disk utilisation and iowait look like on the nodes?
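
To illustrate the difference between a sequential throughput figure and small-write behaviour, here is a minimal sketch that times small random writes, each followed by an fsync. The function name and default sizes are made up for illustration, and this is only a rough proxy, not a model of Elasticsearch's actual I/O pattern.

```python
import os
import random
import tempfile
import time


def random_write_ops_per_sec(directory, file_size=8 * 1024 * 1024,
                             block_size=4096, n_writes=200, seed=0):
    """Time small random writes, each followed by fsync, in `directory`.

    All defaults are illustrative assumptions, not tuned values.
    """
    rng = random.Random(seed)
    payload = os.urandom(block_size)
    n_blocks = file_size // block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, file_size)  # extend the scratch file to its full size
        start = time.perf_counter()
        for _ in range(n_writes):
            # Jump to a random block-aligned offset, write one block, flush it.
            os.lseek(fd, rng.randrange(n_blocks) * block_size, os.SEEK_SET)
            os.write(fd, payload)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return n_writes / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Point this at a directory on the volume under test instead of a temp dir.
    with tempfile.TemporaryDirectory() as d:
        print(f"{random_write_ops_per_sec(d):.0f} fsync'd writes/s")
```

Running it against a directory on the data volume and comparing the result with the sequential number gives a feel for how far apart the two can be.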

LogBabel · October 31, 2019, 6:46pm

Disk utilization is low (1%) and there are only four indexes open. Only one index is being written to and it is comprised of 5 shards. I'm looking into finding the iowait information; it's not currently visible with the tools I'm using.

I wondered if it's an issue with threading and the write operations are blocking for some reason (this might relate to iowait). I did consider increasing the "processors" count in elasticsearch.yml to see if additional threads have any effect, and I'm reviewing the ES metrics for thread pools, buffers, etc.
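
One way to review the write thread pool is the `_cat/thread_pool` API, which reports active, queued, and rejected tasks per node. A minimal sketch follows, assuming the cluster answers on `http://localhost:9200` without authentication (an assumption, adjust as needed); the helper names and the sample text are hypothetical, not output from this cluster.

```python
import urllib.request


def fetch_cat_thread_pool(base_url="http://localhost:9200", pool="write"):
    """Fetch the cat thread pool table as text (base_url is an assumption)."""
    url = (f"{base_url}/_cat/thread_pool/{pool}"
           "?v&h=node_name,name,active,queue,rejected")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def parse_cat_thread_pool(text):
    """Turn the whitespace-separated table into a list of dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        for key in ("active", "queue", "rejected"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


def nodes_with_backpressure(rows):
    """Names of nodes whose pool has queued or rejected tasks."""
    return [r["node_name"] for r in rows if r["queue"] > 0 or r["rejected"] > 0]


# Illustrative sample only, not real output from the cluster in this thread.
SAMPLE = """\
node_name name  active queue rejected
elastic1  write      2     0        0
elastic3  write     64   180       12
"""

if __name__ == "__main__":
    print(nodes_with_backpressure(parse_cat_thread_pool(SAMPLE)))
```

A growing queue or a non-zero rejected count would point at the write path being saturated or blocked; all zeros on every node would suggest looking elsewhere.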

Christian_Dahlqvist · October 31, 2019, 6:59pm

How much data do you have in the cluster? Any evidence in the logs of long or frequent GC?

LogBabel · October 31, 2019, 7:02pm

At the moment it is roughly 0.7 TB. Previously it was operating ok with 30+TB and roughly 25/120 open/closed indices.

The Java GC messages are infrequent, but when they do appear they report roughly 450ms. Another metric I'm looking into is the open files count and limit, but I think ES would generate an error about this.

LogBabel · October 31, 2019, 7:12pm

One error message I think is related pertains to memory allocation errors on the indexing front-end nodes:

[2019-10-31T19:06:25,169][WARN ][o.e.t.OutboundHandler ] [elastic1] send message failed [channel: Netty4TcpChannel{localAddress=/10.1.1.1:49857, remoteAddress=10.1.1.6 java.io.IOException: Cannot allocate memory
at sun.nio.ch.FileDispatcherImpl.writev0(Native Method) ~[?:?]

This is sometimes preceded by a disconnect notice:

[2019-10-31T19:01:30,340][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [elastic1] failed to execute on node [xyz]
org.elasticsearch.transport.NodeNotConnectedException: [elastic6][10.1.1.6:9300] Node not connected

Although when I checked the memory statistics on the indexing front-end I didn't find anything suspicious. Java Heap is set to 30G and locked memory is enabled.

LogBabel · October 31, 2019, 7:41pm

Seems the master nodes are in a state of flux, failing and rediscovering. Odd, because the cluster state remains green, although I can see how this would stall write operations.

CPU/memory utilization on all nodes is low, though, so I'm not sure which system resource is at fault. I'm inclined to think there's a tunable somewhere that will resolve this.

Christian_Dahlqvist · October 31, 2019, 10:10pm

If you look at the support matrix FreeBSD is not an officially supported platform while Debian is. I wonder if this could be related.

LogBabel · October 31, 2019, 10:23pm

Thanks, I didn't notice that. Breaking new ground then. Although in theory Java makes platforms irrelevant, I can see how support only applies to certain systems. FreeBSD does have some differences with regard to threading and Java.

I've a few more ideas to try before giving up and trying VirtualBox with Linux or a reinstall.

Thanks for the help.

LogBabel · November 1, 2019, 5:09am

Problem resolved: classic network interface issue with auto-negotiation. Works great now, and using OpenJDK 13 (correction to earlier post).
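
For anyone chasing a similar symptom, the negotiated link can be read from the `media:` line of `ifconfig` on FreeBSD. The post does not say which interface was affected or what it had negotiated, so the sketch below is only an assumption-laden illustration: the helper names, the 1000 Mbit/s expectation, and the sample text are all hypothetical.

```python
import re

MEDIA_RE = re.compile(r"^\s*media:\s*(?P<media>.+)$")
SPEED_RE = re.compile(r"(\d+)(G?)base", re.IGNORECASE)


def parse_media(ifconfig_text):
    """Map interface name -> negotiated speed (Mbit/s) and duplex flag."""
    result = {}
    iface = None
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            iface = line.split(":", 1)[0]  # unindented line starts a new interface
            continue
        m = MEDIA_RE.match(line)
        if not (m and iface):
            continue
        media = m.group("media")
        speed = SPEED_RE.search(media)
        mbps = None
        if speed:
            # A "G" before "base" (as in 10Gbase-T) means gigabits.
            mbps = int(speed.group(1)) * (1000 if speed.group(2) else 1)
        result[iface] = {"mbps": mbps, "full_duplex": "full-duplex" in media}
    return result


def suspicious_links(ifconfig_text, expected_mbps=1000):
    """Interfaces that are slower than expected or not full duplex."""
    return sorted(
        name for name, info in parse_media(ifconfig_text).items()
        if info["mbps"] is None
        or info["mbps"] < expected_mbps
        or not info["full_duplex"]
    )


# Illustrative sample only, not taken from the nodes in this thread.
SAMPLE = """\
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tinet 10.1.1.1 netmask 0xffffff00 broadcast 10.1.1.255
\tmedia: Ethernet autoselect (1000baseT <full-duplex>)
\tstatus: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tmedia: Ethernet autoselect (100baseTX <half-duplex>)
\tstatus: active
"""

if __name__ == "__main__":
    print(suspicious_links(SAMPLE))
```

A link that negotiated a lower speed or half duplex would fit the symptoms earlier in the thread: low CPU and disk utilization, transport send failures, and nodes dropping out and rejoining.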

system · November 29, 2019, 5:19am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
