Elasticsearch on NAS

We have deployed ES7 to an Oracle PCA environment. This consists of several Oracle VMs connected to an Oracle NAS filer. Initially, the ES data partitions were mounted as NFS, but we are hitting the filer IOPS limits. Our architect is suggesting that we mount the drives as iSCSI instead of NFS. Any opinions on this? Any gotchas (e.g., should we beef up the VMs)?

Though I don't believe iSCSI would be better, testing would tell you the actual performance.

Use fio to test IOPS and ioping to test latency.
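
For example, something along these lines, where the data path is hypothetical and should point at the filesystem actually backing your Elasticsearch data:

```
# Round-trip latency of small requests against the NFS-backed data path
ioping -c 20 /var/lib/elasticsearch

# Quick random-read IOPS check on the same filesystem
fio --name=randread --directory=/var/lib/elasticsearch --ioengine=libaio \
    --rw=randread --bs=4k --size=1g --iodepth=16 --direct=1 \
    --time_based --runtime=60
```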

Are the disks in the NAS filer solid state or spinning disks?

They're spinning disks. This is an Oracle ZS7 appliance (which I know does not tell you all that much because they're configurable). We are considering moving to SSD, but let us ignore that for the purpose of my post.

I'm not sure you can fully ignore it as this could be the source of your issue.

If you have a cluster with heavy indexing, spinning disks will hurt Elasticsearch performance; disk type is one of the things you need to check and change when tuning for indexing.

But can you give more context about your issue? You said that you are hitting the IOPS limit of the filer, which could mean that your disks cannot write fast enough. How is this reflected on your nodes? Is the CPU load increasing? Are you getting lag in your data ingestion? Are you using replicas?

I don't think that changing from NFS to iSCSI will make much difference, and increasing the CPU/RAM specs of the VMs will probably not change much either, as the issue seems to be I/O bound.

Getting good performance when using remote storage is tricky, as it may need a lot of tuning and testing.

It would help if you described your use case and expected workload, e.g. is it indexing or query heavy?

If you use fio to test performance, make sure to test a mixed workload with reasonably small random reads and writes with fsync. Elasticsearch generally does not perform very large sequential reads and writes, which tend to produce flattering performance numbers.
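
As a sketch of what that could look like (the directory, sizes, and job counts are assumptions to adjust for your environment):

```
# Mixed 50/50 random read/write with small blocks and periodic fsync,
# closer to Elasticsearch's I/O pattern than large sequential transfers
fio --name=es-mixed --directory=/var/lib/elasticsearch \
    --ioengine=libaio --rw=randrw --rwmixread=50 --bs=4k --size=4g \
    --numjobs=4 --iodepth=16 --direct=1 --fsync=32 \
    --time_based --runtime=120 --group_reporting
```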

If your use case is indexing heavy, e.g. logs or metrics ingestion, be aware that Elasticsearch is very I/O intensive and that storage, as pointed out, is often the limiting factor. I would recommend this video on storage performance.

I understand that SSD is better than spinning disk and that SAN is better than NAS.

However, my question was: Within the limitations of the existing hardware, do you believe that iSCSI (over IP network) will perform better than NFS? So far, the reaction has been negative.

The application is indexing heavy. It needs to index approximately 10,000 documents per second. We have the cluster configured with 3 masters and 3 data nodes. The data nodes have ~50 GB of RAM. One concern is that, if we move to iSCSI, we will need to increase the RAM on the data nodes.

NFS has historically had issues that can lead to data corruption and instability. I am not sure whether that is still the case with later versions. Changing to iSCSI may help with this, but I doubt it will dramatically improve storage performance.

How are you indexing into Elasticsearch? Make sure you are using reasonably large bulk requests and that you are not indexing into a large number of shards, as that reduces the effectiveness of bulk operations.

If you want to see what your current hardware can do (or at least something close to it), I would recommend creating an index with 3 primary shards (one per data node) and one replica. Make sure you are using a bulk size of around 1000 documents (depending on document size) and index all data into this single index.
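
As a minimal sketch (the index name is made up, and this assumes a node reachable on localhost:9200):

```
# One primary shard per data node, plus one replica
curl -X PUT "localhost:9200/ingest-test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'
```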

Yeah, we are doing all the usual things. We are using bulk requests from Logstash. When the filesystem is behaving, all the LS indicators are good (no excessive CPU, no queue backups, no timeouts, etc.).

The motivation for switching to iSCSI is that the number of I/O operations to the filer is exceeding the vendor limits. The hope is that switching to iSCSI will reduce this number because iSCSI is allegedly less chatty than NFS. Specifically, we are seeing a large number of read operations (four times the number of writes). Since we are not doing a lot of reads in our code (in fact, we modified our code to minimize the number of reads), we think the reads are being done by ES itself, or as a byproduct of NFS.

Are you allowing Elasticsearch to assign the document IDs when indexing? If not, each indexing operation is essentially a potential update, as Elasticsearch must check whether the document already exists. This adds additional read IOPS.
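
For illustration, letting Elasticsearch generate the IDs just means omitting _id from the bulk action lines (the index name and fields here are hypothetical):

```
# No "_id" in the action metadata, so each operation is a pure append
# rather than a potential update of an existing document
curl -X POST "localhost:9200/ingest-test/_bulk" -H 'Content-Type: application/x-ndjson' -d'
{ "index": {} }
{ "message": "first event", "uuid": "af3c0c1e" }
{ "index": {} }
{ "message": "second event", "uuid": "b91d22f7" }
'
```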

Note that Elasticsearch/Lucene writes immutable segments that are later merged into larger ones. This results in a good amount of reading and rewriting of data and can explain at least some of the reads you are seeing.

How many indices/shards are you actively indexing into?

Have you run iostat -x on the nodes while indexing?
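
For example, run on each data node while indexing is under way (device names will differ in your VMs):

```
# Extended device stats every 5 seconds; watch r/s vs w/s, await, and %util
iostat -x 5
```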

The default batch size in Logstash is, as far as I can recall, 125, so it may be worthwhile increasing it to 1000 to see if it makes any difference.
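
As a quick way to test this, the batch size can be overridden on the command line (the config path below is hypothetical); the same setting lives in logstash.yml or pipelines.yml as pipeline.batch.size:

```
# -b is shorthand for --pipeline.batch.size; compare throughput at 1000 vs the default
bin/logstash -b 1000 -f /etc/logstash/conf.d/es-output.conf
```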

For the high volume data, we are using UUIDs. I cannot remember if we are letting ES assign them. We have our own doc ID field in addition to _id. It may be worth exploring a strategy of letting ES assign the ID and copying it to our ID field using a script or an ingest pipeline, but that's something we can do down the line.

The number of shards depends on the document type. Some indices are configured with 1 primary shard, others with as many as 5 primaries. This is all based on our estimate of the volume of data in each index. It's a good reminder about the way segments are merged.

We ran nfsiostat to look at I/O performance. The RTT for reads is much higher than for writes and is, furthermore, highly variable (from < 1 ms to 200 ms). We think this is a by-product of exceeding the vendor limits.

We did play with LS batch sizes. The current setting is a result of those experiments.

Setting your own document IDs has a significant impact on indexing throughput, especially as indices get larger, so that is the first thing I would change. Using your own IDs adds a lot of reads.

Reducing the number of shards you are writing to is the second priority. If you cannot do this, make sure you have separate Logstash pipelines (see pipelines.yml) for each index so that each bulk request targets as few shards as possible.
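
A rough sketch of that separation in pipelines.yml (pipeline IDs and config paths are made up), written here as a shell heredoc:

```
# One Logstash pipeline per index/data type so each bulk request hits few shards
cat >> /etc/logstash/pipelines.yml <<'EOF'
- pipeline.id: app-logs
  path.config: "/etc/logstash/conf.d/app-logs.conf"
- pipeline.id: metrics
  path.config: "/etc/logstash/conf.d/metrics.conf"
EOF
```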

If you want to keep UUIDs and the current index structure, I would not be surprised if you need SSDs, as this is nowhere near optimal for slow storage. Also be aware that if you run indexing close to the cluster's limit, there will be few resources left when you eventually need to query the data.

Hmmm good points.

We are running multiple LS instances, one per data type, so that part is OK. I'll have to look into the doc IDs.

Besides what Christian already said, you could also check the index.refresh_interval of your indices; increasing it can help indexing speed.
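
For example, a hypothetical index could be moved to a 30-second refresh like this:

```
# Fewer refreshes mean fewer small segments written during heavy indexing
curl -X PUT "localhost:9200/ingest-test/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "refresh_interval": "30s" } }'
```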

Yes, we have already adjusted index.refresh_interval.

Overall, my takeaway is:

  1. Modify the indexing logic to allow ES to generate IDs. We'll have to add special logic to handle recovery scenarios. Since we need our own ID field, we will have to write some logic to copy the generated value of _id to our custom ID field.

  2. I got new information this morning indicating that, even with this change, we will most likely blow through the vendor IO limit, so we still need to modify the storage architecture.

I do not think copying the auto-generated _id into another field at ingest time is possible. You can, however, still store your UUID in a field in the document even if the document ends up having a different _id.

If you manage to switch to SSDs, that may require fewer changes.

I know that it has been said, but I will reiterate the point anyway. I cannot stress enough how much spinning HDDs negatively impact Elasticsearch ingest performance. Make those HDDs network-attached and the added latency makes a bad situation even worse.

Above @Christian_Dahlqvist linked to the video I made a few years ago. Since then I have investigated this topic further with many more variations and more powerful hardware. The results...

  • HDDs are still terrible terrible terrible for ingest (although they can provide some value for cold/frozen tier).
  • SSDs... ANY SSDs... even cheap SATA SSDs... make all of the difference.
  • NVMe vs. SAS/SATA SSDs - no real-world difference
  • PCIe 3 vs. PCIe 4 - no real-world difference
  • RAID-0 of smaller SSDs vs. a single large SSD - no real-world difference

SSD makes the difference, especially when locally attached. It doesn't have to be an exotic and expensive configuration.

The only somewhat fancy storage that can make a difference in some cases is the use of Optane drives in hot nodes. Ingest performance is about the same, but the Optane drives have far lower read latency while the drive is under a high write load. This means that the hot nodes will be able to read data faster to answer queries even with a high ingest load. That said, 10K docs/sec on a 3-node single-tier cluster is very unlikely to benefit from Optane, so just get some simple SSDs.

Maybe it is time for an update to that video that demonstrates clearly all of the above.
