Elasticsearch on ZFS best practice

Hello,

I'm running an Elasticsearch node on a FreeBSD server, on top of ZFS storage. For now I've assumed that ES is smart and manages its own cache, so I've disabled the primary cache for data, leaving only metadata cacheable. The last thing I want is to have data cached twice, once in the ZFS ARC and a second time in the application's own cache. I've also disabled compression:

$ zfs get compression,primarycache,recordsize zdata/elasticsearch
NAME                 PROPERTY      VALUE     SOURCE
zdata/elasticsearch  compression   off       local
zdata/elasticsearch  primarycache  metadata  local
zdata/elasticsearch  recordsize    128K      default
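
For the record, the two non-default properties above were set with plain zfs commands (recordsize is still the inherited 128K default):

$ zfs set compression=off zdata/elasticsearch
$ zfs set primarycache=metadata zdata/elasticsearch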

It's a general-purpose server (web, MySQL, mail, ELK, etc.). I'm not looking for the absolute best ES performance, I'm looking for the best use of my resources.
I have 16 GB of RAM, and I plan to put a limit on the ARC size (currently consuming 8.2 GB of RAM) so I can mlockall the ES memory. But I don't think I'll go the RAM-only storage route (http://jprante.github.io/applications/2012/07/26/Mmap-with-Lucene.html) as I'm running only one node.
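
Concretely, I'm thinking of something along these lines (a rough sketch; the 6 GB ARC cap and 4 GB heap are just placeholders I haven't settled on yet, and ES_HEAP_SIZE / bootstrap.mlockall are the setting names of the ES 1.x line):

# /boot/loader.conf -- cap the ZFS ARC (value in bytes, here 6 GB)
vfs.zfs.arc_max="6442450944"

# environment for the ES startup script -- JVM heap size
ES_HEAP_SIZE=4g

# elasticsearch.yml -- lock the heap in RAM so it cannot be swapped out
bootstrap.mlockall: true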

How can I estimate the amount of memory I must allocate to the ES process?

Should I switch primarycache=all back on despite ES already caching data?

What is the best ZFS record/block size to accommodate Elasticsearch/Lucene IOs?

Thanks,
Patrick


No one?

There is not much to add:

  • estimating ES process memory really depends on individual requirements
    (bulk indexing, field cache, filters/facets, concurrent queries) - just take
    a portion of your data, measure memory/CPU/disk I/O, and extrapolate - best
    is to add nodes if resources get tight. The rule of thumb is to give 50% of
    RAM to the ES heap

  • you are correct, primarycache=all may buffer more data than required
    (useful for maximum ZFS performance). You have already limited the ARC
    size. Use mmapfs for the ES store; this should work best with
    primarycache=metadata (see the sketch after this list)

  • ZFS recordsize for JVM apps like ES should match the kernel page size,
    which is 4k. With ES it is important to match the ZFS recordsize to the
    kernel page size and the sector size of the drive so there is no skew in
    the number of I/O operations. Check for yourself whether higher values like
    8k / 16k / 64k / 256k give better throughput on the ES data folder. On
    certain striped HW RAID devices that may be the case, but I doubt it (ZFS
    internal buffering compensates for this effect, and write throughput will
    suffer if the recordsize is too high)

  • and you should switch off atime on the ES data folder
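
A minimal sketch of those last three points, using your dataset name (index.store.type is how the store is selected in ES 1.x):

# elasticsearch.yml
index.store.type: mmapfs

$ zfs set recordsize=4k zdata/elasticsearch    # only affects files written after the change
$ zfs set atime=off zdata/elasticsearch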

Jörg


Hi Jörg,

On 21 May 2014, at 13:49, joergprante@gmail.com wrote:

  • estimating ES process memory really depends on individual requirements
    (bulk indexing, field cache, filters/facets, concurrent queries) - just take
    a portion of your data, measure memory/CPU/disk I/O, and extrapolate - best
    is to add nodes if resources get tight. The rule of thumb is to give 50% of
    RAM to the ES heap

I'm not really sure I understand what you mean by "just take a portion of your data". Am I supposed to run a query in Kibana that returns a known amount of data, measure memory/CPU/IO during the request, and then extrapolate to get the amount of those resources needed to return all my data?

  • you are correct, primarycache=all may buffer more data than required
    (useful for maximum ZFS performance). You have already limited the ARC
    size. Use mmapfs for the ES store; this should work best with
    primarycache=metadata

OK, I was mistaken about mmapfs; I've read some documentation and now it looks a bit clearer to me.

  • ZFS recordsize for JVM apps like ES should match the kernel page size,
    which is 4k. With ES it is important to match the ZFS recordsize to the
    kernel page size and the sector size of the drive so there is no skew in
    the number of I/O operations. Check for yourself whether higher values like
    8k / 16k / 64k / 256k give better throughput on the ES data folder. On
    certain striped HW RAID devices that may be the case, but I doubt it (ZFS
    internal buffering compensates for this effect, and write throughput will
    suffer if the recordsize is too high)

My FS is (or should be) properly aligned with the physical 4K blocks of the HDD, so it should be quite efficient to move to a 4k-recordsize ZFS dataset if that's best for ES.
I'll make some I/O measurements to make sure performance doesn't go down.
Every page size is 4k (FreeBSD 9.x):

$ sysctl -a | egrep 'page_?size:'
vm.stats.vm.v_page_size: 4096
hw.pagesize: 4096
p1003_1b.pagesize: 4096
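
My plan is simply to watch the pool with zpool iostat while replaying the same Kibana query, something like:

$ zpool iostat zdata 5    # one line of operations/bandwidth figures every 5 seconds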

  • and you should switch off atime on the ES data folder

I can do that too.

Thank you for your reply.


After changing the recordsize to 4k and moving my data away and back so it was rewritten with the new block size, I see an impressive difference in IO and bandwidth usage. I've tested the same Kibana request (get everything, with filter program:apache, for the last 30 days).
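
For reference, the procedure was roughly this (ES stopped first; /zdata/tmp is just an illustrative destination on another dataset with enough free space, assuming zdata/elasticsearch is mounted at /zdata/elasticsearch):

$ zfs set recordsize=4k zdata/elasticsearch
$ mv /zdata/elasticsearch/* /zdata/tmp/     # recordsize only applies to newly written files
$ mv /zdata/tmp/* /zdata/elasticsearch/     # moving back across datasets rewrites them with 4k records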

before (recordsize=128k -> variable)

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zdata        661G  1.17T      3     41  65.2K   267K
zdata        661G  1.17T    554     41  47.7M   351K
zdata        661G  1.17T    424     24  43.5M   725K
zdata        661G  1.17T    465     54  50.4M   838K
zdata        661G  1.17T      2     36  54.8K   179K

after (recordsize=4k -> fixed)

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zdata        661G  1.17T      1     16  6.80K  72.4K
zdata        661G  1.17T  1.46K     15  5.88M  64.4K
zdata        661G  1.17T      3     42  12.4K   243K

Display in Kibana does not feel faster, so I guess I have another bottleneck somewhere (maybe the network; it's a remote server accessed over DSL). On the disk bandwidth side, though, this is clearly a huge win.

Patrick


What I want to say is: disabling or playing around with primarycache in ZFS is really a bad idea.
Setting primarycache=metadata is bad/wrong practice for Elasticsearch.

We took a heavy performance hit after setting primarycache=metadata on the advice of this thread, and even with an L2ARC (secondarycache=all) performance was just as bad.
It took a few days to figure out and fix the problem, as we are new to both systems.

First, from https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html, mmapfs maps to Lucene's MMapDirectory, and mmap depends on the OS file cache / virtual memory to read quickly.
http://elasticsearch-users.115913.n3.nabble.com/How-to-run-elastic-from-memory-td4031249.html

Second, ZFS only fills the L2ARC with entries evicted from the ARC, so setting secondarycache=all is useless when primarycache=metadata is set.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209396
http://weblog.etherized.com/posts/185.html

So we got a very low hit rate on the ARC and L2ARC, and random read IO caused delays on our system.
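
In other words, put the data caching back and check that the ARC hit rate recovers, roughly (dataset name as used earlier in the thread; the arcstats counters are the FreeBSD sysctls):

$ zfs set primarycache=all zdata/elasticsearch
$ sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses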

Hello,

Well, you might want to ignore this thread. Seriously, it's from 2014, based on a FreeBSD release that is no longer supported and an ELK version that probably doesn't have much in common with the current one.

And more importantly, don't take answers for granted without considering the context first. I was not looking for state-of-the-art performance; the context was: "It's a general purpose server (web, mysql, mail, ELK, etc.). I'm not looking for absolute best ES performance, I'm looking for best use of my resources."

Thanks for your insight about primarycache=metadata, which is considered a suitable setting for database workloads with high write throughput. For Elasticsearch, however, the query workload will typically outweigh other workload patterns. So if the ZFS ARC can be made to coexist with the mmap()-cached files of the most recent Lucene 6, it is reasonable to use primarycache=all.

Another ZFS-specific issue on Linux is that Linux reads 128k on each random read from the file system, while ZFS block access will be 4k per call, which leads to a 32x higher random-read IO rate when caching is effectively disabled by primarycache=metadata, resulting in poor performance.

ZFS has quite a learning curve; it took me months to configure ZFS for an Oracle DB on Solaris (which I finally migrated to XFS under Linux).