Looking for some suggestions on how to tackle some performance issues. I'll give a quick summary of the issues we are seeing and then follow up with the details about our configuration:

We are seeing bad GC performance on most of our nodes. Here's an example from node3 (we have 4 nodes):

As you can see, we are stuck around the 75% heap utilization mark. The node is constantly trying to free up space but failing. Here's some output from the GC log:

2017-04-25T15:47:14.529+0000: 4908696.646: [CMS-concurrent-sweep-start]
2017-04-25T15:47:16.381+0000: 4908698.497: [CMS-concurrent-sweep: 1.852/1.852 secs] [Times: user=1.97 sys=0.00, real=1.85 secs]
2017-04-25T15:47:16.381+0000: 4908698.497: [CMS-concurrent-reset-start]
2017-04-25T15:47:16.392+0000: 4908698.509: [CMS-concurrent-reset: 0.011/0.011 secs] [Times: user=0.01 sys=0.00, real=0.02 secs]
2017-04-25T15:47:18.392+0000: 4908700.508: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4827897K(5950720K)] 4989708K(6257408K), 0.0080391 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2017-04-25T15:47:18.400+0000: 4908700.517: [CMS-concurrent-mark-start]
2017-04-25T15:47:18.542+0000: 4908700.659: [GC (Allocation Failure) 2017-04-25T15:47:18.542+0000: 4908700.659: [ParNew
Desired survivor size 17432576 bytes, new threshold 6 (max 6)
- age   1:    2137280 bytes,    2137280 total
- age   2:     600304 bytes,    2737584 total
- age   3:     402168 bytes,    3139752 total
- age   4:     406696 bytes,    3546448 total
- age   5:    1698048 bytes,    5244496 total
: 284668K->8344K(306688K), 0.0117950 secs] 5112565K->4836241K(6257408K), 0.0119079 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2017-04-25T15:47:24.278+0000: 4908706.395: [CMS-concurrent-mark: 5.862/5.878 secs] [Times: user=7.42 sys=0.01, real=5.88 secs]
2017-04-25T15:47:24.278+0000: 4908706.395: [CMS-concurrent-preclean-start]
2017-04-25T15:47:24.295+0000: 4908706.411: [CMS-concurrent-preclean: 0.015/0.016 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
2017-04-25T15:47:24.295+0000: 4908706.412: [CMS-concurrent-abortable-preclean-start]
2017-04-25T15:47:25.372+0000: 4908707.489: [GC (Allocation Failure) 2017-04-25T15:47:25.372+0000: 4908707.489: [ParNew
Desired survivor size 17432576 bytes, new threshold 6 (max 6)
- age   1:    8805664 bytes,    8805664 total
- age   2:    1080224 bytes,    9885888 total
- age   3:     413984 bytes,   10299872 total
- age   4:     399160 bytes,   10699032 total
- age   5:     242432 bytes,   10941464 total
- age   6:    1693864 bytes,   12635328 total
: 280984K->14818K(306688K), 0.0130235 secs] 5108881K->4842715K(6257408K), 0.0131792 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]

Here is a snapshot of the GC activity on all four nodes (light purple is node3):

I'm expecting to see a sawtooth pattern on the heap graph (at the top of this post). I'm seeing this same behavior on all four nodes.

Now for the configuration:

  • We are running v2.3.
  • We have a 4 node cluster with 15 shards and 2 replicas.
  • We have around 625,000 documents.
  • We have our heap set at 6G (-Xms6g -Xmx6g).
  • We are running on 4 EC2 t2.xlarge instances (4 vCPU and 16G memory).
  • Data is stored on an encrypted volume, 100G gp2 with 300 IOPS.
  • We are doing full text searching and some aggregations.

We are consistently indexing new docs and updating/deleting already indexed data. The volume is pretty low, but it's consistent throughout the day:

Happy to provide any more details that might be helpful.

My questions:

  • Why is GC so flat (actually slowly increasing)?
  • What are the best EC2 instances to run?
  • Do we have too many shards? (15 was a guess we made about 2 years ago and we just haven't revisited it)

Hi @Burke,

First of all, your GC log snippet does not contain a full CMS cycle (watch for "CMS Final Remark"). A complete cycle would tell you how much of the old generation has actually been reclaimed. However, based on the graphs, it cannot be that much anyway. Tools like GCViewer can help you analyze GC log files.
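
Even without a full cycle, two lines of the snippet already carry useful numbers. Here is a minimal sketch, assuming the HotSpot CMS/ParNew log layout shown in the question; the function names are made up for illustration, and the promotion figure is only an estimate derived from "young gen freed minus whole heap freed":

```python
import re

# Two fragments copied verbatim from the GC log in the question.
INITIAL_MARK = "[1 CMS-initial-mark: 4827897K(5950720K)] 4989708K(6257408K), 0.0080391 secs]"
PARNEW = ": 284668K->8344K(306688K), 0.0117950 secs] 5112565K->4836241K(6257408K), 0.0119079 secs]"


def old_gen_occupancy(line):
    """Return old gen used / old gen capacity from a CMS-initial-mark line."""
    m = re.search(r"CMS-initial-mark: (\d+)K\((\d+)K\)", line)
    used_kb, capacity_kb = int(m.group(1)), int(m.group(2))
    return used_kb / capacity_kb


def promoted_kb(line):
    """Estimate the KB promoted to the old gen by one ParNew collection.

    The first before->after pair is the young gen, the second the whole heap.
    Whatever the young gen freed that the heap as a whole did not free
    must have moved into the old gen.
    """
    pairs = re.findall(r"(\d+)K->(\d+)K\(\d+K\)", line)
    (young_before, young_after), (heap_before, heap_after) = [
        (int(a), int(b)) for a, b in pairs
    ]
    return (young_before - young_after) - (heap_before - heap_after)


print(f"old gen occupancy: {old_gen_occupancy(INITIAL_MARK):.1%}")
print(f"promoted by ParNew: {promoted_kb(PARNEW)} KB")
```

For these two lines the old gen is about 81% full two seconds after a sweep finished, and the ParNew collection promoted nothing. That would be consistent with an old gen that holds mostly live data rather than garbage, but a heap dump is needed to confirm what that data is.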

I guess you meant to ask "Why is heap usage (roughly) constant?" I am probably stating the obvious, but it looks flat because there is not much garbage to clean up. Maybe most of the heap is occupied by caches? I suggest you take a heap dump and analyze its contents. I think this might help clarify the situation.
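
Before taking a heap dump, a cheaper first check is the node stats API, which reports the memory held by fielddata, the query and request caches, and segments. The sketch below assumes you have already fetched and parsed the JSON response of the node stats endpoint; the field names are what I would expect on 2.x, so please verify them against your own response, and the helper name is hypothetical:

```python
def heap_consumers_by_node(node_stats):
    """Sum known heap consumers per node from a parsed node stats response.

    Assumed layout (verify against your version):
    nodes.<id>.indices.<section>.<memory field>
    """
    fields = {
        "fielddata": "memory_size_in_bytes",
        "query_cache": "memory_size_in_bytes",
        "request_cache": "memory_size_in_bytes",
        "segments": "memory_in_bytes",
    }
    result = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        indices = node.get("indices", {})
        # Missing sections count as 0 so older or newer versions do not crash.
        per_node = {
            section: indices.get(section, {}).get(key, 0)
            for section, key in fields.items()
        }
        per_node["total"] = sum(per_node.values())
        result[node.get("name", node_id)] = per_node
    return result
```

If the totals come out far below the roughly 4.8 GB sitting in the old gen, the caches are probably not the explanation and the heap dump becomes the more important step.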

As for which EC2 instances are best to run: unfortunately, I cannot really help you with that.

Whether 15 shards is too many depends less on the number of documents and more on their sizes. You can check out the Capacity Planning section in the Definitive Guide and the talk Quantitative Cluster Sizing by my colleagues Christian Dahlqvist and Ryan Schneider.

Note that changing the number of shards requires reindexing. When you upgrade to 5.x, you should take a look at the shrink API and the rollover API, which help you reduce the number of shards for old indices if you have time-based indices (as in logging use cases).
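
To put the current layout into numbers, here is a rough back-of-the-envelope sketch. It assumes that "15 shards and 2 replicas" means a single index with 15 primaries and 2 replica copies of each, and that documents and shard copies are spread evenly; neither assumption is confirmed in the question:

```python
def shard_layout(primaries, replicas, nodes, documents):
    """Rough per-node and per-shard figures for one index."""
    # Each primary has `replicas` extra copies, so 1 + replicas copies in total.
    total_copies = primaries * (1 + replicas)
    return {
        "total_shard_copies": total_copies,
        "avg_copies_per_node": total_copies / nodes,
        "docs_per_primary": documents / primaries,
    }


# Figures taken from the configuration list in the question.
layout = shard_layout(primaries=15, replicas=2, nodes=4, documents=625_000)
for name, value in layout.items():
    print(f"{name}: {value:,.2f}")
```

Under those assumptions the cluster holds 45 shard copies, a little over 11 per node, with roughly 42,000 documents per primary. Whether that is too many still depends on the on-disk size of each shard, which the document count alone does not tell you.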


