Running ES 1.6.0 on Windows 2008 R2, Java 1.8_45, G1GC.
Setup:
I have 4 nodes, each with 32 cores, 128GB RAM and 5TB Fusion IO
ES_HEAP_SIZE=30GB per node
6 "monthly" indexes of 8 shards + 1 replica
1.3 billion docs, 13TB (includes backup)
Average doc size is 4k-12k
Shards are about 150GB each
All doc values
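To put the setup numbers side by side, here is a rough back-of-the-envelope sketch. It assumes shards are allocated evenly across the 4 nodes and that all RAM outside the heap is available to the OS file cache; both are simplifications, not measurements from the cluster.

```python
# Rough sizing arithmetic from the figures listed above.
# Assumptions (not measured): even shard allocation, and all non-heap RAM
# usable as OS file cache.
indexes = 6
primaries_per_index = 8
replicas = 1
nodes = 4
shard_size_gb = 150   # "about 150GB each"
ram_gb = 128
heap_gb = 30


def shard_copies(index_count, primaries, replica_count):
    """Total shard copies (primaries plus replicas) in the cluster."""
    return index_count * primaries * (1 + replica_count)


total_copies = shard_copies(indexes, primaries_per_index, replicas)
copies_per_node = total_copies // nodes
data_per_node_gb = copies_per_node * shard_size_gb
cache_gb = ram_gb - heap_gb

print("shard copies in cluster:", total_copies)
print("shard copies per node:", copies_per_node)
print("data per node (GB):", data_per_node_gb)
print("RAM left for file cache per node (GB):", cache_gb)
print("share of node data that fits in cache: %.1f%%"
      % (100.0 * cache_gb / data_per_node_gb))
```

Under those assumptions only a small fraction of each node's data can sit in the file cache at once, so queries touching cold segments have to go to disk.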
I have 1 fairly big aggregation.
- Pre filtered by 1 single "user id"
- Over 6 month period (all indexes)
- 29 aggregations, none nested
- All aggs are either sum or avg
- It's all doc values no field data cache.
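For illustration, a request of this shape might look roughly like the sketch below. The field names (`user_id`, `timestamp`, `metric_1` ... `metric_29`), the user id value, the dates, and the alternating sum/avg mix are all hypothetical placeholders, not the actual mapping or query. The explicit date range filter is also an assumption; the same period could be covered just by targeting the six monthly indexes.

```python
import json


def build_big_agg(user_id, start, end, fields):
    """Build a 1.x-style search body: one user filter, flat sum/avg aggs."""
    aggs = {}
    for i, field in enumerate(fields):
        # Alternate sum and avg purely for illustration.
        kind = "sum" if i % 2 == 0 else "avg"
        aggs["%s_%s" % (kind, field)] = {kind: {"field": field}}
    return {
        "size": 0,  # only aggregation results are wanted, no hits
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"user_id": user_id}},
                            {"range": {"timestamp": {"gte": start, "lt": end}}},
                        ]
                    }
                }
            }
        },
        "aggs": aggs,
    }


# Hypothetical field names standing in for the 29 real ones.
fields = ["metric_%d" % n for n in range(1, 30)]
body = build_big_agg("user-123", "2015-01-01", "2015-07-01", fields)
print(len(body["aggs"]), "aggregations")
print(json.dumps(body["query"], indent=2))
```

The resulting body would be sent to `_search` across all six monthly indexes.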
When I run this query in Sense it takes about 10 seconds to complete. I guess that's OK.
Then I run a load-balanced "stress" test against all 4 nodes using JMeter, running a single sum agg filtered by user and a random single date. That works fine. But when I then go back and run the big aggregation above, it takes 600 seconds to complete. I see one node's IOPS spiking to 10K while two others are idle and the fourth is running at 2K.
I'm not running any other operations such as bulk indexing. I wait for one test to finish before running the other. Apart from the standard recovery cluster settings I haven't configured anything else; everything is default for searches.
I also see that all nodes have almost 100% memory usage: 30GB for the ES process, and the rest is mapped files.
I have noticed quite erratic performance with queries lately. Queries that used to take 2-3 seconds are now taking 600+ seconds.
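As a sketch of what each stress-test request could contain, the snippet below builds a single-sum body for one user and one randomly chosen day. The real test runs in JMeter; this Python version only illustrates the request shape. The field names, user id, and date window are hypothetical.

```python
import random
from datetime import date, timedelta


def random_day(start, end, rng=random):
    """Pick a random day in [start, end)."""
    span = (end - start).days
    return start + timedelta(days=rng.randrange(span))


def build_single_sum(user_id, day, field="metric_1"):
    """One sum aggregation, filtered by user and a single day."""
    next_day = day + timedelta(days=1)
    return {
        "size": 0,
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"user_id": user_id}},
                            {"range": {"timestamp": {
                                "gte": day.isoformat(),
                                "lt": next_day.isoformat(),
                            }}},
                        ]
                    }
                }
            }
        },
        "aggs": {"total": {"sum": {"field": field}}},
    }


# Hypothetical six-month window and user id.
day = random_day(date(2015, 1, 1), date(2015, 7, 1))
request_body = build_single_sum("user-123", day)
print(day.isoformat(), request_body["aggs"])
```

Because each request picks a different day, a long run of them touches many different segments across the monthly indexes, even though each individual request is small.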
I looked at one node:
Working Set: 127GB
Memory Primary: 22GB
Commit Size: 41GB
Paged Pool: 7GB
Non Paged: 1GB
Page Faults: 450,000,000
PF Deltas: 15K