I could use some help investigating a general performance issue on my ES 2.1.0 cluster.
First, my problem: simple queries such as match_all (default size: 10, no ordering) consistently take ~2250ms, even on subsequent executions.
This is on a single index with ~240,000,000 fairly small documents (the index is ~50GB overall).
This seems to be a general issue, which of course slows down all other kinds of queries and probably aggregations too.
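For reference, the measurement can be reproduced with a short script. This is only a minimal sketch: the URL and index name are placeholders, not the real cluster values, and it assumes the node is reachable over plain HTTP. It prints both the `took` value reported by Elasticsearch and the wall-clock time seen by the client, so server-side time can be told apart from network or client overhead.

```python
import json
import time
import urllib.request

# Placeholders for illustration only; replace with the real host and index.
ES_URL = "http://localhost:9200"
INDEX = "my-weekly-index"


def build_match_all(size=10):
    """Return the request body for a plain match_all query."""
    return {"query": {"match_all": {}}, "size": size}


def timed_search(es_url, index, body):
    """Run one search; return (took in ms as reported by ES, wall-clock ms)."""
    req = urllib.request.Request(
        "{}/{}/_search".format(es_url, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    wall_ms = (time.perf_counter() - start) * 1000.0
    return result["took"], wall_ms


def run(repeats=5):
    """Repeat the same query to see whether later runs get any faster."""
    for i in range(repeats):
        took, wall = timed_search(ES_URL, INDEX, build_match_all())
        print("run {}: took={}ms wall={:.0f}ms".format(i + 1, took, wall))
```

Calling `run()` against a live node issues the query five times in a row. If `took` and the wall-clock time are both around 2250ms on every run, the time is being spent inside Elasticsearch rather than on the network or in the client.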
Here are some details about my setup:
- Running Elasticsearch 2.1.0 on CentOS 6.7
- 4 Node Cluster (bare metal)
- 16GB heap per instance with plenty of RAM left for disk IO buffering
- Time-based indices (weekly) with each having ~240,000,000 documents and 50GB per index
- Each index has 2 shards with one replica (no routing)
- Configuration is pretty much default apart from bootstrap.mlockall: true and some recovery options
- One storage path per instance (RAID10 spinning disks with pretty good read performance: 700-800 MB/s for sequential reads)
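A quick sketch of what this layout implies per shard. It assumes the ~50GB is the size of the primaries only (the figure above does not say) and that shard copies are spread evenly over the nodes; both are assumptions.

```python
# Figures taken from the setup list above.
DOCS_PER_INDEX = 240_000_000
INDEX_SIZE_GB = 50      # assumed to be primary data only
PRIMARY_SHARDS = 2
REPLICAS = 1
NODES = 4

docs_per_shard = DOCS_PER_INDEX // PRIMARY_SHARDS   # documents in each primary
gb_per_shard = INDEX_SIZE_GB / PRIMARY_SHARDS       # size of each primary in GB
shard_copies = PRIMARY_SHARDS * (1 + REPLICAS)      # primaries plus replicas
copies_per_node = shard_copies / NODES              # assumes an even spread

print(docs_per_shard, gb_per_shard, shard_copies, copies_per_node)
```

A search against one index runs on one copy of each shard, so under these assumptions every match_all touches 2 shards of roughly 120 million documents each.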
While a match_all query runs there is no noticeable disk activity (probably because the data is cached either in ES or in the OS cache for subsequent reads).
I also cannot detect any spike in CPU usage.
Heap utilization is quite stable at ~50-60% for all instances.
I have the HQ plugin installed, and the only things that do not look healthy there are the following (from Index Activity):
- Search - Query: 67.41ms [query_time_in_millis / query_total]
- Search - Fetch: 29.75ms [fetch_time_in_millis / fetch_total]
- Get - Total: 10.32ms [get.time_in_millis / get.total]
- Get - Exists: 11.88ms [get.exists_time_in_millis / get.exists_total]
- Refresh: 3001.68ms [refresh.total_time_in_millis / refresh.total]
- Flush: 4371.25ms [flush.total_time_in_millis / flush.total]
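Each figure above is a cumulative time divided by an operation count, so it is an average over the whole lifetime of the counters. A minimal sketch of the same calculation, assuming the input is a `total` block shaped like the one returned by the index stats API (`GET /<index>/_stats`); the function names and result keys are made up for illustration, and this is not necessarily how HQ computes its numbers:

```python
def average_ms(section, time_key, count_key):
    """Average milliseconds per operation; None if nothing was counted."""
    count = section.get(count_key, 0)
    if not count:
        return None
    return section[time_key] / count


def activity_averages(total):
    """Compute per-operation averages from an index stats 'total' block."""
    search, get = total["search"], total["get"]
    refresh, flush = total["refresh"], total["flush"]
    return {
        "search_query": average_ms(search, "query_time_in_millis", "query_total"),
        "search_fetch": average_ms(search, "fetch_time_in_millis", "fetch_total"),
        "get_total": average_ms(get, "time_in_millis", "total"),
        "get_exists": average_ms(get, "exists_time_in_millis", "exists_total"),
        "refresh": average_ms(refresh, "total_time_in_millis", "total"),
        "flush": average_ms(flush, "total_time_in_millis", "total"),
    }
```

Because these are lifetime averages, the refresh and flush figures also include everything that happened during the weekly bulk indexing, so they do not necessarily describe the cluster while it is only serving searches.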
To add: I only index once a week and am okay with the indexing performance (I get 5000 docs per second with a single thread).
From the numbers this looks like a disk IO problem, but all my disk IO tests look very good (I tested with hdparm -tT and by reading a chunk of data).
Also, as I mentioned, this affects queries that are run multiple times and should be served from cache.
Can anyone help me to further investigate this?
Are there any other useful ES metrics that I could look at?