Poor performance and a lot of GC overhead

Hi,

we are running elasticsearch with the following spec:

1 node on a hardware server
16GB of RAM
2GB assigned for heap
8 CPU cores (Intel Xeon CPU E5620 @ 2.40GHz)
HDD disks
Elasticsearch 5.6.3
JVM 1.8.0_171

we have around 20 indices, some daily, some weekly, some monthly.
each index has the default 5 shards and 1 replica.
each index might get around 1M documents per day at most.
indexing is happening constantly, and every minute or so there are also search operations running.

the spec described above is what we have to work with :slight_smile: and I know it sucks... but nothing much to do there.
the reason that there are only 2GB allocated for heap is that there are other components running on this machine and they also need resources...

the immediate issue we can pinpoint is that we are seeing a lot of GC overhead alerts in elasticsearch logs such as these:

[gc][999] overhead, spent [842ms] collecting in the last [1.4s]
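
in case it's relevant, heap usage and GC stats can also be pulled from the node stats API, e.g. (assuming the node listens on the default localhost:9200):

# per-node JVM heap usage plus GC collection counts and times
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

# quick overview of heap pressure per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'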

so my questions are as follows:

  1. what might be the cause of this GC overhead? And is it an issue to worry about (it seems to me that it is...)?
  2. what can we do to properly measure the actual capacity of this server? I mean, to understand what performance I can expect from elasticsearch given these constraints?
  3. what would be a better sharding strategy given the above spec (we have only 1 server in this setup)?

thanks in advance!

It sounds like you may have too many shards, especially given the small heap size. Read this blog post for a discussion around shards and sharding practices.

Until you can revise your sharding strategy, I would recommend increasing the heap size.
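
For example, if you could spare 4 GB for Elasticsearch (just an illustration, size it to what the machine can actually afford), you would set both values in config/jvm.options and restart the node:

# config/jvm.options -- keep the minimum and maximum heap equal
-Xms4g
-Xmx4g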

thanks, so I was actually thinking of reducing the shard count to 1 primary and 0 replicas. Since this is only a single node, I don't see a point in having more than a single shard. But is that true, or am I missing something?

With a single node there is no point in having replicas configured, as Elasticsearch will not be able to allocate them anyway. How many shards do you have in the cluster? What is the average shard size?

as I mentioned above, I have the default 5 shards per index. but the indices are rather small. none of them go beyond 1 GB in size...

That is wasteful, but you may also have too many indices. What is the output of the cluster stats API?
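
Assuming the default HTTP port, something like this will show the totals:

# cluster-wide shard count, index count and total store size
curl -s 'localhost:9200/_cluster/stats?human&pretty'

# per-index shard counts and sizes
curl -s 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'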

sorry for the weird way of sharing the output... but I just don't have internet access (or copy/paste, for that matter) on the machine at the moment... but here it is:

62 shards for 407 MB of data is far too much. That data would easily fit within a single shard.

I know.
this is a test machine. on production there will be more data.
but still it would be less than 1-2GB per index.
so that is why I was thinking to have just 1 shard per index.
(the separation to indices is mandatory, since those are different types of data from different applications).

are there any things to consider against using a single shard per index (other than HA, etc., since we already don't have that when running on a single node)?

That would be recommended. You may also want to consider having each index cover a longer retention period.
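
A minimal sketch of an index template for that could look like the following (on 5.6 the match field is still called template rather than index_patterns, and the metrics-* pattern is just a placeholder for your own index naming scheme):

curl -s -XPUT 'localhost:9200/_template/single_shard_defaults' -H 'Content-Type: application/json' -d '
{
  "template": "metrics-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'

Note that a template only affects indices created after it is in place; existing indices keep their current shard count.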

No.


thank you very much.

what about my other question at the beginning of the thread?
how can I effectively benchmark this system with real-life data, so that I can understand our limits with the given hardware?
any pointers on that one?


Have a look at the following resources:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right

https://www.elastic.co/elasticon/conf/2018/sf/the-seven-deadly-sins-of-elasticsearch-benchmarking

thank you very much.

Hi again,
so I've been doing some benchmarking, and I do think it's possible that our HDD is actually a bottleneck (maybe the main one) in this case.
I'm getting read speeds of around 60 MB/s when simply using hdparm, and something similar for write speed when using dd with 8k blocks. The results actually vary with different block counts: with 10k blocks (an 80 MB test file in total) the speed is around 250 MB/s, with 100k blocks it goes down to around 100 MB/s, and with 1000k blocks (an 8 GB file) it goes back up to around 250 MB/s. So I'm not sure I fully understand what's going on here; maybe the testing is faulty at some point, but the write speed never goes below 100 MB/s, that's for sure.
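
to be concrete, the tests were roughly along these lines (device name and file path are illustrative, not the exact ones I used):

# sequential buffered read test
hdparm -t /dev/sda

# sequential write test, 8k blocks, 100k blocks = ~800 MB file
dd if=/dev/zero of=/tmp/ddtest bs=8k count=100000 oflag=direct
rm /tmp/ddtest

I added oflag=direct here on the suspicion that part of the variance in my earlier numbers comes from the page cache; conv=fdatasync would be another option.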

anyhow, the question I'm struggling with is how do I find out what is the real bottleneck for elasticsearch?

I've run esrally with various tracks.
for example, I've run this track:

esrally --distribution-version=5.6.3 --track noaa --car 4gheap --challenge=append-no-conflicts --include-tasks="index" --track-params="bulk_size:10000,ingest_percentage:5"

and the results look like this:

can anyone give me a lead on what I can deduce from these and other metrics?
what is the capacity that I can count on, with the current setup?
what can I look to improve? what's most important to improve?

many thanks!

Elasticsearch performs a good amount of random-access reads and writes, so comparing its throughput to a test that reads and/or writes quite large blocks may not be very accurate. I would recommend monitoring I/O performance, e.g. using iostat, while your cluster is running to see what I/O latencies you get and how much iowait you are experiencing. This will give a good indication of whether the disk is the bottleneck or not (I suspect it likely is).
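
For example, something along these lines while the cluster is under its normal load:

# extended device statistics every 5 seconds
iostat -x 5

Watch await / w_await (average request latency in ms) and %util for the data disk, plus %iowait in the avg-cpu line; the exact column names vary a bit between sysstat versions.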

thank you. I have done some monitoring with iostat and other tools during normal operation of the cluster, and they all indicate that the CPU is experiencing some level of iowait: sometimes around 20%, sometimes less, and in some extreme cases I've seen even 70%.

but assuming this is the hardware I have to deal with, how can I best benchmark the cluster with esrally using my own data? I'm not certain that the existing tracks accurately represent my situation.

any pointers there?

thanks!

The best way is to create a custom Rally track that simulates your use case. Exactly how to best do that depends on the use-case you want to simulate. Can you provide some additional details about your use-case and the types of data you have?
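
To give a rough idea of what is involved: a custom track is essentially a track.json file plus a file of JSON documents to bulk-index (and usually an index.json with your settings and mappings). A very rough sketch, with all names as placeholders and the exact field names to be double-checked against the documentation of the Rally version you have installed, could look like this:

{
  "version": 2,
  "description": "network metrics ingest test",
  "indices": [
    { "name": "metrics-test", "body": "index.json" }
  ],
  "corpora": [
    {
      "name": "metrics-docs",
      "documents": [
        {
          "source-file": "documents.json",
          "document-count": 1000000,
          "uncompressed-bytes": 500000000
        }
      ]
    }
  ],
  "schedule": [
    {
      "operation": { "operation-type": "bulk", "bulk-size": 10000 },
      "clients": 2
    }
  ]
}

The document-count needs to match the number of documents in documents.json, and you could later add a search operation to the schedule to mimic your dashboard queries.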

well, the main use case is ingestion of metric data from multiple network devices: some security data, some network traffic data, etc. Most of the fields are either strings or various numeric fields.
what other additional details can I provide?

[Edit]:
and regarding the other side of things, this data is queried by custom dashboards (not Kibana) that are used as near-live monitors and generate some alerts from this data.

You could have a look at the rally-eventdata-track, which simulates indexing and querying of nginx access logs. You might be able to adapt this or use it for inspiration when creating your own track. You may even get valuable information from running it as is even though it does not accurately reflect your data and access patterns.
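
If you want to try it, I believe it can be cloned from GitHub (elastic/rally-eventdata-track) and pointed at with Rally's --track-path option, roughly like this (paths are illustrative; the repository README describes the available challenges and the more standard way of registering it as a track repository):

git clone https://github.com/elastic/rally-eventdata-track.git
esrally --distribution-version=5.6.3 --track-path=./rally-eventdata-track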

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.