Is this the right way to do a performance evaluation?

Hi,
I am doing a performance evaluation with 2 queries. The purpose is to figure out how many docs one node can process.
But I'm not sure whether I'm testing the right way, and I'm not sure the conclusion I reached is correct, because the numbers seem too small.

Below is the hardware/software environment I'm using.

CPU: 8 cores
Memory: 16 GB
OS: CentOS 7
Elasticsearch 2.1
8 GB memory for the ES heap
no swap
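
For reference, the heap and no-swap settings boil down to roughly the following (a sketch; the file paths assume the RPM install on CentOS, and the mlockall line is an optional extra guard rather than something required):

# /etc/sysconfig/elasticsearch: give ES half of the 16 GB of RAM
ES_HEAP_SIZE=8g

# config/elasticsearch.yml: lock the heap in memory so it cannot be swapped out (ES 2.x setting)
bootstrap.mlockall: true

# swap disabled at the OS level
swapoff -a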

The requirement is that

every query's response time is less than 3 seconds.

The mapping I use is as below:

"@timestamp": { "index": "not_analyzed", "type": "date","format": "epoch_second"},
"ROUTER" : { "type" : "integer", "index" : "not_analyzed"},
"IN_IFACE" : { "type" : "integer", "index" : "not_analyzed"},
"OUT_IFACE" : { "type" : "integer", "index" : "not_analyzed"},
"SRC_MAC" : { "type" : "long", "index" : "not_analyzed"},
"DST_MAC" : { "type" : "long", "index" : "not_analyzed"},
"IP_PAIRE" : { "type" : "integer", "index" : "not_analyzed"},
"SRC_IP" : { "type" : "ip", "index" : "not_analyzed"},
"DST_IP" : { "type" : "ip", "index" : "not_analyzed"},
"SRC_PORT" : { "type" : "integer", "index" : "not_analyzed"},
"DST_PORT" : { "type" : "integer", "index" : "not_analyzed"},
"VLAN" : { "type" : "integer", "index" : "not_analyzed" },
"PROTOCOL" : { "type" : "string", "index" : "not_analyzed" },
"BYTES" : { "type" : "long", "index" : "not_analyzed" },
"PACKETS" : { "type" : "long", "index" : "not_analyzed" }

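For completeness, these fields sit inside an ordinary create-index call. A sketch of how the single-shard test index can be created (only a few of the fields are shown, the rest follow the same pattern; 0 replicas is just my assumption for a single-node test):

curl -XPUT "127.0.0.1:9200/sflow_1454256000" -d '
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 },
  "mappings": {
    "sflow": {
      "properties": {
        "@timestamp": { "type": "date", "index": "not_analyzed", "format": "epoch_second" },
        "SRC_IP":  { "type": "ip",   "index": "not_analyzed" },
        "DST_IP":  { "type": "ip",   "index": "not_analyzed" },
        "SRC_MAC": { "type": "long", "index": "not_analyzed" },
        "BYTES":   { "type": "long", "index": "not_analyzed" }
      }
    }
  }
}'
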
The queries are:

curl -XPOST "127.0.0.1:9200/sflow_1454256000/sflow/_search?pretty" -d '
{
	"size":0,
	"query": {
		"filtered":{
			"filter":{
				"range": { "@timestamp": {"gte":0,"lte":9999999999}}
			}
		}
	},
  "aggs":
	{
	  "SRC_DST":
		{
		  "terms": {"script": "[doc.SRC_IP.value, doc.DST_IP.value].join(\"-\")","size": 2,"shard_size":0, "order": {"sum_bits": "desc"}},
		  "aggs": { "sum_bits": { "sum": {"field": "BYTES"} } }
		}
	}
}'

curl -XPOST "127.0.0.1:9200/sflow_1454256000/sflow/_search?pretty" -d '
{
	"size":0,
	"query": {
		"filtered":{
			"filter":{
				"range": { "@timestamp": { "gte":0,"lte":9999999999}}
			}
		}
	},
  "aggs":
	{
	  "SRC_MAC":
		{
		  "terms": {"field":"SRC_MAC", "size": 3,"shard_size":0, "order": {"sum_bits": "desc"}},
		  "aggs": { "sum_bits": { "sum": {"field": "BYTES"} } }
		}
	}
}'

Based on the official docs, you need to find out the max number of docs one shard can hold, then find out how many shards your machine can hold.

Below are my steps:

First, I use a single index with only one shard on my node.

When the shard has 400 000 docs, the response time is 3s.

Then I know that in my environment,

the limit for one shard is 400 000 docs if I want the response time within 3s.

Based on this, I use multiple shards to run those two queries, with each shard containing 400 000 docs. The result is:

When the number of shards is 2 (each shard has 400 000 docs), the response time is 3s.

Then I reach the conclusion that

the limit for my machine is 2 shards with 400 000 docs each (800 000 docs in total), if I want the response time within 3s.


So, could you help point out whether I'm doing the test the wrong way?

Also, I do not have much experience using ES, so is this a reasonable result given my hardware/software environment and the 3s response requirement?

------------------update-------------

I actually did another test as well. I used 5 shards containing 900 000 docs in total (i.e. 180 000 docs per shard). The response time was actually better than with 2 shards and 800 000 docs, so I think this proves my previous conclusion is wrong. But I'm doing the test based on the official docs' advice. Can anyone tell me why?

The official documentation page that I'm basing my test on: https://www.elastic.co/guide/en/elasticsearch/guide/current/capacity-planning.html

It is important that you run tests on realistic data and queries. Are you always going to run queries for a fixed set of time or will the time period vary? If you test with a single time interval, this will affect how efficiently Elasticsearch can cache data and may skew results.

You can also look at this video from Elastic{ON} around cluster sizing.

Hi Christian,

Thanks for your reply. I will run the query over a fixed length of time - the last 24 hours. But the start and end times of that 24-hour window will change every minute.

And my question is not really about the time range. I am just trying to figure out how many docs my machine can process, and I'm using my own approach to do the estimation. I just do not know whether my approach is correct, and I'm not sure the result I got makes sense, because it seems too small.
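
In query form, the filter for that rolling window would look roughly like the sketch below (same aggregation as my second example; I believe date math such as now-24h works here even though @timestamp is mapped as epoch_second, but I have not verified that in detail):

curl -XPOST "127.0.0.1:9200/sflow_1454256000/sflow/_search?pretty" -d '
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": { "@timestamp": { "gte": "now-24h", "lte": "now" } }
      }
    }
  },
  "aggs": {
    "SRC_MAC": {
      "terms": { "field": "SRC_MAC", "size": 3, "order": { "sum_bits": "desc" } },
      "aggs": { "sum_bits": { "sum": { "field": "BYTES" } } }
    }
  }
}'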

400000 documents per shard does seem very little for reasonably small, structured records, especially with what looks like reasonably light queries. Make sure you run your queries with different 24 hour time intervals like you would in real life. Don't repeat a query with a set interval over and over.

Gradually increase the number of documents in the shard and record the average and maximum query latency so you get a graph of query latency as a function of number of documents in the shard.
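
Something along these lines would do it - a rough sketch only, using your second example query and reading the server-side "took" value (in milliseconds) from each response; the random offset is just one way of making every run hit a different 24 hour window:

#!/bin/bash
# Sketch: run the query N times, each time over a different 24 hour window,
# and report the average and maximum server-side latency ("took", in ms).
# Assumes the index actually contains data inside the chosen windows.
RUNS=20
TOTAL=0
MAX=0
for i in $(seq 1 $RUNS); do
  # shift the window back by a random number of minutes (0 - 24h)
  OFFSET=$(( (RANDOM % 1440) * 60 ))
  END=$(( $(date +%s) - OFFSET ))
  START=$(( END - 86400 ))
  TOOK=$(curl -s -XPOST "127.0.0.1:9200/sflow_1454256000/sflow/_search" -d '
  {
    "size": 0,
    "query": { "filtered": { "filter": {
      "range": { "@timestamp": { "gte": '"$START"', "lte": '"$END"' } } } } },
    "aggs": { "SRC_MAC": {
      "terms": { "field": "SRC_MAC", "size": 3, "order": { "sum_bits": "desc" } },
      "aggs": { "sum_bits": { "sum": { "field": "BYTES" } } } } }
  }' | grep -o '"took":[0-9]*' | cut -d: -f2)
  TOTAL=$(( TOTAL + TOOK ))
  if [ "$TOOK" -gt "$MAX" ]; then MAX=$TOOK; fi
done
echo "runs=$RUNS avg_ms=$(( TOTAL / RUNS )) max_ms=$MAX"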

How are you running the benchmark? How are you measuring the latency?

Hi Christian,

The way I run my test is as below:

First, I increase the number of docs in one shard gradually. Every time I increase the number, I use the same query to test the response time. It turns out 400 000 is the limit for my machine (if I want the response time within 3 seconds).

Then I increase the number of shards and make sure that each shard contains 400 000 docs. Every time I increase the number of shards, I use the same query to check the response time. I found that the maximum is 2 shards if I want the response time within 3 seconds.

So the conclusion: the limit for my machine is 2 shards, each with no more than 400 000 documents. I do not know if I am doing the test the right way. That is why I'm asking here.

That sounds correct.

Hi Christian

Glad to know that I'm doing the test the right way :) But unfortunately, the result does not seem correct.
The first step:
I found the limit for one shard is 400 000 docs.
The second step:
I found the limit for the number of shards is 2, so the doc count for my machine is 800 000.

But after all this, I used 3 shards, each containing 300 000 docs (900 000 docs in total), and the performance is better than 2 * 400 000.

Do you have any idea why this may happen?

Do you have anything else running on the machine hosting Elasticsearch that could affect performance and available resources?

Hi Christian

Nothing else; I installed a fresh new machine to do the test. Also, I noticed there are things called ES-Hadoop and ES-Spark.
I do not know what they are, but they seem big-data related. Does that mean pure ES is not a good fit for the data analysis role?

Regards
Mingwei

I just noticed that you are using a script in one of your queries. Scripts can be slow and use a lot of CPU, especially for large amounts of data. Do you see a big difference in latency between your 2 example queries? If so, create the field you are scripting at index time so you don't need to do it at query time (or modify the query as a nested aggregation based on the two fields).
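
For the nested-aggregation variant, something like the sketch below should give the top destination IPs per source IP without any scripting (the sizes are placeholders; note the result is grouped per source IP rather than returned as one flat "SRC-DST" key):

curl -XPOST "127.0.0.1:9200/sflow_1454256000/sflow/_search?pretty" -d '
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": { "@timestamp": { "gte": 0, "lte": 9999999999 } }
      }
    }
  },
  "aggs": {
    "SRC": {
      "terms": { "field": "SRC_IP", "size": 10, "order": { "sum_bits": "desc" } },
      "aggs": {
        "sum_bits": { "sum": { "field": "BYTES" } },
        "DST": {
          "terms": { "field": "DST_IP", "size": 2, "order": { "sum_bits": "desc" } },
          "aggs": { "sum_bits": { "sum": { "field": "BYTES" } } }
        }
      }
    }
  }
}'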

Hi Christian

The two queries do not differ much - maybe 4-5 seconds. I know a script will slow down my query, but the query without a script is also very slow. If I increase the doc count to 20 000 000, it may never return.
I think even MySQL would be faster than this.

Regards
Mingwei

What type of hardware are you running on? How many CPU cores? How much RAM? What type of storage?

It is a virtual machine, but even if a virtual machine has poor performance, it cannot be this bad. Right?

8-core CPU
16 GB memory

Regards
Mingwei

How much heap space have you assigned? How large are your indices/shards? Which version of Elasticsearch are you using?

CPU: 8 cores
Memory: 16 GB
ES Java heap: 8 GB
400 000 docs per shard
Only one index with one shard in the cluster; the cluster contains only one machine.
Elasticsearch 2.1, as mentioned above.