Performance issues

While testing with a large amount of data I am reaching the point of first
performance issues. The initial situation is as follows:

  • one ES node with 8GB heap assigned
  • one index with 110,000,000 documents
  • 78,000,000 docs assigned to a single _type
  • histogram data and a sub-type with cardinality 20
  • a histogram query using aggregation over the sub-type runs fast (< 3 seconds)
  • a histogram over the whole index/_type but ignoring the sub-type runs up
    to 50 seconds (cold index); on a warm index the same query takes 10-12 seconds
  • there are currently no writes to the index and the index is optimized
    (this may change in the future)
  • only one shard, of size 30GB
  • one index per month
  • data for about 3-4 months into the past
  • Java 1.7u55 and ES 1.4.1

My requirements:

  • queries should return in < 3 seconds
  • one index per month (or possibly per week)
  • new data is continuously added to the most recent index

Questions:

  1. How can I find out the bottleneck of this query?
  2. What are the tuning options?
  3. Over time there are serious heap issues: heap usage grows and much time
    is spent in parallel and full GC. After a restart the used heap is about
    3GB and several GCs hold it at this level. But over hours the usage grows
    towards 8GB and a full GC is not able to clean up anymore. A restart is
    required. Why?

regards,
markus

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/fac725cd-d6f6-4fc2-b274-4af374695d82%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Can you give examples of the documents and the queries you use?

Jörg

> Can you give examples of the documents and the queries you use?

Docs look like this:

{
  "duration": "74",
  "caller": "128287",
  "session_id": "12312",
  "id": "901",
  "position": "1",
  "parameters": "ffff",
  "operation": "export",
  "timestamp": "2014-01-15T14:17:06.245+01:00"
}

And this is the query:

{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "timestamp": {
                  "gte": "2013-12-31T23:00:00.000Z",
                  "lt": "2014-12-31T23:00:00.000Z"
                }
              }
            },
            {
              "term": {
                "_type": "export-op"
              }
            }
          ]
        }
      }
    }
  },
  "aggregations": {
    "duration-not-empty": {
      "filter": {
        "bool": {
          "must": [
            {
              "not": {
                "filter": {
                  "missing": {
                    "field": "duration"
                  }
                }
              }
            },
            {
              "range": {
                "duration": {
                  "gt": "0"
                }
              }
            }
          ]
        }
      },
      "aggregations": {
        "durations": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "1M"
          },
          "aggregations": {
            "duration-stats": {
              "extended_stats": {
                "field": "duration"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

The extended_stats aggregation seems to be expensive; using min/max/avg
instead saves some seconds, e.g. via a plain stats aggregation as sketched
below. Also, excluding _type from the query has no effect, but I think that
restriction is already applied via the URL, too.
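
A minimal sketch of the cheaper variant (stats returns count/min/max/avg/sum,
while extended_stats additionally computes variance and standard deviation):

"duration-stats": {
  "stats": {
    "field": "duration"
  }
}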


How many docs do you expect your histogram will aggregate? Most of your 111M? If so, with just one shard and one thread doing the work, it is bound to be pretty slow.

Also, have you tried moving your not-missing filter out of the agg into the query filter, and just using > 0 instead of not missing? See the sketch below. Also, reducing the precision of the timestamp could possibly help.
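
A sketch of the restructured query, with field names taken from your example.
The range on duration replaces the missing filter, since documents with a
missing duration do not match the range anyway:

{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "timestamp": {
                  "gte": "2013-12-31T23:00:00.000Z",
                  "lt": "2014-12-31T23:00:00.000Z"
                }
              }
            },
            { "term": { "_type": "export-op" } },
            { "range": { "duration": { "gt": 0 } } }
          ]
        }
      }
    }
  },
  "aggregations": {
    "durations": {
      "date_histogram": { "field": "timestamp", "interval": "1M" },
      "aggregations": {
        "duration-stats": { "extended_stats": { "field": "duration" } }
      }
    }
  },
  "size": 0
}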


> How many docs do you expect your histogram will aggregate? Most of your
> 111M? If so, with just one shard and one thread doing the work, it is bound
> to be pretty slow.

Expected aggregated records are about 78 million. After reindexing with 6
shards per index, the query time dropped by ~50%. The result was surprising:
someone wrote that several shards on a single disk have less effect because
they share the same I/O. But I had not considered the threading effect. Are
there recommendations about shard size vs. shard count?

> Also, have you tried moving your not-missing filter out of the agg into
> the query filter, and just using > 0 instead of not missing? Also,
> reducing the precision of the timestamp could possibly help.

Removing the missing filter from the query gives more speed. I cannot
remember why I used this missing filter. In the current test setup the
result set is identical even when the missing filter is used. Is there a
need to use the missing filter here? What happens if the field 'duration' is
missing or null in some records?

What is your recommendation for the timestamp? Should I replace

2014-01-15T14:17:06.245+01:00

with less accuracy, in minutes:

2014-01-15T14:17:00.000+01:00

? Would this affect the field data cache?


The missing filter is fairly costly. I do not believe you need it, as > 0
should take care of excluding nulls.

Only one thread can act on one shard at a time, so the only way to
parallelize your query is to split it onto more shards and let multiple
threads do parallel work on smaller shards. So if your server has, say, 16
cores, you may consider roughly the same number of shards (maybe a bit
fewer), set at index creation time as sketched below.
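
A minimal sketch (index name and counts are placeholders; note the shard
count cannot be changed after the index is created, so it has to be set when
each monthly index is created):

curl -XPUT 'localhost:9200/myindex-2014-12' -d '{
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 0
  }
}'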
If it is I/O bound rather than CPU bound, more memory for OS-level caching
(and probably bumping up the ES heap as well) could help, as could faster
storage - SSDs work great with ES - and at some point you may need several
nodes.

I believe reducing the date precision would decrease the number of unique
terms in the index and may help with the histogram. Say, if your histogram
needs date precision only and not time, I would not even index the time part
(note you may use a multi-field mapping if you need both the precise and the
date-rounded timestamp).


And if you provide plenty of memory for caching of filters and fields (an 8G
heap for 111M records with aggregations does not seem enough), plus OS
memory for caching data files (and/or use SSDs), parallel calculation on
multiple shards should provide a lot better improvement than 50%: maybe not
exactly linear, but at least 3-4x for going from 1 to 6 shards, in my
opinion, assuming you have more than 6 cores. The memory pressure you
mention needs to be removed too. Analyze the stats, but I suspect 8G is just
not enough in your case.

It would be interesting to see if aggregating on a timestamp rounded to the
date would improve things on its own, along the lines of the sketch below.
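
A minimal sketch, assuming a hypothetical day-rounded field named
timestamp_day is indexed alongside the precise one:

{
  "size": 0,
  "aggregations": {
    "durations": {
      "date_histogram": {
        "field": "timestamp_day",
        "interval": "1M"
      }
    }
  }
}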


> I believe reducing the date precision would decrease the number of unique
> terms in the index and may help with the histogram. Say, if your histogram
> needs date precision only and not time, I would not even index the time
> part (note you may use a multi-field mapping if you need both the precise
> and the date-rounded timestamp).

The histogram should be zoomable: by day over the last 90 days, by hour over
the last 72 hours, and by minute over the last 6 hours. Would you suggest
storing timestamps of different precision (day/hour/minute) for performance
reasons? I understand the concept of a multi-field. But in which way are the
values filled? Is there a way to store only the exact timestamp and derive
the less precise timestamps from it? Or do the less precise timestamps have
to be contained in the indexed document?
Do you have any examples of timestamps in multi-fields?

regards,
markus


You could try to specify multiple fields in a multi-field mapping with
string type (or type date) and different formats. I am not sure if it is
going to work, though; I typically do this kind of stuff in the actual data.
Maybe something like:

"timestamp": {
  "type": "date",
  "format": "date",
  "fields": {
    "year": {
      "type": "string",
      "format": "yyyy"
    },
    "year-month": {
      "type": "string",
      "format": "yyyy-MM"
    }
  }
}

(or maybe type date in the sub-fields, if string does not invoke the date
formatter)

I would do it in the data (it makes the documents bigger, but gives you
complete freedom to define your dimensions):

{
  ...
  "callStartTime": {
    "timestamp": "full timestamp",
    "time": "rounded to seconds",
    "weekOfMonth": 3,
    "month": 11,
    "year": 2014
  }
}

Then you can choose to not index the timestamp at all and index the rest.

If your histogram is based on "absolute" dates/times, not on dates/times relative to today, you could use a terms aggregation instead of ranges, which should be faster, potentially much faster, as in the sketch below.
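
A sketch of a terms aggregation over the pre-computed month dimension (the
callStartTime fields are the hypothetical ones from the example above):

{
  "size": 0,
  "aggregations": {
    "per-month": {
      "terms": {
        "field": "callStartTime.month",
        "size": 12
      },
      "aggregations": {
        "duration-stats": {
          "stats": { "field": "duration" }
        }
      }
    }
  }
}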
