Performance issues

While testing with a large amount of data I am reaching the point of first
performance issues. The initial situation is as follows:

  • one ES node with 8GB heap assigned
  • one index with 110,000,000 documents
  • 78,000,000 docs assigned to a single _type
  • histogram data and a sub-type with cardinality 20
  • a histogram query using aggregation over the sub-type runs fast (< 3 seconds)
  • a histogram over the whole index/_type but ignoring the sub-type runs up
    to 50 seconds (cold index); on a warm index the same query takes 10-12 seconds
  • there are currently no writes to the index and the index is optimized
    (this may change in the future)
  • only one shard, of size 30GB
  • one index per month
  • data for about 3-4 months into the past
  • Java 1.7u55 and ES 1.4.1

My requirements:

  • queries should return in < 3 seconds
  • one index per month (or possibly per week)
  • new data is continuously added to the most recent index

Questions:

  1. How can I find out the bottleneck of this query?
  2. What are the tuning options?
  3. Over time there are serious heap issues: heap usage grows and much time
    is spent in parallel and full GC. After a restart the used heap is about
    3GB and several GCs hold it at this level. But over hours the usage grows
    towards 8GB and a full GC is not able to clean up anymore. A restart is
    required. Why?

regards,
markus

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/fac725cd-d6f6-4fc2-b274-4af374695d82%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Can you give examples of the documents and the queries you use?

Jörg

> Can you give examples of the documents and the queries you use?

Docs look like this:

{
  "duration": "74",
  "caller": "128287",
  "session_id": "12312",
  "id": "901",
  "position": "1",
  "parameters": "ffff",
  "operation": "export",
  "timestamp": "2014-01-15T14:17:06.245+01:00"
}

And this is the query:

{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "timestamp": {
                  "gte": "2013-12-31T23:00:00.000Z",
                  "lt": "2014-12-31T23:00:00.000Z"
                }
              }
            },
            {
              "term": {
                "_type": "export-op"
              }
            }
          ]
        }
      }
    }
  },
  "aggregations": {
    "duration-not-empty": {
      "filter": {
        "bool": {
          "must": [
            {
              "not": {
                "filter": {
                  "missing": {
                    "field": "duration"
                  }
                }
              }
            },
            {
              "range": {
                "duration": {
                  "gt": "0"
                }
              }
            }
          ]
        }
      },
      "aggregations": {
        "durations": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "1M"
          },
          "aggregations": {
            "duration-stats": {
              "extended_stats": {
                "field": "duration"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

The extended_stats aggregation seems to be expensive; using min/max/avg
instead saves some seconds, e.g. via a plain stats aggregation as sketched
below. Also, excluding _type from the query has no effect, but I think that
restriction is already applied via the URL, too.
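
A minimal sketch of the cheaper variant (stats returns count/min/max/avg/sum,
while extended_stats additionally computes variance and standard deviation):

"duration-stats": {
  "stats": {
    "field": "duration"
  }
}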


How many docs do you expect your histogram will aggregate? Most of your 111M? If so, with just one shard and one thread doing the work, it is bound to be pretty slow.

Also, have you tried moving your not-missing filter out of the agg into the query filter, and just using > 0 instead of not missing? See the sketch below. Also, reducing the precision of the timestamp could possibly help.
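
A sketch of the restructured query, with field names taken from your example.
The range on duration replaces the missing filter, since documents with a
missing duration do not match the range anyway:

{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "timestamp": {
                  "gte": "2013-12-31T23:00:00.000Z",
                  "lt": "2014-12-31T23:00:00.000Z"
                }
              }
            },
            { "term": { "_type": "export-op" } },
            { "range": { "duration": { "gt": 0 } } }
          ]
        }
      }
    }
  },
  "aggregations": {
    "durations": {
      "date_histogram": { "field": "timestamp", "interval": "1M" },
      "aggregations": {
        "duration-stats": { "extended_stats": { "field": "duration" } }
      }
    }
  },
  "size": 0
}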


> How many docs do you expect your histogram will aggregate? Most of your
> 111M? If so, with just one shard and one thread doing the work, it is bound
> to be pretty slow.

Expected aggregated records are about 78 million. After reindexing with 6
shards per index, the query time dropped by ~50%. The result was surprising:
someone wrote that several shards on a single disk have less effect because
they share the same I/O. But I had not considered the threading effect. Are
there recommendations about shard size vs. shard count?

> Also, have you tried moving your not-missing filter out of the agg into
> the query filter, and just using > 0 instead of not missing? Also,
> reducing the precision of the timestamp could possibly help.

Removing the missing filter from the query gives more speed. I cannot
remember why I used this missing filter. In the current test setup the
result set is identical even when the missing filter is used. Is there a
need to use the missing filter here? What happens if the field 'duration' is
missing or null in some records?

What is your recommendation for the timestamp? Should I replace

2014-01-15T14:17:06.245+01:00

with less accuracy, in minutes:

2014-01-15T14:17:00.000+01:00

? Would this affect the field data cache?


The missing filter is fairly costly. I do not believe you need it, as > 0
should take care of excluding nulls.

Only one thread can act on one shard at a time, so the only way to
parallelize your query is to split it onto more shards and let multiple
threads do parallel work on smaller shards. So if your server has, say, 16
cores, you may consider roughly the same number of shards (maybe a bit
fewer), set at index creation time as sketched below.
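
A minimal sketch (index name and counts are placeholders; note the shard
count cannot be changed after the index is created, so it has to be set when
each monthly index is created):

curl -XPUT 'localhost:9200/myindex-2014-12' -d '{
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 0
  }
}'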
If it is I/O bound rather than CPU bound, more memory for OS-level caching
(and probably bumping up the ES heap as well) could help, as could faster
storage - SSDs work great with ES - and at some point you may need several
nodes.

I believe reducing the date precision would decrease the number of unique
terms in the index and may help with the histogram. Say, if your histogram
needs date precision only and not time, I would not even index the time part
(note you may use a multi-field mapping if you need both the precise and the
date-rounded timestamp).


And if you provide plenty of memory for caching of filters and fields (an 8G
heap for 111M records with aggregations does not seem enough), plus OS
memory for caching data files (and/or use SSDs), parallel calculation on
multiple shards should provide a lot better improvement than 50%: maybe not
exactly linear, but at least 3-4x for going from 1 to 6 shards, in my
opinion, assuming you have more than 6 cores. The memory pressure you
mention needs to be removed too. Analyze the stats, but I suspect 8G is just
not enough in your case.

It would be interesting to see if aggregating on a timestamp rounded to the
date would improve things on its own, along the lines of the sketch below.
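
A minimal sketch, assuming a hypothetical day-rounded field named
timestamp_day is indexed alongside the precise one:

{
  "size": 0,
  "aggregations": {
    "durations": {
      "date_histogram": {
        "field": "timestamp_day",
        "interval": "1M"
      }
    }
  }
}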


> I believe reducing the date precision would decrease the number of unique
> terms in the index and may help with the histogram. Say, if your histogram
> needs date precision only and not time, I would not even index the time
> part (note you may use a multi-field mapping if you need both the precise
> and the date-rounded timestamp).

The histogram should be zoomable: by day over the last 90 days, by hour over
the last 72 hours, and by minute over the last 6 hours. Would you suggest
storing timestamps of different precision (day/hour/minute) for performance
reasons? I understand the concept of a multi-field. But in which way are the
values filled? Is there a way to store only the exact timestamp and derive
the less precise timestamps from it? Or do the less precise timestamps have
to be contained in the indexed document?
Do you have any examples of timestamps in multi-fields?

regards,
markus


You could try to specify multiple fields in a multi-field mapping with
string type (or type date) and different formats. I am not sure if it is
going to work, though; I typically do this kind of stuff in the actual data.
Maybe something like:

"timestamp": {
  "type": "date",
  "format": "date",
  "fields": {
    "year": {
      "type": "string",
      "format": "yyyy"
    },
    "year-month": {
      "type": "string",
      "format": "yyyy-MM"
    }
  }
}

(or maybe type date in the sub-fields, if string does not invoke the date
formatter)

I would do it in the data (it makes the documents bigger, but gives you
complete freedom to define your dimensions):

{
  ...
  "callStartTime": {
    "timestamp": "full timestamp",
    "time": "rounded to seconds",
    "weekOfMonth": 3,
    "month": 11,
    "year": 2014
  }
}

Then you can choose to not index the timestamp at all and index the rest.

If your histogram is based on "absolute" dates/times, not on dates/times relative to today, you could use a terms aggregation instead of ranges, which should be faster, potentially much faster, as in the sketch below.
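
A sketch of a terms aggregation over the pre-computed month dimension (the
callStartTime fields are the hypothetical ones from the example above):

{
  "size": 0,
  "aggregations": {
    "per-month": {
      "terms": {
        "field": "callStartTime.month",
        "size": 12
      },
      "aggregations": {
        "duration-stats": {
          "stats": { "field": "duration" }
        }
      }
    }
  }
}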
