ElasticSearch 0.19.2 heap space shortage, becoming unresponsive and not recovering or releasing memory

Hi,

On Tuesday, May 1, 2012 1:56:59 PM UTC-4, Sujoy Sett wrote:

We were running jmeter tests on the elasticsearch queries that are being used
in our application. We ran single-user as well as five-concurrent-user
tests via jmeter.
Following are the findings:

  1. Regarding the data sample that I posted earlier in the mail trail, and
     the kind of query I posted, a node with a 2GB max heap size is able to
     serve a query on a 100000-document data volume. On increasing the data
     volume, the node is facing OOM. My question is, will dividing the data
     into more shards, and adding more nodes (with the same configuration),
     help me avoid hitting OOM?

Yes, adding more hardware will help. Which is not to say that you need
to add more hardware - it could be that you can avoid OOMs with your
existing hardware - maybe something can be tuned.

  2. I have used two configurations here: one - multiple nodes in one
     machine, with less heap space per node; two - a single node in one
     machine, with more heap space. Which one is better in terms of
     concurrent requests and heavy requests (terms facets), and what is the
     best shard configuration?

I'd go with the latter.

  3. Regarding recovery from OOM, elasticsearch is showing random behavior.
     We have switched off dumping the heap to a file. Still, sometimes ES
     recovers from OOM, sometimes not. How can we ensure avoidance of OOM
     caused by requests only? I mean something like: when a query is tending
     toward OOM, identifying and aborting that query only, without making ES
     unresponsive. Does it sound absurd?

I think so, but maybe I'm missing something. I'm actually a bit suspicious
about the JVM recovering from an OOM. Can that really happen? I don't
think I've ever seen that happen - JVM OOMs when it really cannot perform
enough GC to make enough room for new objects. So I can't see how the JVM
could truly recover from OOM.

  4. Our ES installation has some 50 indexes in total. After a shutdown, it
     typically takes some 5-10 minutes to get to the green state, and before
     that, queries tend to result in UnavailableShardException. Can we
     control or speed up the recovery of some indexes with higher priority
     than others?

Yes. Have a look at this thread:
https://groups.google.com/forum/?fromgroups#!searchin/elasticsearch/es$20shard$20dance/elasticsearch/iyWxQaF-FGU/9nToZElZu2YJ
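Related to that: the cluster-level gateway settings control when a freshly
restarted cluster starts allocating shards, which usually shortens the window
in which queries hit UnavailableShardException. A minimal elasticsearch.yml
sketch (values are placeholders; note this is cluster-wide, not per-index
prioritization):

# elasticsearch.yml - illustrative values only
gateway.recover_after_nodes: 2    # wait until at least 2 nodes have joined
gateway.expected_nodes: 2         # start recovery as soon as all expected nodes are present
gateway.recover_after_time: 5m    # otherwise wait at most 5 minutes before recovering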

Otis

Performance Monitoring for Solr / Elasticsearch / HBase -
Sematext Monitoring | Infrastructure Monitoring Service

Thanks,

On Saturday, April 28, 2012 11:31:45 PM UTC+5:30, Sujoy Sett wrote:

Hi,

One quick observation: when a single node is maintained for a cluster,
recovery from OOM happens normally, though it is not that fast.
But when the cluster has two nodes, upon OOM the nodes come to a
standstill (no response available, CPU usage minimal, memory blocked at the
maximum allowed size). On shutting down one node, the other returns to a
responsive state.
We changed multicast discovery to unicast and played a little with the
discovery timeout parameters, to no avail.
What are we missing here, any suggestions?

Thanks and Regards,

On Friday, April 27, 2012 9:47:54 PM UTC+5:30, jagdeep singh wrote:

Hi Otis,

Thanks a lot for your response.
We will definitely try the approaches you have suggested and update
you soon.

Thanks and Regards
Jagdeep

On Apr 27, 9:12 pm, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:

Hi Sujoy,

Say hi to Ian from Otis please :wink:

And about monitoring - we've used SPM for Elasticsearch to see and
understand the behaviour of the ES cache(s). Since we can see trend graphs
in SPM for ES, we can see how the cache size changes when we run queries
vs. when we use sort vs. when we facet on field X or X and Y, etc. And we
can see that on a per-node basis, too. So having and seeing this data over
time also helps with your "Just out of inquisitiveness, what is ES doing
internally?" question. :slight_smile:

You can also clear the FieldCache for a given field and set a TTL on it.
And since you mention using this for a tag cloud, normalizing your tags to
reduce their cardinality will also help. We just did all this stuff for a
large client (tag normalization, soft cache, cache clearing, adjustment of
field types to those that use less memory, etc.) and SPM for ES came in
very handy, if I may say so! :slight_smile:
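For reference, the cache knobs mentioned above look roughly like this in the
0.19 era (the index name and the values are placeholders; double-check the
setting names against your exact version):

# elasticsearch.yml (or index settings) - illustrative values only
index.cache.field.type: soft        # let the JVM reclaim field cache entries under memory pressure
index.cache.field.max_size: 50000   # cap the number of entries kept per segment
index.cache.field.expire: 10m       # TTL - entries unused for 10 minutes get evicted

# drop the field data cache for one index without restarting the node
curl -XPOST 'http://localhost:9200/my_index/_cache/clear?field_data=true'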

Otis

On Friday, April 27, 2012 6:42:35 AM UTC-4, Sujoy Sett wrote:

Hi,

Can you please explain how to check the field data cache? Do I have to set
anything to monitor it explicitly?
I often use the mobz-elasticsearch-head-24935c4 plugin to monitor cluster
state and health, but I didn't find anything like index.cache.field.max_size
there in the cluster_state details.

Thanks and Regards,

On Friday, April 27, 2012 3:52:04 PM UTC+5:30, Rafał Kuć wrote:

Hello,

Did you look at the size of the field data cache after sending the
example query?

Regards,
Rafał

On Friday, April 27, 2012 12:15:38 PM UTC+2, Sujoy Sett wrote:

Hi,

We have been using elasticsearch 0.19.2 for storing and analyzing data
from social media blogs and forums. The data volume goes up to
500000 documents per index, and the size of this volume of data in the
Elasticsearch index goes up to 3 GB per index per node (all shards). We
always keep the number of replicas 1 less than the total number of nodes,
to ensure that a copy of every shard resides on every node at any instant.
The number of shards is generally 10 for indexes of the size mentioned
above.
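For reference, the indexes are created with settings along these lines (the
index name is a placeholder; number_of_replicas is the node count minus 1,
e.g. 1 for a two-node cluster):

curl -XPUT 'http://localhost:9200/my_index' -d '{
    "settings" : {
        "number_of_shards" : 10,
        "number_of_replicas" : 1
    }
}'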

We try different queries on these data for advanced visualization
purposes, mainly facets for showing trend charts or keyword clouds.
Following are some examples of the queries we execute:
{
    "query" : {
        "match_all" : { }
    },
    "size" : 0,
    "facets" : {
        "tag" : {
            "terms" : {
                "field" : "nouns",
                "size" : 100
            },
            "_cache" : false
        }
    }
}

{
    "query" : {
        "match_all" : { }
    },
    "size" : 0,
    "facets" : {
        "tag" : {
            "terms" : {
                "field" : "phrases",
                "size" : 100
            },
            "_cache" : false
        }
    }
}

While executing such queries we often encounter heap space shortage,
and the nodes become unresponsive. Our main concern is that the nodes
do not recover to a normal state even after dumping the heap to an hprof
file. The node still consumes the maximum allocated memory, as shown for
the java.exe process in the task manager, and the nodes remain
unresponsive until we manually kill and restart them.

ES Configuration 1:
Elasticsearch Version 0.19.2
2 Nodes, one on each physical server
Max heap size 6GB per node.
10 shards, 1 replica.

ES Configuration 2:
Elasticsearch Version 0.19.2
6 Nodes, three on each physical server
Max heap size 2GB per node.
10 shards, 5 replicas.

Server Configuration:
Windows 7 64 bit
64 bit JVM
8 GB physical memory
Dual Core processor

For both configurations mentioned above, Elasticsearch was unable to
respond to the facet queries shown above; it was also unable to
recover when a query failed due to heap space shortage.

We are facing this issue in our production environments, and request
you to please suggest a better configuration or a different approach
if required.

The mapping of the data we use is as follows
(keyword1 is a customized keyword analyzer; similarly, standard1 is a
customized standard analyzer):

{
    "properties": {
        "adjectives": {
            "type": "string",
            "analyzer": "stop2"
        },
        "alertStatus": {
            "type": "string",
            "analyzer": "keyword1"
        },
        "assignedByUserId": {
            "type": "integer",
            "index": "analyzed"
        },
        "assignedByUserName": {
            "type": "string",
            "analyzer": "keyword1"
        },
        "assignedToDepartmentId": {
            "type": "integer",
            "index": "analyzed"
        },
        "assignedToDepartmentName": {
            "type": "string",
            "analyzer": "keyword1"
        },
        "assignedToUserId": {
            "type": "integer",
            "index": "analyzed"
        },
        "assignedToUserName": {
            "type": "string",
            "analyzer": "keyword1"
        },
        "authorJsonMetadata": {
            "properties": {
                "favourites": {
                    "type": "string"
                },
                "followers": {
                    "type": "string"
                },
                "following": {
                    "type": "string"
                },
                "likes": {
                    "type": "string"
                },
                "listed": {
                    "type": "string"
                },
                "subscribers": {
                    "type": "string"
                },
                "subscription": {
                    "type": "string"
                },
                "uploads": {
                    "type": "string"
                },
                "views": {
                    "type": "string"
                }
            }
        },
        "authorKloutDetails": {
            "dynamic": "true",
            "properties": {
                "amplificationScore": {
                    "type": "string"
                },
                "authorKloutDetailsFound": {
                    "type": "string"
                },
                "description": {
                    "type": "string"
                },
                "influencees": {
                    "dynamic": "true",
                    "properties": {
                        "kscore": {
                            "type": "string"
                        },
                        "twitter_screen_name": {
                            "type": "string"
                        }
                    }
                },
                "influencers": {
                    "dynamic": "true",
                    "properties": {
                        "kscore": {
                            "type": "string"
                        },
                        "twitter_screen_name": {
                            "type": "string"
                        }
                    }
                },
                "kloutClass": {
                    "type": "string"
                },
                "kloutClassDescription": {
                    "type": "string"
                },
                "kloutScore": {
                    "type": "string"
                },
                "kloutScoreDescription": {
                    "type": "string"
                },
                "kloutTopic": {
                    "type": "string"
                },
                "slope": {
                    "type": "string"
                },
                "trueReach": {
                    "type": "string"
                },
                "twitterId": {
                    "type": "string"
                },
                "twitterScreenName": {
                    "type":

...


Thanks a lot Otis.

Regarding the bit on JVM recovery after OOM, following is the scenario:

  1. We hit a terms facet query on a keyword-analyzed string array field,
     on a 100000-document index.
  2. The node stats API (see the sketch after this list) shows a lot of ups
     and downs in the field cache size.
  3. Simultaneously, messages like [2012-05-02 12:59:59,008][INFO
     ][monitor.jvm] [es_node_67] [gc][ConcurrentMarkSweep][3928][150] duration
     [5.1s], collections [1]/[5.7s], total [5.1s]/[47.9s], memory
     [5.9gb]->[5.9gb]/[5.9gb] are printed on the ES console, several times.
  4. The result from the query is still not available.
  5. After some more waiting, messages like [2012-05-02
     11:48:55,908][WARN ][transport.netty] [es_node_67] Exception caught on
     netty layer [[id: 0x425e2b6e, /0:0:0:0:0:0:0:0:55091 =>
     172.29.177.102:9300]] java.lang.OutOfMemoryError: loading field [phrases]
     caused out of memory failure come up.
  6. During this, no API calls generate any response. Even the node stats
     API remains dumb.
  7. After this non-responsive stage the .hprof file is generated, if the
     heap dump parameter is set to true.
  8. ES recovers from the non-responsive stage and performs as usual. But
     the main concern is that this automated recovery cycle takes a lot of
     time, and often fails in the case of a larger setup with multiple nodes.
     In that case we have to manually restart all the instances, which is the
     very thing we are trying to avoid.
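For reference, this is roughly how we watch the field cache and the heap
between runs (a minimal sketch against the 0.19-era node stats endpoint; the
exact flags and response field names vary slightly across versions):

curl -XGET 'http://localhost:9200/_cluster/nodes/stats?pretty=true'

# per node, the interesting parts of the response are roughly:
#   indices.cache.field_size_in_bytes   <- field data cache size
#   jvm.mem.heap_used_in_bytes          <- heap pressure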

Any suggestions?

Thanks and Regards,


Sujoy,

Any suggestions?

Yes! Get more hardware! :wink: Or just increase -Xmx first.
But before you do that..... are you trying to facet on that "phrases"
field? What's in it? Are all phrases normalized (e.g. lowercased)? Do
whatever you can to reduce their cardinality.
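As an illustrative sketch of that normalization (index and analyzer names are
placeholders): a custom analyzer with the keyword tokenizer plus a lowercase
filter keeps each phrase as a single token but collapses "Big Data", "big
data" and "BIG DATA" into one term, which directly cuts the number of entries
the field cache has to hold:

curl -XPUT 'http://localhost:9200/my_index' -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "keyword_lc" : {
                    "type" : "custom",
                    "tokenizer" : "keyword",
                    "filter" : ["lowercase"]
                }
            }
        }
    }
}'

# and in the mapping:  "phrases" : { "type" : "string", "analyzer" : "keyword_lc" }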

Otis

Performance Monitoring for Solr / Elasticsearch / HBase -
Sematext Monitoring | Infrastructure Monitoring Service


Thanks Otis.

-Xmx is 1536m; that is the maximum that can currently be allocated on the
32-bit system. We are planning to bring in a 64-bit OS and a 64-bit JVM,
but that is in the pipeline for our production servers.
Hardware is available, but we were trying to pin down our flaws in the
configuration before investing in more resources.

The "phrases" field contains arrays of strings, analyzed with the keyword
analyzer, not normalized. We will surely try to reduce their cardinality,
as you suggested.

By ensuring this we might raise our capacity, but we will again hit the
ceiling with an increased number of documents.

What I am looking for is not taking precautions against hitting the
ceiling, but graceful recovery from within elasticsearch on facing it.
Something like aborting execution of the request on nearing the maximum
heap size, and communicating that in the response, just like other
exceptions.

Regards,


Sujoy,

~3.5 GB of RAM is what you can allocate on a 32-bit system, not 1536m.
Actually, more than ~3.5 GB in some systems, see
Physical Address Extension - Wikipedia

But this graceful recovery or detecting the "watch out, heap almost full!"
situation... I'm not sure that's really the way to go, or doable.
Tuning, monitoring, alerting, and scaling horizontally... that tends to
work.

Otis

Performance Monitoring for Solr / Elasticsearch / HBase -
Sematext Monitoring | Infrastructure Monitoring Service


Hi,

Yes, that graceful recovery was the thing I was actually looking for. It
really becomes a pain when someone has to manually restart ES on
production environments three times a day to recover from OOM.

A few achievements so far:

Earlier we were setting up nodes as a combination of master and data, i.e.
the default configuration where nodes act both as master and data. We moved
away from that configuration, and now each physical server runs 1 master
node instance and 1 or more data node instances. With this configuration a
data node might become unresponsive due to OOM, but the master node doesn't
suffer the problem and continues providing responses by gathering results
from the other data nodes that are available. In case of multiple data node
failures, results are returned with explicit OOM messages, which is even
better. Using the timeout property in search queries (see the sketch below)
didn't behave as expected, I don't know why, but provided it works, it would
be the perfect solution to our problem.
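For reference, this is the kind of timeout I mean (a minimal sketch; the
index name and the timeout value are placeholders, and the response is
expected to carry a timed_out flag plus whatever partial results were
collected):

curl -XPOST 'http://localhost:9200/my_index/_search' -d '{
    "timeout" : "2s",
    "query" : { "match_all" : { } },
    "size" : 0,
    "facets" : {
        "tag" : { "terms" : { "field" : "phrases", "size" : 100 } }
    }
}'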

Can anyone please explain the detailed task division between master and
data nodes, and the configuration of a load-balancer node (neither master
nor data)? We did not actually find much use for it with our setup.
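For clarity, the split we applied is just the node.master / node.data flags
in elasticsearch.yml; a sketch of the three roles as we understand them:

# dedicated master: cluster coordination only, holds no shards
node.master: true
node.data: false

# data node: holds shards and does the heavy facet work
node.master: false
node.data: true

# "load balancer" / client node: neither master nor data, only routes
# requests and merges results from the data nodes
node.master: false
node.data: false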

We are currently working on the alerting and monitoring bit, by registering
the ES nodes as a Windows service, and incorporating auto-recovery and
email alerts into it.

Thanks,


Hi,

On Saturday, May 5, 2012 4:22:55 AM UTC-4, Sujoy Sett wrote:

Hi,

Yes, that graceful recovery was the thing I was actually looking for. It
really becomes a pain when someone has to manually restart ES on production
environments three times a day to recover from OOM.

I think one really just needs to set things up in a way that avoids OOM.
I don't think recovering from it is doable (if recovery were possible, it
would be done on the JVM level and we would never see OOM in the first
place), but I'd be curious to know if you figure something out.
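
On the "set things up to avoid OOM" point, the knobs mentioned earlier in this
thread are the field data cache settings. A hedged example of what that could
look like in elasticsearch.yml on the 0.19 line (the numbers are placeholders
and would need tuning against the real data):

# allow cache entries to be collected under memory pressure, cap and expire them
index.cache.field.type: soft
index.cache.field.max_size: 50000
index.cache.field.expire: 10m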

Re monitoring and alerting, see URL in signature below. The installer is a
Bash script though, so if you run ES under Windows...

Otis

Performance Monitoring for Solr / Elasticsearch / HBase -


Hi Otis,

I am not able to use SPM because my application is in a secured network. Is
there any desktop alternative that I can install to monitor the ES servers?

Thanks and Regards
Jagdeep
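
One lightweight option inside a locked-down network is to poll the ES HTTP
APIs from a script on the servers themselves and alert on the numbers. A
minimal sketch, assuming the default HTTP port and the endpoints as documented
for the 0.19 line:

# cluster-level view: green / yellow / red, relocating and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?pretty=true'

# per-node JVM heap usage and field data cache size
curl -s 'http://localhost:9200/_cluster/nodes/stats?pretty=true'

The elasticsearch-head plugin mentioned earlier in the thread also works
entirely against the local HTTP API, so it can be used from inside the secured
network as well.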
