Trying to find the reason for a performance hiccup

Hi

I am running 36 instances of elasticsearch-0.19.9. Around 11 AM, one
instance was terminated by an AWS system check and was replaced
immediately. Since then, I have seen four performance hiccups. Please look
at the attached graph. There is a bounded queue between the logger and the
elasticsearch client. If the elasticsearch client gets slower responses
from the servers, the queue fills up. The red line shows dropped messages,
and the green line plummeting means the server was not handling any
traffic.
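
To make the setup clearer, the queue logic is roughly like the sketch below
(a simplified illustration, not our actual logger code; the class and member
names are made up). Producers offer messages to a bounded queue; when the
thread feeding the elasticsearch client falls behind, offers fail and the
failures are counted as drops.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the logger-side bounded queue described above.
// Class and member names are illustrative only.
public class LogBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    public LogBuffer(int capacity) {
        queue = new ArrayBlockingQueue<String>(capacity);
    }

    // Producer side: never blocks. If the consumer (the thread feeding the
    // elasticsearch client) is slow and the queue is full, the message is
    // dropped and counted; that counter is the red line on the graph.
    public void append(String message) {
        if (!queue.offer(message)) {
            dropped.incrementAndGet();
        }
    }

    // Consumer side: drains the queue and hands messages to the ES client.
    // If indexing slows down, this drains more slowly and the queue fills up.
    public String next() throws InterruptedException {
        return queue.take();
    }

    public long droppedCount() {
        return dropped.get();
    }
}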

How can I trace this? And what should I do to prevent these performance
hiccups?

I was trying to find out what happened after 11 AM, but I couldn't find
anything except connection timeouts to the dead instances. On the client
side there were no failures, just a performance degradation. The following
is my logging.yml.

Thank you
Best, Jae


rootLogger: INFO, console, file
logger:
  # log action execution errors for easier debugging
  action: DEBUG

  # reduce the logging for aws, too much is logged under the default INFO
  com.amazonaws: WARN

  # gateway
  gateway: DEBUG
  #index.gateway: DEBUG

  # peer shard recovery
  #indices.recovery: DEBUG

  # discovery
  discovery: TRACE

  index.search.slowlog: TRACE, index_search_slow_log_file

  org.apache: WARN

additivity:
  index.search.slowlog: false

appender:
  console:
    type: console
    layout:
      type: consolePattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  index_search_slow_log_file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}_index_search_slowlog.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

--

Hi,

What about other ES, JVM, or OS metrics? They may reveal the source.
Maybe shards were being moved around the cluster during those four periods?
You can get this info from ES. We graph it in SPM, and we have found it very
informative when troubleshooting performance issues.
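
For example, during the hiccup windows you could poll the cluster health and
hot threads REST endpoints (both present in 0.19.x) and watch
relocating_shards / initializing_shards over time. A rough sketch; the host
and polling interval are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Rough sketch: periodically poll two Elasticsearch REST endpoints and
// print the raw responses with a timestamp. "localhost:9200" and the
// 10-second interval are placeholders.
public class ClusterWatch {
    static String fetch(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            // relocating_shards / initializing_shards reveal shard movement
            System.out.println(System.currentTimeMillis() + " "
                    + fetch("http://localhost:9200/_cluster/health?pretty=true"));
            // hot threads show what each node is busy doing (merges, recovery, ...)
            System.out.println(fetch("http://localhost:9200/_nodes/hot_threads"));
            Thread.sleep(10000);
        }
    }
}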

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Tuesday, September 25, 2012 5:02:13 PM UTC-4, Jae wrote:

--

If it coincided with a node leaving and rejoining the cluster, it was most
likely due to the cold shards that came online. If you're doing sorts or
facets, they pull all of the values for that field into memory, so the first
queries to hit those shards pay that loading cost. I believe this feature
will mitigate it:
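
Until then, a manual stopgap is to warm the field data yourself: right after
the replacement node finishes recovery, run one representative sorted (or
faceted) query against it so the first production queries don't pay the
loading cost. A rough sketch only; the host, index name, and sort field are
placeholders:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch: fire one representative sorted query at a newly recovered
// node so the field data for the sort field is loaded before real traffic.
// Host, index name ("logs"), and sort field ("timestamp") are placeholders.
public class WarmUp {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/logs/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        Writer out = new OutputStreamWriter(conn.getOutputStream());
        // sorting forces the per-field values to be loaded into memory
        out.write("{\"size\": 1, \"query\": {\"match_all\": {}}, "
                + "\"sort\": [{\"timestamp\": {\"order\": \"desc\"}}]}");
        out.close();
        System.out.println("warm-up query returned HTTP " + conn.getResponseCode());
    }
}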

Best Regards,
Paul

On Tuesday, September 25, 2012 9:31:37 PM UTC-6, Otis Gospodnetic wrote:

--