Trying to find the reason for a performance hiccup

Hi

I am running 36 instances of elasticsearch-0.19.9. Around 11 AM, one
instance was terminated by an AWS system check and was replaced
immediately. Since then, I have seen four performance hiccups. Please look
at the attached graph. There is a bounded queue between the logger and the
elasticsearch client. If the elasticsearch client gets slower responses
from the servers, the queue fills up. The red line shows dropped messages,
and the green line plummeting means the server was not handling any
traffic.
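
To make the setup clearer, the queue logic is roughly like the sketch below
(a simplified illustration, not our actual logger code; the class and member
names are made up). Producers offer messages to a bounded queue; when the
thread feeding the elasticsearch client falls behind, offers fail and the
failures are counted as drops.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the logger-side bounded queue described above.
// Class and member names are illustrative only.
public class LogBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    public LogBuffer(int capacity) {
        queue = new ArrayBlockingQueue<String>(capacity);
    }

    // Producer side: never blocks. If the consumer (the thread feeding the
    // elasticsearch client) is slow and the queue is full, the message is
    // dropped and counted; that counter is the red line on the graph.
    public void append(String message) {
        if (!queue.offer(message)) {
            dropped.incrementAndGet();
        }
    }

    // Consumer side: drains the queue and hands messages to the ES client.
    // If indexing slows down, this drains more slowly and the queue fills up.
    public String next() throws InterruptedException {
        return queue.take();
    }

    public long droppedCount() {
        return dropped.get();
    }
}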

How can I trace this? And what should I do to prevent these performance
hiccups?

I was trying to find out what happened after 11 AM, but I couldn't find
anything except connection timeouts to the dead instances. On the client
side there were no failures, just a performance degradation. The following
is my logging.yml.

Thank you
Best, Jae


rootLogger: INFO, console, file
logger:
  # log action execution errors for easier debugging
  action: DEBUG

  # reduce the logging for aws, too much is logged under the default INFO
  com.amazonaws: WARN

  # gateway
  gateway: DEBUG
  #index.gateway: DEBUG

  # peer shard recovery
  #indices.recovery: DEBUG

  # discovery
  discovery: TRACE

  index.search.slowlog: TRACE, index_search_slow_log_file

  org.apache: WARN

additivity:
  index.search.slowlog: false

appender:
  console:
    type: console
    layout:
      type: consolePattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  index_search_slow_log_file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}_index_search_slowlog.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

--

Hi,

What about other ES, JVM, or OS metrics? They may reveal the source.
Maybe shards were being moved around the cluster during those four periods?
You can get this info from ES. We graph it in SPM, and we have found it very
informative when troubleshooting performance issues.
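
For example, during the hiccup windows you could poll the cluster health and
hot threads REST endpoints (both present in 0.19.x) and watch
relocating_shards / initializing_shards over time. A rough sketch; the host
and polling interval are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Rough sketch: periodically poll two Elasticsearch REST endpoints and
// print the raw responses with a timestamp. "localhost:9200" and the
// 10-second interval are placeholders.
public class ClusterWatch {
    static String fetch(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            // relocating_shards / initializing_shards reveal shard movement
            System.out.println(System.currentTimeMillis() + " "
                    + fetch("http://localhost:9200/_cluster/health?pretty=true"));
            // hot threads show what each node is busy doing (merges, recovery, ...)
            System.out.println(fetch("http://localhost:9200/_nodes/hot_threads"));
            Thread.sleep(10000);
        }
    }
}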

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Tuesday, September 25, 2012 5:02:13 PM UTC-4, Jae wrote:

--

If it coincided with a node leaving and rejoining the cluster, it was most
likely due to the cold shards that came online. If you're doing sorts or
facets, they pull all of the values for that field into memory, so the first
queries to hit those shards pay that loading cost. I believe this feature
will mitigate it:
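
Until then, a manual stopgap is to warm the field data yourself: right after
the replacement node finishes recovery, run one representative sorted (or
faceted) query against it so the first production queries don't pay the
loading cost. A rough sketch only; the host, index name, and sort field are
placeholders:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch: fire one representative sorted query at a newly recovered
// node so the field data for the sort field is loaded before real traffic.
// Host, index name ("logs"), and sort field ("timestamp") are placeholders.
public class WarmUp {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/logs/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        Writer out = new OutputStreamWriter(conn.getOutputStream());
        // sorting forces the per-field values to be loaded into memory
        out.write("{\"size\": 1, \"query\": {\"match_all\": {}}, "
                + "\"sort\": [{\"timestamp\": {\"order\": \"desc\"}}]}");
        out.close();
        System.out.println("warm-up query returned HTTP " + conn.getResponseCode());
    }
}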

Best Regards,
Paul

On Tuesday, September 25, 2012 9:31:37 PM UTC-6, Otis Gospodnetic wrote:

--