Node is getting hurt pretty bad from diagnostics information


(Ashwin Sathya) #1

Hi,

I have a 3-node cluster of XL VMs (14 GB each, with ES_MAX_MEM set to 7 GB) and I have pushed over 1 billion documents (around 15 indices, each with 1 shard and 2 replicas; please bear with the sharding model), roughly 300 GB of data. I noticed in the HQ plugin that one of my machines was hurting badly, so I decided to take a peek.

The highlighted numbers are flagged as red in HQ.

                        Node 1        Node 2        Node 3

ES Uptime:              1.90 days     1.91 days     1.89 days
CPU:                    Opteron       Opteron       Opteron
Cores:                  8             8             8
Store Size:             111.4gb       105.2gb       105.2gb
Documents:              416440194     416651034     416651034
Documents Deleted:      0%            0%            0%
Merge Size:             66.8gb        69.7gb        73.8gb
Merge Time:             10.9d         3h            3.4h
Merge Rate:             0.1 MB/s      6.7 MB/s      6.3 MB/s
File Descriptors:       -1            -1            -1

Index Activity

Indexing - Index:       30.98ms       0.32ms        0.33ms
Indexing - Delete:      0ms           0ms           0ms
Search - Query:         5699.39ms     30.44ms       29.19ms
Search - Fetch:         4996.11ms     16.99ms       15.04ms
Get - Total:            0ms           0ms           0ms
Get - Exists:           0ms           0ms           0ms
Get - Missing:          0ms           0ms           0ms
Refresh:                7366.72ms     64.32ms       71.41ms
Flush:                  108949.93ms   1484.93ms     1613.02ms

Cache Activity

Field Size:             0b            0b            0b
Field Evictions:        0             0             0
Filter Cache Size:      25.7mb        29.1mb        30.3mb
Filter Evictions:       0 per query   0 per query   0 per query
ID Cache Size:          0b            0b            0b
% ID Cache:             0%            0%            0%

Memory

Total Memory:           14 gb         14 gb         14 gb
Heap Size:              2.8 gb        2.9 gb        2.9 gb
Heap % of RAM:          20.2%         20.7%         20.4%
% Heap Used:            66.2%         68.6%         59.2%
GC MarkSweep Frequency: 671 s         965 s         786 s
GC MarkSweep Duration:  970.69ms      39.81ms       36.12ms
GC ParNew Frequency:    2 s           3 s           2 s
GC ParNew Duration:     277.23ms      22.51ms       31.84ms
Swap Space:             4915.43 mb    4781.46 mb    4710.99 mb

Network

HTTP Connection Rate:   0 /second     0 /second     0 /second
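To make the comparison easier to see, here is a throwaway sketch that flags any per-node value sitting far above the cluster median (values are hard-coded from the table above, and the factor of 10 is an arbitrary threshold I picked):

```python
# Toy sketch: flag per-node metrics far above the cluster median.
# Values hard-coded from the table above (node order: 1, 2, 3).
from statistics import median

stats = {
    "search_query_ms": [5699.39, 30.44, 29.19],
    "search_fetch_ms": [4996.11, 16.99, 15.04],
    "refresh_ms": [7366.72, 64.32, 71.41],
    "flush_ms": [108949.93, 1484.93, 1613.02],
    "gc_marksweep_ms": [970.69, 39.81, 36.12],
}

def outliers(values, factor=10.0):
    """Return indices of values more than `factor` times the median."""
    m = median(values)
    return [i for i, v in enumerate(values) if m > 0 and v > factor * m]

for metric, values in stats.items():
    for i in outliers(values):
        print(f"node {i + 1}: {metric} = {values[i]} (median {median(values)})")
```

On these numbers it flags node 1 on every one of the five metrics and nothing on nodes 2 and 3, which matches what HQ highlights in red.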

I am also attaching a snippet of the log from the first machine in the list; something in it looks weird.

How do I make sense of this? What is going wrong?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Clinton Gormley) #2

Hiya

You have swap turned on, which is killing your machines during garbage
collection. Prevent swapping, either by using bootstrap.mlockall to lock
the heap in memory or by disabling swap entirely.
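For reference, the setting is a one-liner in elasticsearch.yml (a sketch; whether memory locking actually takes effect depends on the OS and user limits, so verify it on your own nodes):

```yaml
# elasticsearch.yml (sketch): lock the JVM heap into RAM so the OS
# cannot swap it out. Also set ES_MIN_MEM equal to ES_MAX_MEM so the
# whole heap is allocated (and locked) up front.
bootstrap.mlockall: true
```

After a restart you can check whether it took effect via the nodes info API (the process section reports an mlockall flag); if it comes back false, fall back to disabling swap at the OS level.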

Also, it looks like you are only querying the first node, which is probably
why it has to do more garbage collection than the other two nodes.

Clint

On 16 September 2013 22:57, R Ashwin Sathya <ashwin.sathya@outlook.com> wrote:




(Ashwin Sathya) #3

Thanks Clinton,

I will try out the swap-space fix.

You are also correct that we always write to one of the nodes (I run on Windows, each node has its own endpoint, and I am looking into alternate designs). Interestingly, though, the node that spiked is not the one whose endpoint I direct my writes to.

Seems strange. Any explanations?
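For what it's worth, the alternate design I am considering is simply rotating requests across all three endpoints on the client side, something like this (the hostnames are placeholders for the real VM endpoints):

```python
# Sketch: round-robin requests across all node endpoints client-side,
# instead of pinning every request to a single node.
# Hostnames below are placeholders, not real endpoints.
from itertools import cycle

NODES = cycle([
    "http://es-node1:9200",
    "http://es-node2:9200",
    "http://es-node3:9200",
])

def next_endpoint():
    """Return the next node URL in round-robin order."""
    return next(NODES)
```

Each bulk or search request would then go to `next_endpoint()` rather than a fixed host, so indexing and query load spread evenly across the cluster.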

Thanks,

Ashwin Sathya

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com]
On Behalf Of Clinton Gormley
Sent: September 17, 2013 22:07
To: elasticsearch@googlegroups.com
Subject: Re: Node is getting hurt pretty bad from diagnostics information



(system) #4