High HD utilization

In production I have a slowly but constantly growing index. It's now about
6M documents of a few KB each; the index size is about 6 GB.
Today the server became almost unusable. I see high hard-drive use by the ES
process.
I've restarted the server and tried adjusting the memory-allocation settings,
but the problem persists. Sometimes the server works just fine, but most of
the time it barely responds.
dstat --aio --io --disk --tcp --top-io-adv --top-bio-adv
shows such output:
-async- --io/total- -dsk/total- ----tcp-sockets---- -------most-expensive-i/o-process------- ----most-expensive-block-i/o-process----
 #aio | read  writ | read  writ |lis act syn tim clo| process  pid    read write  cpu | process  pid    read write  cpu
   0  |  256     0 |2816k     0 | 12   2   2   1  46| java    18891    16M  408k  13% | java    18891  2592k     0  13%
   0  |  219     0 |2220k     0 | 12   2   4   1  45| java    18891    18M  365k 8.2% | java    18891  2352k 4096B 8.2%
   0  |  212  2.00 |2624k 4096B | 12   2   1   1  45| java    18891    24M  280k  21% | java    18891  3212k     0  21%
   0  |  218     0 |2896k     0 | 12   2   0   1  48| java    18891    24M  293k  29% | java    18891  2392k     0  29%
   0  |  233     0 |2712k     0 | 12   2   1   1  50| java    18891    43M  434k  46% | java    18891  2552k     0  46%
   0  |  198  2.00 |2008k 4096B | 12   2   3   1  47| java    18891    25M  547k  33% | java    18891  1736k 4096B 33%
   0  |  223     0 |2304k     0 | 12   2   3   1  43| java    18891    26M  440k  31% | java    18891  2048k     0  31%
   0  |  196  16.0 |2440k  324k | 12   2   1   1  40| java    18891    49M  444k  39% | java    18891  2812k   84k 39%

18891 is the ES process.
top shows:
Cpu(s): 16.0%us, 0.5%sy, 0.0%ni, 0.4%id, 83.1%wa, 0.0%hi, 0.1%si, 0.0%st
%wa is constantly very high.

Could you please advise what could be wrong and how it could be fixed?
Thank you

--

I don't know if it helps, but here is another output of
vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r   b   swpd   free   buff   cache    si  so   bi  bo    in   cs  us sy id wa st
 1  42  13940 127568  20080 4056444     0   0  187  87     0    0  19  1 60 19  0
 1 103  13940 136020  20068 4026936     0   0 2274   7  5317 7494  36  1  1 62  0
 2  89  13940 125392  20076 4032684     0   0 2178  10  3978 7289  18  0  1 81  0
 1 101  13940 136032  20088 4018004     0   0 1795  90  4292 7702  19  1  2 78  0
 1 100  13980 275952  20020 3871152     0   8 2020  70  4621 7690  22  1  0 77  0
 1 110  13980 252588  20032 3881604     0   0 2566  45  5080 8707  24  1  1 75  0
 4  90  13980 240628  20052 3887184     0   0 2228  45  4162 7793  20  1  0 79  0

It clearly shows that no swapping takes place, but there is a lot of reading
from the HD, and some constant writing, pushing the wa numbers to the top.
Thank you

On Monday, December 10, 2012 5:32:12 PM UTC-5, Eugene Strokin wrote:


--

And finally
iostat -x 5
Linux 2.6.32-220.4.2.el6.x86_64 (host000) 12/10/2012 x86_64 (8 CPU)

avg-cpu:  %user %nice %system %iowait %steal %idle
          19.30  0.00    1.14   19.06   0.00 60.49

Device: rrqm/s  wrqm/s    r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
sdb      22.68    6.32  17.29  1.99  635.50   66.43    36.41     0.00  11.98  4.66   8.98
sda      68.52  156.72 124.12 11.72 2387.68 1342.44    27.46     0.16   1.14  0.95  12.88

avg-cpu:  %user %nice %system %iowait %steal %idle
          18.61  0.00    0.65   76.03   0.00  4.71

Device: rrqm/s  wrqm/s    r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
sdb       0.40    0.00 237.00  0.00 4531.20    0.00    19.12    88.58 351.48  4.22 100.00
sda       0.00    1.60   0.00  0.40    0.00   16.00    40.00     0.00  11.00 11.00   0.44

avg-cpu:  %user %nice %system %iowait %steal %idle
          19.07  0.00    0.75   72.92   0.00  7.27

Device: rrqm/s  wrqm/s    r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
sdb       0.80  127.60 218.00  7.00 4065.60  144.00    18.71   101.12 432.54  4.44 100.00
sda       0.00    3.00   0.00  0.80    0.00   30.40    38.00     0.01   8.25  8.25   0.66

avg-cpu:  %user %nice %system %iowait %steal %idle
          16.56  0.00    0.70   81.43   0.00  1.30

Device: rrqm/s  wrqm/s    r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
sdb       0.00    9.80 256.00  3.80 5028.80  156.80    19.96   150.32 492.85  3.85 100.00
sda       0.00    0.00   0.00  0.00    0.00    0.00     0.00     0.00   0.00  0.00   0.00

It also shows that ES (the only process using sdb) is utilizing 100% of the
disk.
My initial thought was that the hard drive was getting corrupted, so I moved
ES to the fresh sdb drive, which wasn't being used by anything. And as soon
as I started ES, I saw utilization jump on it right away.
Another strange thing: nothing special was done before the problem appeared.
Standard maintenance routine, some minor mapping update a few days before.
Yesterday it was working just fine, and today it shows the problem. All I can
think of is that the index size reached some kind of limit.

Please help, I'm desperate here, any help would be greatly appreciated,
even some temporary solution would help. Our production is down for the
whole day.
Thank you

On Monday, December 10, 2012 6:30:11 PM UTC-5, Eugene Strokin wrote:


--

On 11.12.2012, at 00:47, Eugene Strokin wrote:

it also shows that ES (the only process which is using sdb ) utilizing 100% of the disk.
My initial thought was that hard drive is getting corrupted. And I moved ES to the fresh sdb drive, which wasn't be used by anything. And once I started ES I see utilization jumps on it right off.

can you elaborate on the query/indexing load a bit, as well as the server config?
a couple million docs and ~ 6gb index size doesn't sound extraordinary to me, but without knowing what system(s) this runs on it's hard to tell.

Another strange thing, that nothing special was done before the problem appear. Standard maintenance routing. Some minor mapping update few days before. Yesterday it was working just fine, and today, it shows the problem. All I'm thinking of, is just the index size reached some kind of limit.

the only thing that i immediately think of are field caches. so if you have changed the default settings, and do lots of faceting/sorting, that might explain a lot of IO pressure.
another thing to look at would be the refresh interval, but without any data, it's guesswork.
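for the record, the refresh interval can be checked and changed on a live
index via the settings API; a sketch, assuming the 0.19/0.20-era REST syntax
and a placeholder index name `myindex`:

```shell
# Check the current settings of an index (refresh_interval only shows up
# if it was explicitly set; the default in this era was 1s).
curl -XGET 'http://localhost:9200/myindex/_settings?pretty'

# Relax the refresh interval to reduce I/O churn from constantly reopening
# index readers; "-1" disables automatic refresh entirely.
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '
{
  "index": { "refresh_interval": "30s" }
}'
```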

Please help, I'm desperate here, any help would be greatly appreciated, even some temporary solution would help. Our production is down for the whole day.
Thank you

cheers,
-k

--

Hi,

I can see you are using EL 6, but not much more.

Can you give more information about

  • the JVM version you use
  • the workload running during the incident (indexing data volume, search)?
  • the disk configuration: how many disks, RAID, filesystem, mount flags?
  • how many nodes?

It is also helpful to provide a JVM thread dump and heap usage stats (e.g.
via bigdesk), and diagnostics from jvisualvm or jstack.

Best regards,

Jörg

--

Thanks for the reply,
here is more info about the server:

The ES server gets about 100-150 search requests per second now. It had about
10 times more and worked fine. It gets about 6-10 indexing requests per
second, and also had more before. Basically, the performance degradation is
about 7-10x.

The whole ES server runs on a single box, with the default configuration +
custom mappings.

The server has 8 GB RAM:

3 GB for ES, 2 GB for the application server (running on a separate HD now),
the rest is for the system.

VM name: Java HotSpot(TM) 64-Bit Server VM
VM vendor: Sun Microsystems Inc.
VM version: 20.0-b11

Java version: 1.6.0_25

There are 2 disks, no RAID. Everything was running on the same disk from the
beginning. After the incident I moved ES to a separate HD.

hdparm -i /dev/sdb

/dev/sdb

Model=WDC, FwRev=01.01S01, SerialNo=WD-WMAYP0993564

Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }

RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50

BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16

CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=976773168

IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}

PIO modes: pio0 pio3 pio4

DMA modes: mdma0 mdma1 mdma2

UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6

AdvancedPM=yes: unknown setting WriteCache=enabled

Drive conforms to: Unspecified: ATA/ATAPI-1,2,3,4,5,6,7

* signifies the current active mode

I'll try to fetch the rest of information as soon as I can

Thank you

On Monday, December 10, 2012 7:24:14 PM UTC-5, Jörg Prante wrote:


--

All that was changed is that a multivalued field was added to the mapping of
one document type. There are facet searches and a lot of sorting, but that
wasn't changed; it worked OK. I saw some load on the HD before, but it was
never 100%.
I don't believe I've changed the refresh interval; frankly, I don't even know
where to change it or how to check the current value.

Thank you,
Eugene

On Monday, December 10, 2012 7:03:28 PM UTC-5, Kay Röpke wrote:


--

Just some general hints.

On a single box there is no shard-relocation I/O, so that cannot be the
reason. You have only a single disk, which may become a bottleneck in heavy
load situations.

Usually, ES / Lucene utilizes disk I/O in the following situations: while
indexing, when Lucene segments grow and reach a critical size, a massive
merging process starts sooner or later. The larger the segments are allowed
to grow, the bigger the disk I/O utilization.

On the other hand, while searching, even at high queries per second, a disk
will rarely be utilized at 100%. You can verify this: are there any "bad"
queries? Could it be that queries pull back thousands or millions of
documents? Do you observe high network traffic to the search clients?

How to decrease the sudden disk I/O load jump while segment merging? You
have several options:
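
one commonly used knob in this ES era was store-level merge throttling; a
sketch, with setting names taken from the 0.20-era docs (treat them as
assumptions to verify against your version):

```shell
# Throttle merge I/O cluster-wide so segment merges cannot saturate the
# disk. 5mb/sec is a deliberately conservative starting point; raise it if
# merges fall too far behind indexing.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{
  "persistent": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "5mb"
  }
}'
```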

Jörg

--

Thank you,...
In the meantime, I've restored yesterday's backup of ES, which worked fine,
and run production on it.
It works better, but I still see 100% HD utilization. I wonder if the problem
appeared long before, but somehow wasn't noticeable enough.
Also, it shows that almost no non-heap memory is used (see attachment):

https://lh3.googleusercontent.com/-5WFNwI7bDyE/UMaRRooq4BI/AAAAAAAAAMk/FvWMmGRK0VU/s1600/Screen+Shot+2012-12-10+at+8.39.53+PM.png
Only 41 MB is used. I thought it should be utilized by cache; maybe I'm
mistaken.
Here is another stat from BigDesk:

https://lh5.googleusercontent.com/-U4NHfyO2K2E/UMaRiCUCf7I/AAAAAAAAAMs/UJ0SYk4en1w/s1600/Screen+Shot+2012-12-10+at+8.41.20+PM.png

And more related to search/indexing load

https://lh3.googleusercontent.com/-xLSqHAoBXL8/UMaRsuzm2EI/AAAAAAAAAM0/S8-nO1mGhQc/s1600/Screen+Shot+2012-12-10+at+8.41.44+PM.png

Looks like nothing is wrong,.. but the 100% HD utilization is still hitting
performance.

I'll follow the suggestions you provided, and will keep you posted.

Thank you

On Monday, December 10, 2012 8:33:58 PM UTC-5, Jörg Prante wrote:


--

Maybe the query pattern changed.
Maybe queries hit more different areas of your disk and your disk IO is not
working as well as before.

It seems you have a lot of stuff running on that server.... poor server.

Can you pause indexing? If so, you could try optimizing the index to see if
that makes any difference for reads. It wouldn't be a long-term fix; it would
simply be another data point to help you/us understand.
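
A sketch of what that would look like (placeholder index name `myindex`; the
endpoint was called `_optimize` in the 0.19/0.20 era):

```shell
# Merge the index down to a single segment during a quiet window.
# This is itself I/O-heavy, so run it while indexing is paused.
curl -XPOST 'http://localhost:9200/myindex/_optimize?max_num_segments=1'
```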

Btw. SPM for ES will show you some historical data, which can be handy in
cases like this, so you can look back and see when problems started, which
you can then correlate with your changes.

Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

On Monday, December 10, 2012 8:52:50 PM UTC-5, Eugene Strokin wrote:


--

Hi,

as Otis suggested, try stopping indexing and see if it helps. I have seen a
situation where a system was hit quite hard by constant mapping updates
(could this be your situation? though I'm not sure that would be related to
high IO). Also, did you check the ES log files?

Regards,
Lukáš

--

Judging from the graphs, I would suggest you also watch out for trouble
causes outside the scope of ES. From the data, my impression is that ES is
behaving well.

Jörg

--

I hope this is a hardware-related issue. I've moved the whole server to an
AWS m1.medium. It has a little less RAM, but a lot less CPU power.
CPU is constantly close to 100%, it is very slow, and sometimes unresponsive,
but still better than running on the original server.
The ES had very few mapping updates; in the whole history of the index, the
mapping was updated maybe 5-7 times.
I've made another experiment. While ES was running on the EC2 instance, I
upgraded ES to 0.20.1 and reindexed the whole data set. I pointed the
application back to the updated ES on the server, and it went down very fast.
I pointed it to EC2 again, and it is barely working, but working. I did that
because I noticed in the ES logs that it once threw an exception during
backup. Before each backup I stop spooling and flush, back up, and start
spooling again; and once I saw an exception message that a flush was already
running.
That was the night before the crisis. But since I've reindexed the whole data
set, starting from scratch, and still have the issue, I don't think this is
related. Maybe the server was already slow, and that caused the exception
somehow.
I've ordered a new server with more RAM, hoping to keep the whole index in
memory. If I'm right and this is related to some kind of hardware failure,
that would be it. But if not,.. I have no clue what else could be done. There
is a chance that I'm wrong, because I've already moved ES to another hard
drive with no success. It could be the controller, but I still keep my
fingers crossed.
Thank you for all your help

On Tuesday, December 11, 2012 4:42:42 PM UTC-5, Jörg Prante wrote:


--

Just to update you guys; maybe this information will help someone in the
future: I've moved my original system (ES 0.19.0) to a c1.xlarge EC2
instance, which is close to the server specs I had, and everything works just
fine. So this was a hardware issue for sure. It is hard to say what exactly
was wrong with the hardware.
EC2 is a temporary solution, since for the money I'd pay for instance hours +
traffic I can get a much bigger dedicated server with enough RAM to keep the
whole index in memory. This, I hope, will even improve performance.
Also, I'm starting work on optimizing queries. It looks like everything as-is
could work better if the queries were written better.
A few problems I noticed:
Very often, because of the nature of OOP, developers write functionality and
encapsulate it into DAOs or similar objects. So we reuse the functionality,
which is great in general, but not for performance. For example, I need to
get a document, perform some operation, get another document (sometimes not
even related to the first one), and so on. Each of these is a separate call
to ES. They could be combined into a multiget, a multi-search, a batch,
etc., but because many DAOs are already in place, it is just easier to reuse
them instead of creating a special operation for the specific task. That
would also raise questions about reusability, extra QA cycles, etc., but I
guess this is a trade-off for intensively used systems.
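As an illustration of the multiget point, two independent DAO-style gets can
be collapsed into a single `_mget` round trip (the index/type/id values here
are made up for the example):

```shell
# Two separate DAO-style lookups, one HTTP round trip each:
curl -XGET 'http://localhost:9200/products/product/1'
curl -XGET 'http://localhost:9200/users/user/7'

# The same two lookups packed into one multi-get request:
curl -XGET 'http://localhost:9200/_mget' -d '
{
  "docs": [
    { "_index": "products", "_type": "product", "_id": "1" },
    { "_index": "users",    "_type": "user",    "_id": "7" }
  ]
}'
```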
Another problem: we need to learn more about ES configuration optimization.
I'd ask anybody who understands the subject well to write some examples for
common use cases. For example, if queries use a lot of sorting, then adjust
this and that; but if some types of documents are mostly retrieved by ID
(get), then it is better to move them into a separate index, and so on...

Thank you for all your help and work,
Eugene

On Tuesday, December 11, 2012 10:48:27 PM UTC-5, Eugene Strokin wrote:


--

Hi Eugene,

to your observation about DAOs: it is true that, by design and by using
multitudes of search styles, the search workload often seems scattered. But
combining requests into multiget, multisearch or batch calls is only a way
to pack interactions into a single HTTP round trip, and has only minimal
impact on how Elasticsearch handles the load internally. As you know, HTTP
comes with overhead (the headers), the connection must be maintained with
keep-alive for good reconnect behavior and with compression, and creating
the serializations for requests and responses takes time. By using transport
protocols like Thrift, Protobuf, Avro, or WebSockets you can outperform HTTP
between client and server, but inside the ES cluster, ES always distributes
the load well over the nodes. Because Elasticsearch will scale well with
your increasing query load, it does not matter much whether your query
styles are perfectly organized. When in doubt, you can add some nodes, and
the overall load will decrease. Don't overengineer your application for a
few percent of better efficiency; the perfect query organization is hard to
obtain and takes your time and energy. JM2C.

Cheers,

Jörg

--

Jörg, you have a very good point about keeping things in order on the
software side and scaling hardware as needed; I'm totally in agreement. And
just to update everybody: I've finally moved the system to a new server
(more CPU, more RAM), and it is flying like a rocket. I'm not at the point
of scaling to extra nodes, but we are getting there, since our data farm
keeps growing. And I have very positive feelings about ES in this area.
Thanks again,
Eugene

On Wednesday, December 12, 2012 2:11:43 PM UTC-5, Jörg Prante wrote:


--