ES slowness and calls taking a long time to return


(T Vinod Gupta) #1

Hi,
I have a single-node ES (version 0.18.7) setup. Unfortunately, I didn't
change the default config much, so it has 5 shards, and now I have quite a
bit of production data stored on it (12GB). What we are seeing is reduced
throughput over time, and search times sometimes as high as a few minutes. I'm
looking for some help on how to bring the situation under control, as we are
constantly indexing data and also serving realtime customer requests.

Questions:

  1. Is it possible to reduce the number of shards from 5 to 2 somehow? Does
    that work once the system is already in place?

  2. I read somewhere that it could be due to thread pool pressure, but node
    stats ( curl -XGET
    'http://localhost:9200/_cluster/nodes/stats?pretty=true' ) is not giving
    thread pool information. How do I go about identifying the root cause?

  3. My throughput is around 300-400 index calls per second. How do I make it
    higher?

  4. If I were to optimize so that my gets and search calls are faster, is
    that possible? It can be at the expense of slower index calls.

This is on a dual-core machine (an EC2 m1.large instance), and I gave ES 4GB
of RAM. Has any benchmarking been done on EC2 instances?

Let me know if any further info is needed.

Thanks


(Radu Gheorghe) #2

Hi,

On Saturday, July 7, 2012 3:05:03 AM UTC+3, T Vinod Gupta wrote:

Hi,
I have a single-node ES (version 0.18.7) setup. Unfortunately, I didn't
change the default config much, so it has 5 shards, and now I have quite a
bit of production data stored on it (12GB). What we are seeing is reduced
throughput over time, and search times sometimes as high as a few minutes. I'm
looking for some help on how to bring the situation under control, as we are
constantly indexing data and also serving realtime customer requests.

Questions:

  1. Is it possible to reduce the number of shards from 5 to 2 somehow? Does
    that work once the system is already in place?

  2. I read somewhere that it could be due to thread pool pressure, but node
    stats ( curl -XGET
    'http://localhost:9200/_cluster/nodes/stats?pretty=true' ) is not giving
    thread pool information. How do I go about identifying the root cause?

I would start by looking at BigDesk and in the logs.
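As a starting point, the node stats can be pulled with curl; a minimal sketch, assuming a single local node on the default port 9200 (the endpoint shown is the 0.18-era one; newer versions moved it to /_nodes/stats):

```shell
# Fetch per-node stats (JVM heap, cache sizes, indexing/search counters, etc.)
# from a local Elasticsearch 0.18.x node and pretty-print the JSON.
curl -XGET 'http://localhost:9200/_cluster/nodes/stats?pretty=true'
```

Watching how the JVM heap and cache numbers move over time, alongside BigDesk's charts, usually narrows down whether the pressure is memory, disk, or CPU.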

  3. My throughput is around 300-400 index calls per second. How do I make it
    higher?

It depends a lot on what your data looks like. But increasing the
refresh_interval should generally help.
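Raising the refresh interval can be done on a live index through the index settings API; a sketch, where `myindex` is a placeholder index name:

```shell
# Refresh the index less often (default is 1s) so Lucene spends less time
# reopening searchers while you are indexing heavily. "myindex" is a
# placeholder; substitute your own index name.
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '
{
  "index": { "refresh_interval": "60s" }
}'
```

The trade-off is that newly indexed documents take up to that long to become visible to searches.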

  4. If I were to optimize so that my gets and search calls are faster, is
    that possible? It can be at the expense of slower index calls.

What do your data and searches look like?

If your storage is slow, you might benefit from compressing your
_source. I would also try upgrading ES to a newer version. I find it faster,
although I don't have a clear benchmark to show that. Please note that
upgrading needs some care. Quote:
Upgrade Notes:

  • Upgrading from 0.18 requires issuing a full flush of all the indices
    in the cluster (curl host:9200/_flush) before shutting down the cluster,
    with no indexing operations happening after the flush.
  • The local gateway state structure has changed, a backup of the state
    files is created when upgrading, they can then be used to downgrade back to
    0.18. Don’t downgrade without using them.
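The pre-upgrade flush mentioned in the notes above is a single call; a sketch of the sequence, assuming a local node:

```shell
# Stop all indexing first. Then flush every index so the transaction logs
# are committed to Lucene segments before shutting the node down for the
# upgrade; no indexing may happen after this point.
curl -XPOST 'http://localhost:9200/_flush'
```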



(T Vinod Gupta) #3

Thanks Radu.
I increased the refresh interval to 60s, but that didn't help. I see a bunch
of error messages in the elasticsearch.log file that look like the one below.
Could that be the reason for slow search? Now I see slowness even when there
is not much indexing happening. These messages occur two or three times a
minute.

[2012-07-09 00:00:55,238][WARN ][index.merge.scheduler ] [] [facebook][3] failed to merge
java.io.IOException: Input/output error: NIOFSIndexInput(path="/media/ephemeral0/ES_data/elasticsearch/nodes/0/indices/facebook/3/index/_qicw.fdt")
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:180)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:155)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:110)
    at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:123)
    at org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:216)
    at org.apache.lucene.index.SegmentMerger.copyFieldsWithDeletions(SegmentMerger.java:301)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:248)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:108)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4295)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3940)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:88)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)
Caused by: java.io.IOException: Input/output error
    at sun.nio.ch.FileDispatcher.pread0(Native Method)
    at sun.nio.ch.FileDispatcher.pread(FileDispatcher.java:49)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:248)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:663)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:162)
    ... 12 more

Also, I was not able to install BigDesk, due to the error below.

sudo bin/plugin -install lukas-vlcek/bigdesk/1.0.0
-> Installing lukas-vlcek/bigdesk/1.0.0...
Trying https://github.com/downloads/lukas-vlcek/bigdesk/bigdesk-1.0.0.zip...
Trying https://github.com/lukas-vlcek/bigdesk/zipball/v1.0.0...
Failed to install lukas-vlcek/bigdesk/1.0.0, reason: failed to download

I would really appreciate any help I can get here.

Thanks



(Radu Gheorghe) #4

On Monday, July 9, 2012 5:19:06 AM UTC+3, T Vinod Gupta wrote:

Thanks Radu.
I increased the refresh interval to 60s, but that didn't help. I see a bunch
of error messages in the elasticsearch.log file that look like the one below.
Could that be the reason for slow search? Now I see slowness even when there
is not much indexing happening. These messages occur two or three times a
minute.

I don't know what that error means beyond what the text says (a read
error), and I don't know exactly how it would impact performance. It must
have some impact; I just don't know how significant it is.

[2012-07-09 00:00:55,238][WARN ][index.merge.scheduler ] [] [facebook][3] failed to merge
java.io.IOException: Input/output error: NIOFSIndexInput(path="/media/ephemeral0/ES_data/elasticsearch/nodes/0/indices/facebook/3/index/_qicw.fdt")
[...]

Also, I was not able to install BigDesk, due to the error below.

sudo bin/plugin -install lukas-vlcek/bigdesk/1.0.0
-> Installing lukas-vlcek/bigdesk/1.0.0...
Trying https://github.com/downloads/lukas-vlcek/bigdesk/bigdesk-1.0.0.zip...
Trying https://github.com/lukas-vlcek/bigdesk/zipball/v1.0.0...
Failed to install lukas-vlcek/bigdesk/1.0.0, reason: failed to download

Do you use a proxy or something?

Anyway, I think you can just download it from here:
https://github.com/lukas-vlcek/bigdesk/zipball/0.18.x

then extract it and open index.html from the lukas-vlcek... directory.
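The manual install described above can be done in a couple of commands; a sketch, using the zipball URL given above (the extracted directory name will vary with the commit hash):

```shell
# Download the 0.18.x-compatible BigDesk branch directly from GitHub,
# bypassing the failing plugin installer, then unpack it.
wget -O bigdesk.zip https://github.com/lukas-vlcek/bigdesk/zipball/0.18.x
unzip bigdesk.zip -d bigdesk
# Then open bigdesk/lukas-vlcek-bigdesk-*/index.html in a browser and
# point it at http://localhost:9200.
```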



(Jörg Prante) #5

Hi,

Errors like the one below originate outside of ES, but they should be taken
as a serious hint that the system cannot read or write via NIO because of
disk errors, file-system errors, running short on resources, etc. So I'd
watch for system-level messages (syslog, disk damage, disk full, and so on).
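A few quick checks along those lines (a sketch; the data volume path shown in the log was /media/ephemeral0, so that is the mount to inspect on this box):

```shell
# Look for kernel-level I/O errors that would explain the EIO Lucene hit.
# grep exits non-zero when nothing matches, so tolerate an empty result.
dmesg | grep -iE 'i/o error|sector' | tail -n 20 || true

# Check free space and inode usage; a full disk or full inode table on the
# data volume will also surface as I/O failures during segment merges.
df -h
df -i
```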

Best regards,

Jörg

Quote:
[...]
Caused by: java.io.IOException: Input/output error
    at sun.nio.ch.FileDispatcher.pread0(Native Method)
    at sun.nio.ch.FileDispatcher.pread(FileDispatcher.java:49)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:248)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:663)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:162)

