Bulk indexing creates a lot of disk read OPS

Hello,

I've created an index I use for logging.

This means there are mostly writes, and some searches once in a while.
During the initial load phase, I'm using several clients to concurrently
index documents using the bulk API.

At first, indexing takes 200 ms for a bulk of 5000 documents.
As time goes by, the indexing time increases and reaches 1000-4500 ms.

I am using an EC2 c3.8xlarge machine with 32 cores and 60 GB of memory, with
a provisioned IOPS volume set to 7000 IOPS.

Looking at the metrics, I see that the CPU and memory are fine and the write
IOPS are at 300, but the read IOPS have slowly climbed to 7000.

How come I'm only indexing, but most of the IOPS are read?

I am attaching some screen captures from the BigDesk plugin that show the
two states of the index. About 20% of the way into the graphs is the point in
time where I stopped the clients, so you can see the load drop off.

My settings are:

threadpool.bulk.type: fixed
threadpool.bulk.size: 32 # availableProcessors
threadpool.bulk.queue_size: 1000

Indices settings

indices.memory.index_buffer_size: 50%

indices.cache.filter.expire: 6h

bootstrap.mlockall: true

and I've changed the index settings to:

{"index":{"refresh_interval":"60m","translog":{"flush_threshold_size":"1gb","flush_threshold_ops":"50000"}}}
I also tried "refresh_interval":"-1"
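For reference, that one-line settings update can be built and inspected as a pretty-printed payload; this is only a sketch that constructs the JSON body (the index name `logs` and the `/logs/_settings` path are placeholders, not taken from the thread):

```python
import json

# The index settings update from above, as a Python dict. "logs" is a
# placeholder index name; a PUT of this body to /logs/_settings on a
# running node would apply it (not executed here).
settings = {
    "index": {
        "refresh_interval": "60m",  # or "-1" to disable periodic refresh
        "translog": {
            "flush_threshold_size": "1gb",
            "flush_threshold_ops": "50000",
        },
    },
}

body = json.dumps(settings, indent=2)
print(body)
```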

Please let me know what else I need to provide (settings, logs, metrics).

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ba2a238b-aade-4dcc-be96-12675b488d80%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

attachments hereby


Forgot some stats:

I have 10 shards, no replicas, all on the same machine.
ATM, there are some 1.5 billion records in the index.


Could merging segments be the cause here?

David


Hey David,

I suspect it indeed might be the cause, but I'm kind of a newbie here.
What metric do I need to monitor, what would be a problematic value, and,
basically, how can I play with merge settings to test whether I can improve this?
Some rules of thumb for a newbie would be appreciated.

I installed the plugin SegmentSpy, and here is a screenshot, if that helps.

Eran


The merging graph you shared looks normal to me.

We had ES with 10 shards too, and I monitor the segments using
SegmentSpy; the segment graph in your attachment looks pretty much the
same as ours.

jason


That’s normal. I was just answering that even if you think you are only writing data while indexing, you are also reading data behind the scenes to merge Lucene segments.
You can potentially try to play with index.translog.flush_threshold_size:

http://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html

And increase the transaction log size?

It might help reduce the number of segments generated, but that said, you will always have READ operations.
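The read traffic that merging generates can be illustrated with a toy model; the numbers and the merge rule below are made up for illustration (Lucene's actual tiered merge policy is more involved), but the point holds: every merge must re-read its input segments in full, so a write-only workload still produces reads.

```python
# Toy model: each flush writes one segment; whenever `merge_factor`
# segments exist, they are merged into one, which *reads* every input
# segment in full. Illustrative only -- not Lucene's real merge policy.

def simulate(flush_mb=64, flushes=100, merge_factor=10):
    segments = []        # live segment sizes, in MB
    ingested_mb = 0      # data handed to the index by clients
    read_mb = 0          # bytes re-read from disk by merges
    for _ in range(flushes):
        segments.append(flush_mb)
        ingested_mb += flush_mb
        while len(segments) >= merge_factor:
            segments.sort()
            batch, segments = segments[:merge_factor], segments[merge_factor:]
            read_mb += sum(batch)        # the merge re-reads all inputs
            segments.append(sum(batch))  # and writes one merged segment
    return ingested_mb, read_mb

ingested, read = simulate()
print(ingested, read)  # merge reads exceed the ingested data several times over
```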

Actually, is it an issue for you? If not, keeping all the default values might be good.

Best

--
David Pilato - Developer | Evangelist
elastic.co
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs


It is an issue, as I am hitting 7000 read operations per second (the limit
of my volume's IOPS).

As the index grows larger the problem worsens; where I was once able to
index with 10 clients concurrently, now I can barely use one client.

Also, I used the _optimize endpoint to have all segments merged, and even
then the read operations spike immediately on the first indexing operation
(I'm using BigDesk to follow this). So I do not think it is a merge effect,
as my intuition is that a merge happens only every once in a while.
Maybe this is actually a result of me not using "doc values"? Could that be
it?


Hi Eran,

Which version of Elasticsearch are you using?

Are you assigning your own document IDs or letting Elasticsearch assign
them automatically?

Best regards,

Christian


I'm using the newest version, 1.5.1.
I'm assigning my own IDs using a path in the mapping:

"_id": {
    "path": "msg_id"
},

msg_id is a self-generated, hashed identifier (it's actually somewhat like
a cookie ID).


Hi Eran,

If you are assigning your own IDs, Elasticsearch needs to check whether the
document already exists before writing it. This could explain why bulk
insert performance goes down as the size of the index grows. If you are not
going to update the documents, I would therefore recommend letting
Elasticsearch assign the document ID automatically.
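For comparison, the two styles look like this in a bulk request body (the index name, type name, and documents below are made up; omitting `_id` from the action line is what lets Elasticsearch auto-generate one):

```python
import json

docs = [{"msg_id": "a1b2", "msg": "hello"},
        {"msg_id": "c3d4", "msg": "world"}]

def bulk_body(docs, use_own_ids):
    # Bulk bodies are newline-delimited JSON: one action line per document,
    # followed by one source line.
    lines = []
    for doc in docs:
        action = {"index": {"_index": "logs", "_type": "event"}}
        if use_own_ids:
            action["index"]["_id"] = doc["msg_id"]  # forces an existence check
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

with_ids = bulk_body(docs, use_own_ids=True)   # action lines carry "_id"
auto_ids = bulk_body(docs, use_own_ids=False)  # Elasticsearch assigns the IDs
```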

Best regards,

Christian


Wow, awesome. I'll try that, thanks!
