Elasticsearch bad indexing timing

Hey,
I am trying to migrate (copy) 35 million documents (a fairly standard
amount, not too big) from Couchbase to Elasticsearch.

My Elasticsearch cluster is composed of 3 A3 (4 cores, 7 GB memory) CentOS
servers on Microsoft Azure (each server is roughly equivalent to a large
server on Amazon).

I used "timing data flow" (time-based) indexing to store the documents. Each
index represents a month and is composed of 3 shards and 2 replicas.

When I start the migration script, I see that insertion becomes very slow
(about 10 documents per second) and the load average of each server in the
cluster jumps above 1.5.
In addition, JVM memory usage climbs to almost 100%, while CPU shows 20% and
IOPS shows 20 at most.
(I used Marvel CNC to collect all of this data.)

  1. Has anyone faced this kind of indexing problem in Elasticsearch?
  2. Are there any parameters I should be aware of for increasing the Java
    memory?
  3. Are my cluster specifications good enough to handle 100 indexing
    operations per second?
  4. Does the indexing time depend on how big the index is? And should it be
    that slow?

I also opened a thread on Stack Overflow, if anyone wants to follow it:
azure - Elasticsearch bad indexing time - Stack Overflow

Thanks,
Niv

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6aee6f5f-efcb-4e37-85ef-41af2ae89a39%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

A couple of suggestions:

  1. Disable replicas before inserting large amounts of data (set the replica
    count to 0), and re-enable them only afterwards.

  2. Use batching (the bulk API):
    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html
    The actual batch size depends on many factors (document sizes, network,
    instance strength).

  3. Follow ES's advice on node setup, e.g. allocate 50% of the available
    memory to the ES Java heap, don't run anything else on that machine, and
    disable swapping.

  4. Your index is already sharded; try spreading the shards out across 3
    different servers instead of having them all on one server ("virtual
    shards"). This will help fan out the indexing load.

  5. If you don't specify the document IDs yourself, make sure you use the
    latest ES; there's a significant improvement in the ID generation
    mechanism there, which could help speed things up.
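
Suggestions 1 and 2 could look roughly like the sketch below. It only builds
the request bodies with the Python standard library and leaves the HTTP calls
to whatever client the migration script already uses. The index name
`events-2014-09`, the `(doc_id, source)` pair layout, and the helper names are
assumptions made for illustration, not something taken from the setup
described above. Older ES versions also expect a document type, so `doc_type`
is optional here.

```python
import json
from itertools import islice


def replica_settings(count):
    """Body for PUT /<index>/_settings that sets the replica count."""
    return json.dumps({"index": {"number_of_replicas": count}})


def chunked(items, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(index, batch, doc_type=None):
    """Build the newline-delimited body for POST /_bulk.

    `batch` is a list of (doc_id, source) pairs; doc_id may be None to
    let ES generate the ID. Each document contributes an action line and
    a source line, and the body must end with a newline.
    """
    lines = []
    for doc_id, source in batch:
        meta = {"_index": index}
        if doc_type is not None:  # assumption: only needed on older ES
            meta["_type"] = doc_type
        if doc_id is not None:
            meta["_id"] = doc_id
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical documents standing in for the Couchbase export.
    docs = [(f"doc-{i}", {"n": i}) for i in range(5)]
    print(replica_settings(0))  # send before the import starts
    for batch in chunked(docs, 2):
        print(bulk_body("events-2014-09", batch), end="")
    print(replica_settings(2))  # send once the import is finished
```

The batch size of 2 is only there to keep the output short; a real import
would start with a few hundred or a few thousand documents per request and
adjust from there based on the measured throughput.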

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

(In reply to Niv Penso's original message of Sun, Sep 14, 2014 at 11:38 AM.)

Amazing answer, it helped me so much!!!
The load average dropped back to a normal level and the indexing rate
increased to 85 documents per second.
Thanks,
Niv

(In reply to Itamar Syn-Hershko's message of Sunday, September 14, 2014, 1:29:42 PM UTC+3.)

Sure thing

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

(In reply to Niv Penso's message of Sun, Sep 14, 2014 at 7:19 PM.)