Bulk indexing via ES HTTP


(Luca Belluccini) #1

Hello,
I am setting up an ES cluster with 4 nodes (6 cores + 48 GB RAM each).
The aim is to use Kibana as a data analysis tool.
I set up Logstash to properly feed ES and use the following:

  • https://gist.github.com/lucabelluccini/7563998 for index templates
  • Some tweaks to elasticsearch.yml:
    • indices.memory.index_buffer_size: 50%
    • index.translog.flush_threshold_ops: 50000
    • index.number_of_shards: 3
    • threadpool.search.type: fixed
    • threadpool.search.size: 20
    • threadpool.search.queue_size: 100
    • threadpool.index.type: fixed
    • threadpool.index.size: 60
    • threadpool.index.queue_size: 200
    • node.master: true
    • node.data: true
    • ES_HEAP_SIZE=30g

Logstash is sending to one of the hosts, and I wanted to ask whether indexing
is automatically distributed over all the nodes, or whether I have to set up
something to exploit the processing power of all 4 nodes.

Thanks in advance,
Luca B.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/15f9d547-0d78-48bb-bb33-c18d88e78687%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #2

Hey,

It might make sense to give only half of your memory to the ES process; the
rest will be used by the filesystem cache, which speeds up index operations.
You may also want to enable bootstrap.mlockall and verify it is working. I
assume you are using daily or weekly indices, so it does not matter too much
that each index has fewer shards than you have nodes. Three shards per index
means that those three primary shards are distributed across your cluster,
which implies that your indexing is also distributed.
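As a sketch, the settings above might look like this (the 24g value is an assumption based on your 48 GB nodes; ES_HEAP_SIZE is set in the environment or init script, not in elasticsearch.yml):

```yaml
# elasticsearch.yml (sketch)
bootstrap.mlockall: true    # lock the heap in RAM so it is never swapped out

# environment / init script, NOT elasticsearch.yml:
# ES_HEAP_SIZE=24g          # roughly half of RAM; the other half is left
#                           # to the OS filesystem cache
```

After startup, check that mlockall actually took effect (it silently fails without the right ulimit) via the nodes info API.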

I'm wondering about your threadpool changes and the translog settings. Any
particular reason for them? Did you run into a problem while testing?

--Alex



(Jörg Prante) #3

You should consider installing the latest ES (currently 1.0.0.RC1) and the
latest JVM, if possible.

With 4 nodes, you should consider 4 shards per index by default, so resources
are balanced across every index.

If you meant to tune the bulk indexing thread pool, you did not actually
change it: those settings live under threadpool.bulk, not threadpool.index
(I don't know whether your Logstash uses the bulk or the index API).
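For reference, a bulk request body is newline-delimited JSON: one action/metadata line followed by one source line per document. A minimal sketch of building such a body (the index name and documents below are made up):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON body for Elasticsearch's _bulk endpoint:
    one action/metadata line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline

body = build_bulk_body("logstash-2014.01.28", "logs",
                       [{"message": "hello"}, {"message": "world"}])
```

POSTing this body to /_bulk on any node works: that node forwards each document to whichever shard owns it, so requests to these two thread pools behave quite differently under load.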

indices.memory.index_buffer_size is adjusted automatically; there is no need
to set it to 50%.

The same goes for index.translog.flush_threshold_ops; I wonder why you adjusted that value.

By moving the search pool size away from its CPU-based default, you reduce how
automatically search scales in your cluster, which is bad. Using 20 instead of
18 (3*6 cores is the default) makes little difference per se, but reducing the
queue size from 1000 to 100 will make your searches get rejected early and
often.

Your heap size is very large (30g), so be prepared to take additional measures
to tackle GC challenges.

You should also think about dedicated master nodes if you plan to run large
heaps with high expected GC pressure on the data nodes.
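A dedicated master is configured by flipping the node roles, for example (sketch):

```yaml
# elasticsearch.yml on a dedicated master node (sketch)
node.master: true
node.data: false    # holds no shards, so it can run a small heap with stable GC

# elasticsearch.yml on a data-only node
# node.master: false
# node.data: true
```

That way a long GC pause on a busy data node cannot stall cluster-state management.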

The indexing load is automatically distributed; you don't need to handle that
in Logstash. But you should consider setting up Logstash so that it can index
to more than one node, for extra resiliency.
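For example, recent versions of the Logstash elasticsearch output accept a list of hosts to balance across; the exact option names vary across Logstash versions, and the host names below are made up:

```
output {
  elasticsearch {
    # Requests are distributed across the listed nodes, and a node that
    # goes down does not stop the pipeline.
    hosts => ["es-node1:9200", "es-node2:9200", "es-node3:9200", "es-node4:9200"]
  }
}
```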

Jörg



(system) #4