Setup:
4 nodes
Replication = 0
ES_HEAP_SIZE = 75GB
Number of Indices = 59 (using logstash, one index per month)
Total shards = 234 (each index has 4 shards, one per node)
Total docs = 7.4 billion
Total size = 4.7TB
When I add a new file, which I do using logstash on all four nodes, the
indexing immediately throttles. For instance:
Where should I be looking to tune the indexing performance? The query
load on the cluster is very low, as it is a research cluster, so I would
happily sacrifice query performance for indexing.
The 4 nodes all run logstash, listening on various ports. I use netcat to
'feed' the data to the 4 nodes from a Hadoop cluster.
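To give an idea of the plumbing, the Logstash config on each node is roughly along these lines (the port, codec and index pattern here are illustrative, not the exact config):

  input {
    tcp {
      port  => 5000            # netcat pushes lines to this port
      codec => json_lines      # assuming one JSON document per line from the Hadoop export
    }
  }
  output {
    elasticsearch {
      host     => "localhost"            # each node indexes into its local ES instance
      protocol => "http"
      index    => "logstash-%{+YYYY.MM}" # one index per month
    }
  }

and the feed from the Hadoop side is basically just:

  cat part-00000 | nc es-node1 5000    # file name and host are placeholders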
Each ES node has 24 disks but I am only using one at the moment. This is an
obvious IO bottleneck, but I am unclear how to use all the disks. If I add more
disks, will ES share the data between them all? e.g. /mnt/disk1, /mnt/disk2, etc.
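In other words, I am guessing the answer is to list all the mounts in elasticsearch.yml, something like the sketch below, but I don't know what ES does with the existing data when paths are added (mount points are illustrative):

  # elasticsearch.yml -- one data path per disk
  path.data: /mnt/disk1,/mnt/disk2,/mnt/disk3    # ...and so on for the remaining mounts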
Try disabling merge IO throttling, especially if your index is on SSDs.
(It's on by default at a paltry 20 MB/sec.) Merge IO throttling causes
merges to run slowly, which eventually causes them to back up to the
point where indexing must be throttled...
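I believe on 1.x this is a dynamic cluster setting, so something along these lines should do it (point it at any node in the cluster):

  curl -XPUT 'localhost:9200/_cluster/settings' -d '{
    "transient": {
      "indices.store.throttle.type": "none"
    }
  }'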
Also see the recent post about tuning to favor indexing throughput:
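One of the usual knobs in that vein (not necessarily what the post covers) is relaxing the refresh interval on the indices being written to, e.g. per index (the index name here is just an example):

  curl -XPUT 'localhost:9200/logstash-2014.09/_settings' -d '{
    "index": {
      "refresh_interval": "30s"
    }
  }'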
Good point on heap, so I will bring that back down to 30GB
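(For reference, I assume that just means changing the heap line in /etc/sysconfig/elasticsearch or /etc/default/elasticsearch, depending on how the package was installed, and restarting each node:)

  ES_HEAP_SIZE=30g    # at or below ~30GB so compressed oops stay enabled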
Versions:
ES 1.3.2-1
java 1.7.0_67
I definitely want to start using all 12 disks, rather than the 1 at the
moment! If I add paths for the other 11 disks and restart, will ES do any
'rebalancing'? If it won't, is there any way to spread the existing data across
all 12 disks? I really don't want to re-index everything!
Thanks
On Thursday, September 18, 2014 10:03:18 AM UTC+1, Mark Walkom wrote:
Also, given you're over a 32GB heap, your Java pointers aren't going to be
compressed, which means GC will suffer.
You haven't mentioned what ES and Java versions you are using, which would
be useful.
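If you want to confirm whether compressed oops are actually in use at a given heap size, something along these lines should show it on a HotSpot JVM:

  java -Xmx75g -XX:+PrintFlagsFinal -version | grep UseCompressedOops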
I have now enabled all 12 disks per machine, so going forward I will get
some "sharing" across all disks. I'm not sure how it will allocate new data
across the disks, though.
If I move a shard from one node to another with the new 12-disk paths, will
the receiving node "share" the data across the disks? That way I could move
all shards and get a redistribution of existing data?
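To be concrete, I was thinking of the cluster reroute API for the moves, roughly like this (the index, shard number and node names are placeholders):

  curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands": [
      { "move": { "index": "logstash-2014.08", "shard": 0,
                  "from_node": "node1", "to_node": "node2" } }
    ]
  }'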