I found that I was hitting a wall: no matter how much I increased my
thread count in JMeter, the throughput remained constant. I was wondering
if this is because the elasticsearch node can only process a certain
number of requests concurrently. I found that tweaking the threadpool
settings in the elasticsearch.yml file also made no difference.
At some point you'll end up with contention at the shard level (blocked
threads waiting to write).
If you want to index faster, you need to parallelize it. Add more shards,
either on the existing nodes or by adding new nodes. For a frame of
reference, we ran some benchmarks on a 24 core machine. With 1 shard, we
got an average throughput of ~1200 docs/sec. With 20 shards on the same
machine we got ~22K docs/sec (near linear improvement).
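For illustration, here is a minimal sketch of that first step against the REST API, in Python with the requests library; the index name, shard count and endpoint are placeholders, not values from the benchmark above:

    import json
    import requests

    ES = "http://localhost:9200"  # assumed local node

    # Create the index with an explicit shard count. Each shard is its own
    # Lucene index (its own IndexWriter), so more shards means more writers
    # that can index in parallel.
    settings = {"settings": {"number_of_shards": 20, "number_of_replicas": 0}}
    resp = requests.put(ES + "/myindex",
                        data=json.dumps(settings),
                        headers={"Content-Type": "application/json"})
    print(resp.status_code, resp.text)

Keep in mind that the shard count is fixed when the index is created, so raising it later generally means reindexing into a new index.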
Also, if possible, turn off replication when you're doing heavy indexing.
Elasticsearch is not "eventually consistent" by default; index requests
don't return until the replicas have also indexed the document. If you
can't disable replication (maybe your feeds are real-time), then consider
using asynchronous replication.
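A sketch of the replica toggle, under the same assumptions (Python + requests, placeholder index name): drop the replica count to 0 for the heavy load, then restore it when you're done:

    import json
    import requests

    ES = "http://localhost:9200"
    HDRS = {"Content-Type": "application/json"}

    def set_replicas(index, count):
        # number_of_replicas is a dynamic setting, so it can be changed on a live index
        body = {"index": {"number_of_replicas": count}}
        return requests.put("%s/%s/_settings" % (ES, index),
                            data=json.dumps(body), headers=HDRS)

    set_replicas("myindex", 0)   # no replicas while bulk loading
    # ... run the heavy indexing here ...
    set_replicas("myindex", 1)   # re-enable replication afterwards

The asynchronous replication mentioned above was a per-request option (a replication=async parameter) in the Elasticsearch versions current at the time of this thread; it was removed in later releases, so check the documentation for your version.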
Most folks only consider "scaling out" (i.e., adding more physical nodes),
but elasticsearch scales up just the same by increasing the shard
allocation per node. Of course, it helps to have a core/CPU for each shard.
That's interesting - I thought that 1 shard would be able to use all cores
on the machine. Although I seem to remember hearing that with indexing
there is less parallelism, and I also seem to remember that in Lucene 4
indexing happens more in parallel than in 3.x.
Correct. Simon W. is one of those bad people responsible for faster
indexing. He even gave a few presentations describing his work. Let me
find one... eh, couldn't quickly find it, but I found something with a
pretty graph that's worth at least a few dozen words and a couple of
slides: Changing Bits: 265% indexing speedup with Lucene's concurrent flushing
I'm guessing this has to do with serialization that happens below the
IndexWriter. Going from 1 shard to 20 shards means going from 1
IndexWriter to 20 IWs. As long as there is sufficient I/O and enough free
CPU cycles, this has to improve the throughput.
Increasing shard count per node scales well as long as:
- the I/O subsystem is not saturated
- the CPU can manage the load created by the increasing number of threads
The slowest part of the machine is the disk subsystem (whether it's SSD or
spinning disk makes no big difference on that point), which means tuning
disk I/O will have the highest impact on performance. Not only Lucene but
also the translog I/O adds to the disk load when using the local gateway.
The most interesting numbers are:
- the sustained total amount of data that can be indexed per time interval
(MB per sec, not only docs per sec), because this determines the I/O
throughput (sustained meaning not just a peak for a few seconds, but the
average observed over a significant period in which all nodes' resources
are fully utilized); a rough way to measure this is sketched below
- the load of the machine while the node is indexing (not just CPU cycles),
because this is significant for the overall workload throughput of the
system
- the same while the node is being searched (or even while indexing and
being searched at the same time)
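As a rough sketch of that first measurement (Python + requests against the bulk API; the index name, batch size and headers are illustrative), track the bytes you actually send and average over the whole run:

    import json
    import time
    import requests

    ES = "http://localhost:9200"
    HDRS = {"Content-Type": "application/x-ndjson"}

    def bulk_index(docs, index="myindex", batch=1000):
        total_bytes = 0
        start = time.time()
        for i in range(0, len(docs), batch):
            lines = []
            for doc in docs[i:i + batch]:
                # 0.90-era clusters also expect a "_type" in this action line
                lines.append(json.dumps({"index": {"_index": index}}))
                lines.append(json.dumps(doc))
            payload = "\n".join(lines) + "\n"
            total_bytes += len(payload.encode("utf-8"))
            requests.post(ES + "/_bulk", data=payload, headers=HDRS)
        elapsed = time.time() - start
        # Average over the whole run, i.e. the sustained rate, not a short peak
        print("%.1f MB/sec, %.0f docs/sec" % (total_bytes / elapsed / 1e6,
                                              len(docs) / elapsed))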
On big machines, like HPC servers, these numbers scale much better than on
desktop machines.
To get shard numbers that can be compared, performance should be measured
against data volumes that utilize the machine to its fullest extent. That
is, if RAM is around 16 GB and the heap is 8 GB, indexing 4 GB of data may
not exploit all of the machine's resources.
So it's really a hard task to make statements like "ES can manage 20+ or
100+ shards per node". It always depends. There is always a compromise.
Yes, it's a function of the hardware you're working with.
In our case, we had SSDs with 10GigE and 128GB of RAM with 24 cores. Even
with 20 shards (500GB total index size) the CPU load was under 50% and the
I/O and network subsystems were not saturated.
But in general, as Otis mentioned, more IndexWriters means more
parallelism. If your indexing throughput isn't going up and none of your
resources are bound, then consider adding more shards. The concurrent
flushing in Lucene 4 should reduce contention on the IndexWriter.
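A client-side way to sanity-check that (a sketch only; the thread count, batch size, index name and dummy documents are made up): send bulk requests from several threads and see whether the aggregate rate keeps climbing as you add threads. If it flattens while the cluster's CPU, disk and network are still idle, more shards are the next lever:

    import json
    import threading
    import time
    import requests

    ES = "http://localhost:9200"
    HDRS = {"Content-Type": "application/x-ndjson"}
    THREADS = 8              # vary this and compare the aggregate rate
    BATCHES_PER_THREAD = 50
    BATCH_SIZE = 500

    def worker(counts):
        for _ in range(BATCHES_PER_THREAD):
            lines = []
            for _ in range(BATCH_SIZE):
                lines.append(json.dumps({"index": {"_index": "myindex"}}))
                lines.append(json.dumps({"field": "some test payload"}))
            requests.post(ES + "/_bulk", data="\n".join(lines) + "\n", headers=HDRS)
            counts.append(BATCH_SIZE)   # list.append is atomic enough for a counter here

    counts = []
    start = time.time()
    threads = [threading.Thread(target=worker, args=(counts,)) for _ in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("%.0f docs/sec with %d client threads"
          % (sum(counts) / (time.time() - start), THREADS))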