Understanding Threadpools

I found that I was hitting a wall: no matter how much I increased my
thread count in JMeter, the throughput remained constant. I was wondering
if this is because the elasticsearch node can only process a certain
number of requests concurrently. I also found that tweaking the threadpool
settings in the elasticsearch.yml file made no difference:

threadpool:
  index:
    type: fixed
    size: 100
    queue_size: 10000
    reject_policy: caller

Can someone give me a few pointers here?

--

At some point you'll end up with contention at the shard level (blocked
threads waiting to write).

If you want to index faster, you need to parallelize it. Add more shards,
either on the existing nodes or by adding new nodes. For a frame of
reference, we ran some benchmarks on a 24 core machine. With 1 shard, we
got an average throughput of ~1200 docs/sec. With 20 shards on the same
machine we got ~22K docs/sec (near linear improvement).
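
For example, here is a minimal sketch of creating an index with more primary
shards up front (host, index name, and counts are placeholders; newer
Elasticsearch versions also require a Content-Type: application/json header):

    # create an index with 20 primary shards and 1 replica each
    curl -XPUT 'http://localhost:9200/myindex' -d '{
      "settings": {
        "number_of_shards": 20,
        "number_of_replicas": 1
      }
    }'

Note that the number of primary shards is fixed at index creation time, so it
pays to decide on the level of parallelism before loading data.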

Also, if possible, turn off replication when you're doing heavy indexing.
Elasticsearch is not "eventually consistent" by default; index requests
don't return until the replicas have also indexed the document. If you
can't disable replication (maybe your feeds are real-time), then consider
using asynchronous replication.
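
As a rough sketch of both options (host, index name, type, and document are
placeholders; the replication URL parameter applies to the Elasticsearch
versions discussed in this thread and was removed in later releases):

    # drop replicas for the duration of a bulk load...
    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
      "index": { "number_of_replicas": 0 }
    }'

    # ...and restore them once the load is finished
    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
      "index": { "number_of_replicas": 1 }
    }'

    # or keep replicas but let them index asynchronously, per request
    curl -XPUT 'http://localhost:9200/myindex/doc/1?replication=async' -d '{
      "field": "value"
    }'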

Most folks only consider "scaling out" (i.e., adding more physical nodes),
but elasticsearch scales up just the same by increasing the shard allocation
per node. Of course, it helps to have a CPU core for each shard.

On Friday, January 18, 2013 6:03:20 PM UTC-5, avins...@gmail.com wrote:

--

that's interesting - I thought that 1 shard would be able to use all the
cores on the machine. Although I seem to remember hearing that with
indexing there is less parallelism, and also that in Lucene 4, indexing
happens more in parallel than in 3.*.

Can anybody confirm?

clint

--

Hi Clint,

Correct. Simon W. is one of those bad people responsible for faster
indexing. He even gave a few presentations describing his work. Let me
find one... eh, couldn't quickly find it, but I found something with a
pretty graph that's worth at least a few dozen words and a couple of
slides: Changing Bits: 265% indexing speedup with Lucene's concurrent flushing

Otis

ELASTICSEARCH Performance Monitoring - Sematext Monitoring

On Saturday, January 19, 2013 5:05:10 AM UTC-5, Clinton Gormley wrote:

--

Hi,

I'm guessing this has to do with the serialization that goes on down in
IndexWriter and below. Going from 1 shard to 20 shards means going from 1
IndexWriter to 20 IndexWriters. As long as there is sufficient I/O and
enough free CPU cycles, that has to improve the throughput.
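
Each shard is its own Lucene index with its own IndexWriter, and you can see
that reflected in the segments API (a minimal sketch; host and index name are
placeholders):

    # one set of Lucene segments per shard, i.e. one IndexWriter per shard
    curl -XGET 'http://localhost:9200/myindex/_segments?pretty'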

Eric - did you try going beyond 20 shards?

Otis

ELASTICSEARCH Performance Monitoring - Sematext Monitoring

On Friday, January 18, 2013 6:51:42 PM UTC-5, egaumer wrote:

--

Increasing shard count per node scales well as long as

  • the I/O subsystem is not saturated

  • the CPU can manage the load created by the increasing number of threads

The slowest part of the machine is the disk subsystem (whether it's SSD or
spinning disk makes no big difference on this point), which means that tuning
disk I/O will have the highest impact on performance.

Not only Lucene but also the translog adds to the disk I/O when the local
gateway is used.

The most interesting numbers are

  • the sustained amount of data that can be indexed per time interval (MB per
    sec, not only docs per sec), because this determines the I/O throughput
    ("sustained" meaning not just a peak for a few seconds, but the average
    observed over a significant period while the node's resources are fully
    utilized); a rough way to sample this is sketched after this list

  • the load of the machine while the node is indexing (not just CPU cycles),
    because this is significant for the overall workload throughput of the
    system

  • and the same while the node is being searched (or while it is both
    indexing and being searched)
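
As a rough sketch of how to sample these numbers (host and index name are
placeholders, and the exact layout of the stats response varies between
versions): take two readings of the index stats a few minutes apart while the
load is running and compute the deltas.

    # first reading: note indexing.index_total and store.size_in_bytes
    curl -XGET 'http://localhost:9200/myindex/_stats?pretty'

    # ... keep indexing for a few minutes ...

    # second reading: same fields again
    curl -XGET 'http://localhost:9200/myindex/_stats?pretty'

    # docs/sec = (index_total_2 - index_total_1) / elapsed_seconds
    # MB/sec   = (size_in_bytes_2 - size_in_bytes_1) / elapsed_seconds / 1048576
    #            (store size also moves with merges, so treat it as approximate)
    # compare with the machine load average (e.g. from uptime) over the same window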

On big machines, like HPC servers, these numbers scale much better than on
desktop machines.

To get shard numbers that can be compared, performance should be measured
against data volumes that utilize the machine to its fullest extent. That
is, if RAM is around 16 G and heap is 8 G, indexing 4 G of data may not
exploit all the machine's resources.

So it's really a hard task to make statements like "ES can manage 20+ or
100+ shards per node". It always depends. There is always a compromise.

Best regards,

Jörg

--

Yes, it's a factor of the hardware you're working with.

In our case, we had SSDs with 10GigE and 128GB of RAM with 24 cores. Even
with 20 shards (500GB total index size) the CPU load was under 50% and the
I/O and network subsystems were not saturated.

But in general, as Otis mentioned, more IndexWriters mean more parallelism.
If your indexing throughput isn't going up and none of your resources are
saturated, then consider adding more shards. The concurrent flushing in
Lucene 4 should reduce contention on the IndexWriter.

On Monday, January 21, 2013 11:27:38 AM UTC-5, Jörg Prante wrote:

--