More shards (on the same node) make indexing much slower

Hi

We have a simple test that indexes 100,000 documents into the same index
in ES and measures how long it takes. We always run against an ES
cluster with one node. We have tried varying the number of shards in the
index while running the exact same test on the same machine. I have
to say that I am surprised to see that the time it takes gets
significantly worse as we raise the number of shards.

1 shard: 64 secs
5 shards: 65 secs
10 shards: 69 secs
50 shards: 96 secs
100 shards: 149 secs
200 shards: 305 secs
500 shards: 526 secs

I would have expected the time to get only a little worse when raising
the number of shards - probably a little more administration is
necessary. But I did not expect this kind of behavior, where the time
seems to grow linearly with the number of shards (time ≈ 60 + #shards
secs). Is this expected behavior? If it is, please try to explain why a
significant rise in time cannot be avoided - I cannot find a logical
reason for it.

Regards, Per Steffensen

Hmmm, this was written directly on the Google Groups web page. Why doesn't
it arrive as a mail on the mailing list?

Each shard is a fully functional Lucene index. Even a single shard supports
concurrent indexing and can provide very good performance when it comes to
indexing time (and can potentially hold a large dataset, depending on your app).

When you create many shards on a single box and constantly index into all
the shards, it will tax your machine more. There are things that happen in
the background, such as merging index segments (the internal structures
a search index is composed of), and having more shards while constantly
indexing into them means more of those operations happening (and more
concurrent IO).

Btw, how do you index the data? Is it using the index API, or maybe the bulk
API? How many concurrent indexing processes are you running against the
node?
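
For reference, bulk indexing with the Java client of that era would look roughly like the sketch below (the index/type names and document source are made up, not taken from this thread):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class BulkIndexSketch {
    // Queue many documents into a single request instead of paying one
    // network round-trip (and one engine call) per document.
    static void bulkIndex(Client client) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (int i = 0; i < 1000; i++) {
            bulk.add(client.prepareIndex("myindex", "mytype", Integer.toString(i))
                    .setSource("{\"field\":\"value " + i + "\"}"));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage()); // per-item failures
        }
    }
}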

On Oct 4, 1:06 pm, Shay Banon kim...@gmail.com wrote:

Each shard is a fully functional Lucene index. Even a single shard supports
concurrent indexing and can provide very good performance when it comes to
indexing time (and can potentially hold a large dataset, depending on your app).

Yes, that was also my impression.

When you create many shards on a single box and constantly index into all
the shards, it will tax your machine more.

The combined amount of data in the index and the combined amount of
work (index operations) done is the same in all tests. A certain
amount of work should not tax my machine (much) more just because
there are more shards - I cannot see the logic in that, and consider
it a scalability problem (for now).

There are things that happen in the background, such as merging index
segments (the internal structures a search index is composed of), and
having more shards while constantly indexing into them means more of
those operations happening (and more concurrent IO).

Again, the combined amount of work should be the same. With more
shards, each shard will grow more slowly (as a function of the total
number of documents indexed) - e.g. with 50 shards, each shard receives
only 100,000/50 = 2,000 of the documents - and therefore the "things
happening in the background" should happen less often for each index,
or at least have less work to do.

Btw, how do you index the data? Is it using the index API, or maybe the bulk
API?

// One synchronous index request per document; op_type depends on whether
// the document is an update (INDEX overwrites) or new (CREATE, put-if-absent).
IndexRequestBuilder irb = client.prepareIndex(indexName,
        ESInterfaceSettings.INDEX_TYPE, doc.getId())
    .setOpType(update ? OpType.INDEX : OpType.CREATE)
    .setSource(doc.getESSource());
if (updatable) irb.setRouting(doc.getRouting());
if (update) irb.setVersion(doc.getVersion());
irb.execute().actionGet(); // blocks until the document is indexed

How many concurrent indexing processes are you running against the
node?

1
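
For what it's worth, a single client process can still keep several requests in flight by using the asynchronous variant of execute() instead of blocking on actionGet() - a rough sketch, with an illustrative listener body:

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.index.IndexResponse;

public class AsyncIndexSketch {
    // Fire the request off asynchronously, so many documents can be
    // in flight at once from a single process.
    static void indexAsync(IndexRequestBuilder irb) {
        irb.execute(new ActionListener<IndexResponse>() {
            public void onResponse(IndexResponse response) {
                // document indexed - record completion, release a permit, etc.
            }
            public void onFailure(Throwable t) {
                t.printStackTrace(); // illustrative error handling
            }
        });
    }
}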

Have done some profiling while running the test in two setups:
a) 1 node with 1 index with 1 shard
b) 1 node with 1 index with 50 shards

I have narrowed down the places in the code where most (by far) of the time is
spent - it is in RobinEngine.innerCreate. I have done specific profiling on that
method.

Test a) gives me the following profiling information:

https://lh3.googleusercontent.com/-xlTpQOJlUM4/TotjVxU4eBI/AAAAAAAAABM/7VjalpS39Qw/oneshard.png

It shows that almost all the time spent inside RobinEngine.innerCreate is
used by calls to IndexWriter.addDocument,
RobinEngine.loadCurrentVersionFromIndex and Translog.add - among other
things, that the one million calls to IndexWriter.addDocument have used
118 secs. My test stops after indexing one million documents.

Test b) gives me the following profiling information:

https://lh4.googleusercontent.com/-75-QACEXZH4/Totjj7I17AI/AAAAAAAAABU/2W39gfOnOEk/fiftyshards.png

It shows that this time 439,309 calls to IndexWriter.addDocument have used
723 secs - that is a lot slower per addDocument call than in test a) above
(roughly 723/439,309 ≈ 1.6 ms per call versus 118/1,000,000 ≈ 0.12 ms per
call). You will also notice that RobinEngine.loadCurrentVersionFromIndex
and Translog.add are a lot slower per call.

Regards, Per Steffensen

Curious, how do things look if you set index op_type to create, as
discussed here:

I believe that should mitigate any time spent in
loadCurrentVersionFromIndex.

With the default “put-if-absent” behavior a search needs to be done
across all the indices for every document submitted. From my (limited)
understanding that should be the main overhead of having so many
shards.
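
Translated to the Java client, the suggestion would look something like the sketch below (the index/type names are illustrative, and whether CREATE really avoids the lookup is exactly what is being discussed here):

import org.elasticsearch.action.index.IndexRequest.OpType;
import org.elasticsearch.client.Client;

public class CreateOnlySketch {
    // The suggestion: CREATE (put-if-absent) should avoid the lookup of the
    // current document version that a plain INDEX (overwrite) has to do.
    static void createOnly(Client client, String id, String json) {
        client.prepareIndex("myindex", "mytype", id) // illustrative names
                .setOpType(OpType.CREATE)
                .setSource(json)
                .execute().actionGet();
    }
}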

Also, just curious, what method did you use for profiling? I'm not too
versed in this realm and got pretty fed up when I tried to use the
perfanal tool.

Regards,
Paul

Thanks for your interest!

On Tuesday, October 4, 2011 at 11:42:43 PM UTC+2, ppearcy wrote:

Curious, how do things look if you set index op_type to create, as
discussed here:

I believe that should mitigate any time spent in
loadCurrentVersionFromIndex.

Do I understand you correctly, that using op_type create is cheaper than
using op_type index?

I will try it out, but I guess getting rid of loadCurrentVersionFromIndex
does not change the fact that things run much slower with 50 shards
than with 1 shard - even though the amount of work (number of
index operations) is the same.

With the default “put-if-absent” behavior a search needs to be done
across all the indices for every document submitted. From my (limited)
understanding that should be the main overhead of having so many
shards.

Looking at my profiling it is clear that loadCurrentVersionFromIndex takes
its share of the blame for it being much slower with 50 shards than with 1
shard, but it is not the full explanation - IndexWriter.addDocument and
Translog.add are also much slower on average per call.

Also, just curious, what method did you use for profiling? I'm not too
versed in this realm and got pretty fed up when I tried to use the
perfanal tool.

Earlier in my career I was employed at a small (well, it was at that time)
Danish company called Trifork (www.trifork.com). Among other things, we
implemented our own JEE server (called T4) and our own profiler (called P4).
I know those tools very, very well :-) I am not working at Trifork anymore,
but I still prefer P4 as my profiling tool (at least for time-profiling -
for memory-profiling I do not always prefer the Trifork tool (called L4,
which is now a subpart of P4)). P4 is very configurable with respect to
exactly what you want to profile, and among other things that is why you can
run with profiling without it actually disturbing the application being
profiled - that's nice.

Basically, in my last profiling run, I just started my ES node with a P4
agent inside, which was asked to only profile the RobinEngine.innerCreate
method. While the test was running I connected from my P4 viewer (running on
my local Mac) to the P4 agent inside the ES node JVM (running on an external
server) and took a few snapshots. Then it is just a matter of using the
viewer to see where the time is spent.

I can tell you a little more about how I did it, if you want to try it
yourself, but Trifork is also very helpful. P4 is a commercial product, by
the way, but I believe you can get a one-month trial.

Hmm. I looked at my code and realized that I am actually already using
op_type create, so I guess the time spent in
loadCurrentVersionFromIndex is already mitigated :-)

Hi again

https://lh6.googleusercontent.com/-r79XDfMJ_Eg/TpLig8x8u2I/AAAAAAAAABc/O-cqeWzXOug/insert_by_shards.png

The "problem" is two-fold:
a) Indexing 100.000 documents into an empty index, gets slower and slower
the more shards are in the index
b) Indexing 100.000 documents into an index already containing X documents,
gets slower and slower the more documents (X) are already in the index

I see no reason why a) should be true, and (as I mentioned before) I
consider it a scaling problem - it is not a fatal scaling problem, because
in most cases you can work around it; it just puts some restrictions on how
to use ES that I am sure were not intended.
I would expect b) to be true - that the indexing time grew logarithmically
in the number of documents already in the index/shard. But it appears to
grow way too fast - linearly in the number of documents already in the
index/shard.

Please see the graph above. It shows, for 10 consecutive runs where we
index 100,000 documents each time, how long each run took. So the "1st run"
is where there are no documents in ES yet, the "2nd run" is where there are
already 100,000 documents in ES, the "3rd run" is where there are already
200,000 documents in ES, etc. The number of shards is on the X-axis and the
time spent is on the Y-axis.
It is a little hard to see, but for the runs with 5 and 10 shards in the
index, neither a) nor b) seems to hold. Both a) and b) only become a
problem with 20+ shards. A funny observation is that it is "worse" with 50
shards than it is with 100, 200 and 500 shards. Another observation is that
the problems a) and b) seem to stabilize after 200 shards - not growing even
worse when the number of shards is increased further.

Regards, Per Steffensen