NRT indexing speed


(Crwe) #1

I plan to use the NRT feature heavily, for near-real-time indexing of
documents, let's say adding 1,000 documents at a time via bulk index.
I'd like to know:

Does the bulk indexing performance depend on the size of the index
(no. documents already indexed)? If so, linearly or sub-linearly?

Does the indexing performance depend on the number of shards (nodes)?
Linearly?

Does the indexing performance depend on the number of replicas?
Linearly?

Should I expect significant spikes in indexing latency (because of
some internal ES re-allocations, merges, etc.)? What is a "normal"
standard deviation of the time between "start bulk index" and "can
query the new docs in search"?

Sorry for my basic questions, I do not know Lucene well. Thank you.


(Crwe) #2

Bump.

Any insights on scalability of NRT indexing in ElasticSearch are most
welcome.

(In reply to Crwe's post of May 21, 6:07 pm.)


(Shay Banon) #3

It's kind of hard to answer such general questions. You are asking about
more than NRT, so I will answer the NRT part first: periodically, the
index of a shard is refreshed, which makes recent changes available for
search (by default, every 1 second). The other questions might be answered
by watching the following video, as a starting point for understanding how
ES works:
http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html
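
To make the refresh behaviour concrete, here is a minimal sketch (not taken from the thread) of the two request bodies involved: the newline-delimited body for the `_bulk` endpoint, and a settings body that changes `index.refresh_interval`. The index name `articles`, the document fields, and the helper names are made up for illustration. ES versions from the time of this thread also expected a `_type` in the action line, which is left out here.

```python
import json


def build_bulk_body(index_name, docs):
    """Serialize (id, source) pairs into the NDJSON body used by _bulk."""
    lines = []
    for doc_id, source in docs:
        # Action line first, then the document source on its own line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk body has to end with a trailing newline.
    return "\n".join(lines) + "\n"


def refresh_interval_settings(interval):
    """Body for an index settings update that changes the refresh period."""
    return {"index": {"refresh_interval": interval}}


# Hypothetical batch of 1,000 documents, as in the original question.
docs = [(str(i), {"title": f"doc {i}"}) for i in range(1000)]
body = build_bulk_body("articles", docs)
settings = refresh_interval_settings("1s")  # "1s" matches the default
```

The bulk body would be sent to `_bulk` and the settings body to the index's `_settings` endpoint. New documents become searchable at the next refresh after they are indexed, so with the default setting the extra delay is typically up to about one second.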

(In reply to Crwe's post of Mon, May 21, 2012 at 6:07 PM.)


(Crwe) #4

Thanks Shay, watched it.

In case someone comes across this thread later, here are the answers I
gathered from that video:

Does the bulk indexing performance depend on the size of the index
(no. documents already indexed)? If so, linearly or sub-linearly?

No.

Does the indexing performance depend on the number of shards (nodes)?
Linearly?

No. The performance is determined by indexing on the slowest shard,
but otherwise constant in the number of shards.
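
A rough sketch of why the slowest shard dominates, assuming documents are routed by hashing their ID modulo the number of primary shards and that shards process their part of a bulk request in parallel. `zlib.crc32` is only a stand-in hash for illustration, not the one ES uses, and the helper names and IDs are made up.

```python
import zlib
from collections import Counter


def shard_for(doc_id, num_primary_shards):
    # Stand-in hash (assumption): pick a shard from the document ID.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def docs_per_shard(doc_ids, num_primary_shards):
    """Count how many documents of one bulk request land on each shard."""
    return Counter(shard_for(d, num_primary_shards) for d in doc_ids)


def bulk_latency(per_shard_seconds):
    # Shards work in parallel, so the request finishes with the slowest one.
    return max(per_shard_seconds)


# Hypothetical bulk of 1,000 documents spread over 5 primary shards.
counts = docs_per_shard([f"doc-{i}" for i in range(1000)], 5)
```

With more shards each one receives a smaller slice of the batch, but the request still waits for whichever shard takes longest.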

Does the indexing performance depend on the number of replicas?
Linearly?

Same as above.

Should I expect significant spikes in indexing latency (because of
some internal ES re-allocations, merges, etc.)? What is a "normal"
standard deviation of the time between "start bulk index" and "can
query the new docs in search"?

Yes. Lucene merges segments from time to time, there are commits and
the like, plus there's Java's GC. I should expect a large stddev of
indexing times and occasional lags.
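
One way to put numbers on this would be to time each batch from "start bulk index" until a search returns the new docs, then summarize the samples. The sketch below covers only the summary step. The function name is made up, and the sample values are placeholders, not measurements.

```python
import math
import statistics


def summarize_latencies(samples_ms):
    """Mean, sample stddev, nearest-rank p99 and max of latency samples."""
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: smallest value covering 99% of the samples.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),  # needs at least two samples
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }


# Placeholder samples in milliseconds (not real measurements); the one
# large value stands in for a batch delayed by a merge or a GC pause.
summary = summarize_latencies([1050, 1100, 980, 1020, 4200, 1010, 1075, 990])
```

Looking at p99 and max next to the mean is a deliberate choice: a few slow batches barely move the mean but are exactly the occasional lags described above.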

Hopefully I didn't get it too wrong :)

