Offline indexing and expected scaling performance

simonw_2 · October 14, 2012, 1:06pm

Hey Hadar,

On Saturday, October 13, 2012 11:21:12 PM UTC+2, Hadar Rottenberg wrote:

I was wondering what's the best approach for indexing existing data which
is not expected to change.
Also I would like to understand better how sharding and indexing happens,
so I hope someone can help by answering my questions.

Do you already have speed problems, or are you just worried ahead of time?
The basic procedure is that you send data to any node; it figures out which
node the data needs to go to and forwards your request based on your routing
key (the ID by default). If you want to speed this up, you can make the
request async, as well as the replication (see the async part of the linked
Elastic document, under "asynchronous replication & write consistency").
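
As a rough illustration of the async option above: to the best of my recollection, the index API of that era accepted `replication` and `consistency` as query parameters. The sketch below only builds the request URL. The host, index, type and ID are hypothetical, and the parameter names are an assumption to check against the documentation for your version (later releases dropped them).

```python
from urllib.parse import quote, urlencode

def index_url(host, index, doc_type, doc_id,
              replication="async", consistency="one"):
    """Build an index-request URL with replication/consistency options.

    Assumption: `replication` and `consistency` are the query parameter
    names the index API used at the time of this thread.
    """
    # Escape each path segment so unusual IDs cannot break the URL.
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    query = urlencode({"replication": replication, "consistency": consistency})
    return f"http://{host}/{path}?{query}"

# Hypothetical host, index, type and ID.
url = index_url("localhost:9200", "wikipedia", "page", "42")
print(url)
```

The request can go to any node in the cluster; the receiving node forwards it to the shard that owns the routing key.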

I've read that Lucene can reach 100 GB/hour indexing Wikipedia on
standard hardware. Is Elasticsearch expected to scale linearly?

Depends on what you call standard hardware. But yes, 100 GB/hour is possible
with Lucene. I have seen 300 GB and even more with Lucene 4.0 and concurrent
flushing.
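
To put such figures in perspective, here is a back-of-the-envelope sketch of how long an initial load might take. The `efficiency` factor is a made-up knob for modelling less-than-linear scaling, not a measured value, and the dataset size is hypothetical.

```python
def hours_to_index(dataset_gb, gb_per_hour_per_node, nodes, efficiency=1.0):
    """Estimate wall-clock hours needed to index a dataset.

    efficiency=1.0 models perfectly linear scaling; lower values stand in
    for coordination and replication overhead (an assumed, unmeasured knob).
    """
    if nodes < 1 or not 0 < efficiency <= 1:
        raise ValueError("nodes must be >= 1 and efficiency in (0, 1]")
    cluster_rate = gb_per_hour_per_node * nodes * efficiency
    return dataset_gb / cluster_rate

# Hypothetical 1 TB dataset at 100 GB/hour per node.
single_node = hours_to_index(1000, 100, nodes=1)
five_nodes = hours_to_index(1000, 100, nodes=5, efficiency=0.8)
print(single_node, five_nodes)
```

Measuring the real per-node rate on a sample of your own data gives a far better input than any quoted benchmark.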

When performing indexing, can you send the data to all nodes? Or do
you send it only to the master node, which then distributes the documents
according to the selected routing hash?

You can send your data to any node; ES will figure out where it needs to
go.

When bulk indexing, is the data also queued on the indexing node until X
documents arrive?

I am not sure what you are referring to here.
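
In case the question is about batching: with the bulk API the grouping into batches of X documents is normally done by the client before the request is sent. A minimal sketch of that, assuming the newline-delimited bulk body format (an action line followed by the document source); the index name, type name and batch size are hypothetical.

```python
import json

def bulk_bodies(docs, index, doc_type, batch_size=1000):
    """Yield bulk request bodies, each holding up to batch_size documents.

    Every document contributes two newline-terminated JSON lines:
    an action line and the document source.
    """
    lines = []
    count = 0
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        # Flush the final, possibly smaller, batch.
        yield "\n".join(lines) + "\n"

# Hypothetical documents: five pages, batched two at a time.
docs = [(str(i), {"title": f"page {i}"}) for i in range(5)]
bodies = list(bulk_bodies(docs, "wikipedia", "page", batch_size=2))
print(len(bodies))
```

Each body would then be sent as one bulk request to any node.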

If the shard location of documents is not important, is it possible to
simply give each node part of the data to index locally?

You can just use a local node via the Java API. It will figure out which
node each document needs to go to, and it saves one hop on the way to the
right server.

Is it possible to pre-sort the documents according to the node routing
hash and then give the documents to each node locally? My idea is to use
Amazon EMR for the initial indexing, since it is much cheaper than EC2,

I'm not sure it is worth the trouble; you can scale out horizontally to
get more indexing speed.

is there a way to accomplish this? Meaning without running Elasticsearch
cloud, but simply creating Lucene indexes which would then be used as
Elasticsearch shards?

In theory this is possible, but again, I don't think it's worth the trouble.
Maybe you can tell me more about your concerns in terms of indexing speed?

What kind of hashing algorithm does Elasticsearch use to decide on the
document's shard?

It uses a hash function of the DJB family.
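
For illustration, here is a minimal sketch of DJB-style routing using the classic DJB2 string hash, reduced modulo the shard count. This is an assumption about the general shape only: the actual Elasticsearch implementation differs in details (for example signed 32-bit Java arithmetic), so these shard numbers should not be expected to match a real cluster. The `partition` helper is hypothetical and shows the pre-sorting idea from the question.

```python
from collections import defaultdict

def djb2(key):
    """Classic DJB2 string hash (h = h * 33 + c), truncated to 32 bits."""
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

def shard_for(routing_key, num_shards):
    """Map a routing key (the document ID by default) to a shard number."""
    return djb2(routing_key) % num_shards

def partition(doc_ids, num_shards):
    """Group document IDs by the shard they would be routed to."""
    groups = defaultdict(list)
    for doc_id in doc_ids:
        groups[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(groups)

# Hypothetical IDs spread over five shards.
groups = partition([str(i) for i in range(100)], num_shards=5)
print({shard: len(ids) for shard, ids in sorted(groups.items())})
```

Because the shard is derived from the hash modulo the number of shards, any pre-sorting of this kind is only valid for a fixed shard count.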

Thanks
Hadar
