Offline indexing and expected scaling performance

I was wondering what the best approach is for indexing existing data that
is not expected to change.
I would also like to understand better how sharding and indexing happen,
so I hope someone can help by answering my questions.

  1. I've read that Lucene can reach 100 GB/hour indexing Wikipedia on
    standard hardware; is Elasticsearch expected to scale linearly?
  2. When performing indexing, can you send the data to be indexed to any
    node? Or do you send it only to the master node, which then distributes
    the documents according to the selected routing hash?
  3. When bulk indexing, is the data also queued on the indexing node until X
    documents arrive?
  4. If the shard location of documents is not important, is it possible to
    simply give each node part of the data to index locally?
  5. Is it possible to pre-sort the documents according to the node routing
    hash and then give the documents to each node locally? My idea is to use
    Amazon EMR for the initial indexing since it is much cheaper than EC2.
    Is there a way to accomplish this, meaning without running an
    Elasticsearch cluster, but simply creating Lucene indexes which would
    then be used as Elasticsearch shards?
  6. What kind of hashing algorithm does Elasticsearch use to decide on the
    document's shard?

Thanks
Hadar

--

hey hadar,

On Saturday, October 13, 2012 11:21:12 PM UTC+2, Hadar Rottenberg wrote:

I was wondering what the best approach is for indexing existing data that
is not expected to change.
I would also like to understand better how sharding and indexing happen,
so I hope someone can help by answering my questions.

Do you already have speed problems, or are you just worried ahead of time?
The basic procedure is that you send data to any node; that node figures out
which shard the document needs to go to and forwards your request based on
your routing key (the document ID by default). If you want to speed this up
you can make the request asynchronous, as well as the replication (see the
asynchronous replication and write consistency section of the index API
documentation).
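
As a minimal sketch of what such a request could look like (the index name
"test", type "doc", and ID "1" are hypothetical, and the replication and
consistency parameters are from the 0.x-era index API):

# any node can receive this request; it hashes the routing value (the ID "1"
# here, since no explicit routing is given), picks the target shard, and
# forwards the request there
curl -XPUT 'localhost:9200/test/doc/1?replication=async&consistency=one' -d '{
  "title" : "example document"
}'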

  1. I've read that Lucene can reach 100 GB/hour indexing Wikipedia on
    standard hardware; is Elasticsearch expected to scale linearly?

That depends on what you call standard hardware, but yes, 100 GB/hour is
possible with Lucene. I have seen 300 GB/hour and even more with Lucene 4.0
and concurrent flushing.

  2. When performing indexing, can you send the data to be indexed to any
    node? Or do you send it only to the master node, which then distributes
    the documents according to the selected routing hash?

You can send your data to any node; ES will figure out where it needs to go.

  3. When bulk indexing, is the data also queued on the indexing node until X
    documents arrive?

I am not sure what you are referring to here.

  4. If the shard location of documents is not important, is it possible to
    simply give each node part of the data to index locally?

You can just use a local client node via the Java API; it will figure out
which node the document needs to go to, which saves one hop on the way to the
right server.

  5. Is it possible to pre-sort the documents according to the node routing
    hash and then give the documents to each node locally? My idea is to use
    Amazon EMR for the initial indexing since it is much cheaper than EC2.

I'm not sure it is worth the trouble; you can scale out horizontally to get
more indexing speed.

Is there a way to accomplish this, meaning without running an Elasticsearch
cluster, but simply creating Lucene indexes which would then be used as
Elasticsearch shards?

In theory this is possible, but again I don't think it's worth the trouble.
Maybe you can tell me more about your concerns in terms of indexing speed?

  6. What kind of hashing algorithm does Elasticsearch use to decide on the
    document's shard?

It uses a hash function from the DJB family; the target shard is essentially
hash(routing) % number_of_primary_shards.

Thanks
Hadar

--

Thanks for the quick reply.
I have 20 TB of compressed data (60 TB uncompressed) that I want to index,
and I'm planning which approach/software to use.
I don't need NRT since new data will not be added frequently, so I'm also
considering Katta, which works by creating Lucene shards beforehand via
MapReduce and then loading the shards into the search cluster.

  3. When bulk indexing, is the data also queued on the indexing node until X
    documents arrive?

What I mean is: when sending documents to be indexed, they are distributed
across the shards, so if I send 1,000 documents each shard gets around 10.
Indexing 10 documents at a time on each node doesn't seem efficient, but if
each node had a queue of documents to be indexed and the bulk indexing
happened at the node level, it would be more efficient.

--

hey,

On Sunday, October 14, 2012 11:03:51 PM UTC+2, Hadar Rottenberg wrote:

Thanks for the quick reply.
I have 20 TB of compressed data (60 TB uncompressed) that I want to index,
and I'm planning which approach/software to use.

That is a considerable amount of data. I would personally make the decision
based on experiments and make sure you configure everything correctly to get
the best possible throughput. I will list some things to look at below.

I don't need NRT since new data will not be added frequently, so I'm also
considering Katta, which works by creating Lucene shards beforehand via
MapReduce and then loading the shards into the search cluster.

OK, fair enough. You can also use Hadoop to index into Elasticsearch (here is
a GitHub project that does this:
https://github.com/infochimps-labs/wonderdog, bulk loading for Elasticsearch).

  3. When bulk indexing, is the data also queued on the indexing node until X
    documents arrive?

What I mean is: when sending documents to be indexed, they are distributed
across the shards, so if I send 1,000 documents each shard gets around 10.
Indexing 10 documents at a time on each node doesn't seem efficient, but if
each node had a queue of documents to be indexed and the bulk indexing
happened at the node level, it would be more efficient.

I am not sure I understand your concern. If you index 1,000 documents, the
documents are "routed" to the individual shards, and a document for shard X
is only indexed on shard X. If you set your replica count to 0 you don't
index a single document twice; if you don't, you re-index once for each
replica. You can do bulk indexing with replicas set to 0 and raise the count
again once you are done bulk indexing.
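
As a sketch, assuming the same hypothetical index "test" as in the refresh
example below, that could look like:

# drop replicas before the bulk load
curl -XPUT localhost:9200/test/_settings -d '{
  "index" : {
    "number_of_replicas" : 0
  }
}'

# ...run the bulk indexing..., then bring the replicas back afterwards
curl -XPUT localhost:9200/test/_settings -d '{
  "index" : {
    "number_of_replicas" : 1
  }
}'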

If you do bulk indexing, make sure you disable the refresh:

curl -XPUT localhost:9200/test/_settings -d '{
  "index" : {
    "refresh_interval" : "-1"
  }
}'
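
Once the bulk load is done you can set refresh_interval back the same way
(the default is "1s"), or trigger a refresh manually via the _refresh API.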

I also recommend you play around with the size of Lucene's internal RAM
buffer (see the indexing buffer settings in the Elasticsearch documentation);
setting this to a reasonable number is crucial for good indexing speed. I
personally haven't seen improvements in indexing speed above a 500 MB RAM
buffer, but that depends heavily on your documents.
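
As a sketch of what that could look like in elasticsearch.yml (assuming the
node-level indices.memory.index_buffer_size setting, which is shared by the
active shards on that node):

# node-level indexing buffer; accepts a byte size or a percentage of the heap
indices.memory.index_buffer_size: 500mb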

If you want to know more about how indexing works under the hood, you can
also read this blog post:
http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
But note: Elasticsearch is on Lucene 3.6 and doesn't have the concurrent
flushing yet (but soon!).

simon

--