Bulk loading


(Chris K Wensel) #1

Hey all

We are experimenting with bulk loading ES with Cascading/Hadoop.

Currently we are seeing the same ingress rate between the Node and Transport clients.

Besides the obvious options to get faster rates, are there ways to offload more work to the client side before invoking the ES cluster with the Node client?

Is it possible to create the indexes externally from the ES cluster (via ES apis) and then hand them off (via the gateway etc) once they are generated?

cheers,
chris

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading


(Shay Banon) #2

No, there is no option to build the indices externally... .

On Sat, Sep 11, 2010 at 7:28 AM, Chris K Wensel chris@wensel.net wrote:

Hey all

We are experimenting with bulk loading ES with Cascading/Hadoop.

Currently we are seeing the same ingress rate between the Node and
Transport clients.

Besides the obvious options to get faster rates, are there ways to offload
more work to the client side before invoking the ES cluster with the Node
client?

Is it possible to create the indexes externally from the ES cluster (via ES
apis) and then hand them off (via the gateway etc) once they are generated?

cheers,
chris

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading


(Otis Gospodnetić) #3

Well, couldn't one build raw Lucene indices (assuming one uses all the
correct field names and analysis as in the ES cluster that is to host
those indices later) and then just put them in the right directory -
the same dir where ES holds indices? Would that not work? If it
would, Chris, there is a Lucene indexing contrib in Hadoop called
index-contrib, I believe. See http://search-lucene.com/?q=lucene+hadoop+index

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

On Sep 11, 5:47 am, Shay Banon shay.ba...@elasticsearch.com wrote:

No, there is no option to build the indices externally... .

On Sat, Sep 11, 2010 at 7:28 AM, Chris K Wensel ch...@wensel.net wrote:

Hey all

We are experimenting with bulk loading ES with Cascading/Hadoop.

Currently we are seeing the same ingress rate between the Node and
Transport clients.

Besides the obvious options to get faster rates, are there ways to offload
more work to the client side before invoking the ES cluster with the Node
client?

Is it possible to create the indexes externally from the ES cluster (via ES
apis) and then hand them off (via the gateway etc) once they are generated?

cheers,
chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading


(Lukáš Vlček) #4

Otis, I think this approach would require also manual modification of
metadata in gateway and splitting index into appropriate number of Lucene
directories according to shard number. In this case I think it would be
easier to start one ES node and upload data into that instance having it
create all the shards for you. Then one could try manually merge the
metadata. However, I never tried this, so not sure if that is trivial.

Regards,
Lukas

On Sat, Sep 11, 2010 at 11:12 PM, Otis otis.gospodnetic@gmail.com wrote:

Well, couldn't one build raw Lucene indices (assuming one uses all the
correct field names and analysis as in the ES cluster that is to host
those indices later) and then just put them in the right directory -
the same dir where ES holds indices? Would that not work? If it
would, Chris, there is a Lucene indexing contrib in Hadoop called
index-contrib, I believe. See
http://search-lucene.com/?q=lucene+hadoop+index

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

On Sep 11, 5:47 am, Shay Banon shay.ba...@elasticsearch.com wrote:

No, there is no option to build the indices externally... .

On Sat, Sep 11, 2010 at 7:28 AM, Chris K Wensel ch...@wensel.net
wrote:

Hey all

We are experimenting with bulk loading ES with Cascading/Hadoop.

Currently we are seeing the same ingress rate between the Node and
Transport clients.

Besides the obvious options to get faster rates, are there ways to
offload

more work to the client side before invoking the ES cluster with the
Node

client?

Is it possible to create the indexes externally from the ES cluster
(via ES

apis) and then hand them off (via the gateway etc) once they are
generated?

cheers,
chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for
Cascading


(Shay Banon) #5

Let me explain a bit why its complicated to do, and I think in most cases
doen't make a lot of sense (compared to the alternative, of increasing
indexing speed by adding more elasticsearch machines).

If you want to build the index within your map reduce job, then probably
what you want is to build the indices locally on the machines the jobs are
currently running on. Then, what you would do is upload it to HDFS so they
will be accessible. Once all those indices have completed uploading the
HDFS, you need to then somehow make elasticsearch (or anything else) use it
by downloading the index files from HDFS to each "search" machine locally
and then use those indices.

There are several problems with this approach (is open for suggestions). The
first is that the sharding logic is probably different between the map
reduce job and the indices, as well as (more importantly) the shard count.
This means that rebuilding the index (which includes sharded Lucene indices)
is not something that can be easily done, if at all.

I have heard of people using Hadoop to do the map/reduce indexing process,
and then serve readonly indices from those results. This make sense in the
mentioned scenario, but this is certainly not something elasticsearch was
built for, and as you can see with where Google (with Google Instant) are
going, this will really hamper the freshness of the index.

I can see map/reduce like jobs running and interacting with an index mainly
for augmented extra data indexing, or bulk loading initial data. What you
would want to do is get the architecture to a place where data gets indexed
in real time, this is what elasticsearch was built for.

If you want, I can try and help with pointers as to how to improve the
indexing speed you get. Its quite easy to really increase it by using some
simple guidelines, for example:

  • Use create in the index API (assuming you can).
  • Relax the real time aspect from 1 second to something a bit higher
    (index.engine.robin.refresh_interval).
  • Increase the indexing buffer size (indices.memory.index_buffer_size), it
    defaults to the value 10% which is 10% of the heap.
  • Increase the number of dirty operations that trigger automatic flush (so
    the translog won't get really big, even though its FS based) by
    setting index.translog.flush_threshold (defaults to 5000).
  • Increase the memory allocated to elasticsearch node. By default its 1g.
  • Start with a lower replica count (even 0), and then once the bulk loading
    is done, increate it to the value you want it to be using the
    update_settings API. This will improve things as possibly less shards will
    be allocated to each machine.
  • Increase the number of machines you have so you get less shards allocated
    per machine.
  • Increase the number of shards an index has, so it can make use of more
    machines.
  • Make sure you make full use of the concurrent aspect of elasticsearch. You
    might not pushing it hard enough. For example, the map reduce job can index
    things concurrently. Just make sure not to overload elasticsearch.
  • Make Lucene use the non compound file format (basically, each segment gets
    compounded into a single file when using the compound file format). This
    will increase the number of open files, so make sure you have enough. Set
    index.merge.policy.use_compound_file to false.

If not using Java, there are more things to play with:

  • Try and use the thrift client instead of HTTP.

-shay.banon

On Sun, Sep 12, 2010 at 12:43 AM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

Otis, I think this approach would require also manual modification of
metadata in gateway and splitting index into appropriate number of Lucene
directories according to shard number. In this case I think it would be
easier to start one ES node and upload data into that instance having it
create all the shards for you. Then one could try manually merge the
metadata. However, I never tried this, so not sure if that is trivial.

Regards,
Lukas

On Sat, Sep 11, 2010 at 11:12 PM, Otis otis.gospodnetic@gmail.com wrote:

Well, couldn't one build raw Lucene indices (assuming one uses all the
correct field names and analysis as in the ES cluster that is to host
those indices later) and then just put them in the right directory -
the same dir where ES holds indices? Would that not work? If it
would, Chris, there is a Lucene indexing contrib in Hadoop called
index-contrib, I believe. See
http://search-lucene.com/?q=lucene+hadoop+index

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

On Sep 11, 5:47 am, Shay Banon shay.ba...@elasticsearch.com wrote:

No, there is no option to build the indices externally... .

On Sat, Sep 11, 2010 at 7:28 AM, Chris K Wensel ch...@wensel.net
wrote:

Hey all

We are experimenting with bulk loading ES with Cascading/Hadoop.

Currently we are seeing the same ingress rate between the Node and
Transport clients.

Besides the obvious options to get faster rates, are there ways to
offload

more work to the client side before invoking the ES cluster with the
Node

client?

Is it possible to create the indexes externally from the ES cluster
(via ES

apis) and then hand them off (via the gateway etc) once they are
generated?

cheers,
chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for
Cascading


(system) #6