Looking for advice on bulk loading

Hi all,

I'm looking for some advice on getting lots of docs into es.

We are generating es docs from a db and each doc is around 1k at the moment.
We expect each doc to grow to around 4k as we include more data. Our
initial set is 250M docs.

The db is distributed and we're using hadoop's map/reduce to generate the
json docs. We could write them out to disk and load, though at the moment
we've been trying to post them directly into es. We're using the java client
api and bulk requests with 20 docs each at the moment. We have capacity to
run around 160 doc generators concurrently, and doing that knocks es over
quite nicely. We have a 6 node cluster and each node is running both our
converters and es.

Is anyone doing something similar who would be able to suggest approaches to
loading the data into es in the most efficient way?

thanks

rob

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Rob Styles wrote:

We are generating es docs from a db and each doc is around 1k at the
moment. We expect to increase the size of each doc to around 4k as
we include more data. Our initial set is 250M docs.

[...]

We have capacity to run around 160 doc generators concurrently, and
doing that knocks es over quite nicely.

We have a 6 node cluster and each node is running both our
converters and es.

First of all, what kind of document indexing rate do you get before it
falls over? It may be that you're already maxing out performance and
would need to add nodes.

Some scattershot questions to try and find a possible issue:

What exactly do you mean by knocking ES over? Do you get OOMEs? Do
your nodes become unresponsive? Is it only a few nodes that have
trouble?

How much RAM do the machines have and what do you have set for your
ES heap settings?

Are you indexing into a single index or multiple? What's the output
of es shards from https://github.com/elasticsearch/es2unix?

If you back off the number of generators, at what number does the
cluster experience problems?

-Drew


Hi Rob,

I am not sure what your doc generators look like. Do I understand
correctly that you send from 160 threads (or JVMs?), each one using a
(Transport)Client, with a capacity of 160 x 20 = 3200 docs in flight
simultaneously? Or is it more? Do you connect to all 6 nodes
(TransportClient sniff mode enabled)? 3200 docs per second should be
possible; it depends on the fields and analyzers, of course.

Do I understand your setup correctly that the Hadoop converter, the ES
(Transport)Client instance, and ES itself share a single machine? Or even
a single JVM? I would suggest spreading the ingest load to other machines
with dedicated TransportClients that connect remotely to the cluster. That
way, preparing the JSON with the Hadoop converters (or whatever) and
compressing it for the wire is decoupled from the ES cluster's main task,
the Lucene-based indexing. And, with the (gigabit) network in between, you
can more easily measure the throughput you can push into the cluster.

It would be cool to get more info on how your bulk requests are managed.
Did you look at org.elasticsearch.action.bulk.BulkProcessor? It
demonstrates how to throttle bulk indexing with the help of a semaphore
around the BulkResponses. Note that there are also thread pools and netty
workers that can be reconfigured for large bulks on multicore machines.
With 160 simultaneous threads there is probably not much room for
queueing, but it may help if the clients wait for outstanding bulk
requests once a concurrency limit is exceeded, or pause once a specific
response-time limit is exceeded.
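To illustrate the semaphore idea, here is a minimal, self-contained sketch of throttled bulk submission. The Runnable stands in for the real client.bulk(...) call, and all names are mine, not the ES API; BulkProcessor itself handles this (and batching) for you.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BulkThrottle {
    private final Semaphore inFlight;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public BulkThrottle(int maxConcurrentBulks) {
        inFlight = new Semaphore(maxConcurrentBulks);
    }

    /** Blocks the caller while maxConcurrentBulks requests are outstanding. */
    public void submit(Runnable bulkRequest) {
        inFlight.acquireUninterruptibly();   // wait for a free slot
        pool.execute(() -> {
            try {
                bulkRequest.run();           // stand-in for client.bulk(...).actionGet()
            } finally {
                inFlight.release();          // slot freed when the response arrives
            }
        });
    }

    /** Waits for all submitted bulks to finish. */
    public boolean awaitCompletion(long seconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With 160 generators, a cluster-wide cap like this keeps the number of outstanding bulks bounded regardless of how fast the generators produce documents.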

Best regards,

Jörg

On 15.02.13 14:50, Rob Styles wrote:

Hi all,

I'm looking for some advice on getting lots of docs into es.

We are generating es docs from a db and each doc is around 1k at the
moment. We expect to increase the size of each doc to around 4k as we
include more data. Our initial set is 250M docs.

The db is distributed and we're using hadoop's map/reduce to generate
the json docs. We could write them out to disk and load, though at the
moment we've been trying to post them directly into es. We're using the
java client api and bulk requests with 20 docs each at the moment. We
have capacity to run around 160 doc generators concurrently, and doing
that knocks es over quite nicely. We have a 6 node cluster and each
node is running both our converters and es.

Is anyone doing something similar who would be able to suggest
approaches to loading the data into es in the most efficient way?


We do pretty much the same thing you do. We will backfill either by
running a script that traverses a database table and sends bulk index
requests to our Elasticsearch cluster over the Java API, or by running a
MapReduce job on our Hadoop cluster that sends bulk index requests to Elasticsearch
over the Java API. Either way, we put a sleep in between bulk insert
requests which we can adjust dynamically to slow down the backfill if our
Elasticsearch cluster is getting overloaded. You’ll have to experiment to
figure out how quickly you can push the backfill. I recommend figuring out
a way to put in a sleep that you can adjust while the backfill is running
to make it easier to figure out where your limits are and adjust. I also
recommend catching index errors or exceptions and waiting for a generous
period of time (several seconds) before retrying.
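As a rough sketch of the adjustable-sleep-plus-retry pattern described above: the sleep is volatile so it can be changed while the backfill runs, and failed bulk inserts wait a generous pause before retrying. bulkInsert is a hypothetical stand-in for the real bulk request; all names are illustrative, not an actual API.

```java
import java.util.function.Supplier;

public class BackfillPacer {
    // volatile so a controlling thread (or a config poller) can adjust it live
    private volatile long sleepMillis;

    public BackfillPacer(long initialSleepMillis) { sleepMillis = initialSleepMillis; }

    public void setSleepMillis(long ms) { sleepMillis = ms; }

    /** Runs one bulk insert, pausing generously before each retry.
     *  Returns the number of attempts it took. */
    public int insertWithRetry(Supplier<Boolean> bulkInsert, int maxRetries,
                               long retryPauseMillis) {
        for (int attempt = 1; ; attempt++) {
            boolean ok;
            try { ok = bulkInsert.get(); }       // stand-in for the real bulk request
            catch (RuntimeException e) { ok = false; }  // index errors count as failures
            if (ok) {
                pause(sleepMillis);              // adjustable pacing between bulks
                return attempt;
            }
            if (attempt > maxRetries) throw new RuntimeException("bulk insert kept failing");
            pause(retryPauseMillis);             // several seconds in practice
        }
    }

    private static void pause(long ms) {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```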

In the blog post announcing the Index Update Settings API
(http://www.elasticsearch.org/blog/2011/03/23/update-settings.html) Shay
recommended setting "index.refresh_interval": -1 and
"index.merge.policy.merge_factor": 30 at the start of the backfill, and
then restoring them to their defaults after the backfill. There is also
this gist (https://gist.github.com/duydo/2427158) that purports to be
advice from Shay, but some of the advice is confusing or contradictory. I
don’t understand how index.translog relates to index.refresh_interval,
for example. I also don’t understand why the Thrift API would be much
better, since it still requires a serialized JSON representation of the
document. There’s not much else you can do, as far as I know.
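For reference, the two settings changes can be sent through the Update Settings API. A sketch of the requests, assuming a hypothetical index named myindex; the restore values are the defaults of that era (1s refresh, merge factor 10):

```
PUT /myindex/_settings          (before the backfill)
{"index": {"refresh_interval": "-1", "merge.policy.merge_factor": 30}}

PUT /myindex/_settings          (after the backfill)
{"index": {"refresh_interval": "1s", "merge.policy.merge_factor": 10}}
```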

Ideally, we’d love to be able to do a MapReduce that wrote Lucene /
Elasticsearch index files to disk on our Hadoop cluster outside of
Elasticsearch. And we’d like to be able to deploy these indexes by doing
something like scp'ing them into place on our Elasticsearch cluster. But we
haven’t yet invested the resources to figure out exactly how to make that
work.

-Jon


You are referring to hints about speeding up indexing. In most cases you
can gain around 10-20% efficiency with these, so they are for situations
where you want to squeeze the most out of your existing resources. But to
set up bulk indexing in normal situations you don't always need such
tweaking; you can get very far with the default ES settings.

The idea behind tweaking ES settings is as follows: while long baseline
bulk loads are running, admins are willing to sacrifice search, and
realtime search in particular, in exchange for some percentage of
performance on the Lucene indexing side. Setting the refresh interval to
-1 disables realtime search, so IndexReader/IndexWriter switches and read
I/O are reduced, and write I/O can run at higher throughput. The merge
policy factor may be increased to give Lucene's expensive segment merging
more room.

The translog settings are interesting when you observe how many IOPS
(input/output operations per second) indexing needs. The idea is to
reduce the number of IOPS and thus the stress on the disk subsystem.
Disk I/O is by far the slowest part of the system; for example, if ES
indexing or translogging issues too many flushes to disk, indexing speed
will suffer badly. Changing the translog flush settings is one method,
but replacing slow disks with faster disks or SSDs (or adding loads of
RAM) will gain far more efficiency.
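A sketch of what relaxing the translog flush thresholds might look like via the Update Settings API; the index name and values are illustrative, not recommendations, and should be restored after the bulk load:

```
PUT /myindex/_settings
{"index": {"translog": {
  "flush_threshold_ops": 50000,
  "flush_threshold_size": "500mb",
  "flush_threshold_period": "60m"
}}}
```

Raising these thresholds means fewer, larger flushes, which trades a longer recovery log for lower IOPS during the load.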

The Thrift protocol does not use JSON; it is an alternative to HTTP+JSON.
It uses a compact binary protocol for object transport and reduces the
protocol overhead significantly. Serialization and deserialization are
faster than with HTTP. ES offers an optional Thrift plugin. For more
about Thrift, see http://jnb.ociweb.com/jnb/jnbJun2009.html

Jörg

On 16.02.13 02:29, Jon Shea wrote:

In the blog post announcing the Index Update Settings API
(http://www.elasticsearch.org/blog/2011/03/23/update-settings.html)
Shay recommended setting "index.refresh_interval": -1 and
"index.merge.policy.merge_factor": 30 at the start of the backfill,
and then restoring them to their defaults after the backfill. There is
also this gist (https://gist.github.com/duydo/2427158) that purports
to be advice from Shay, but some of the advice is confusing or
contradictory. I don’t understand how index.translog relates to
index.refresh_interval, for example. I also don’t understand why the
Thrift API would be much better, since it still requires a serialized
JSON representation of the document. There’s not much else you can do,
as far as I know.

Ideally, we’d love to be able to do a MapReduce that wrote Lucene /
Elasticsearch index files to disk on our Hadoop cluster outside of
Elasticsearch. And we’d like to be able to deploy these indexes by
doing something like scp'ing them into place on our Elasticsearch
cluster. But we haven’t yet invested the resources to figure out
exactly how to make that work.


Jörg,

I completely agree with your indexing advice. To summarize for Rob:

  1. You’re doing it pretty much how everyone does it.
  2. If you’re taking down your cluster, you need to slow down the rate you
    index. Use a configurable sleep, or fewer Hadoop machines.
  3. If you want to index documents faster, the best thing you can do is run
    more shards on more Elasticsearch nodes (one shard per node is optimal). It
    might also help to use better hardware (like SSDs), but I haven’t profiled
    that.
  4. The Elasticsearch defaults for indexing are pretty good, but you might
    be able to tweak them to get tens of percent improvements. See the links in
    my original post.

I remain a little skeptical about the Thrift API, though. I’ve looked at it
several times, and it’s really more of an HTTP on Thrift API than a first
class Thrift API. The Thrift struct for API requests has a binary blob
body
(https://github.com/elasticsearch/elasticsearch-transport-thrift/blob/master/elasticsearch.thrift#L23).
I haven’t bothered to fully trace the code, and the usage isn’t thoroughly
documented, but I presume that to use the Thrift API you form the JSON for
an API request, serialize it to UTF-8, and then put it in the body field
of a Thrift RestRequest. Please correct me if I’m wrong about that.
Parsing a Thrift API request might be somewhat less work for Elasticsearch
than parsing an HTTP API request, but parsing the body contents of a
Thrift API request is going to be the same as parsing the body contents of
an HTTP request. I haven’t profiled this, but I’d be surprised if
Elasticsearch was spending a ton of time parsing HTTP overhead.

Regardless, thanks for your advice on indexing. I really appreciate it.

-Jon
