You are referring to hints about speeding up indexing. In most cases these
hints gain around 10-20% in efficiency, so they are for situations where you
want to squeeze the most out of your existing resources. For bulk indexing
in normal situations you don't always need such tweaking; you can get very
far with the default ES settings.
The idea behind tweaking the ES settings is as follows: while long baseline
bulk loads are running, admins are willing to give up search, in particular
realtime search, in exchange for a few percent more performance on the
Lucene indexing side. Setting the refresh interval to -1 disables realtime
search, so IndexReader/IndexWriter switching and read I/O are reduced, and
write I/O can run at higher throughput. The merge policy factor may be
increased to give Lucene's expensive segment merging more room.
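For example, both settings can be changed per index at runtime with the
Index Update Settings API and restored to their defaults (1s refresh
interval, merge factor 10) after the bulk load. A minimal sketch, assuming
an index named "myindex" on a local node:

    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
        "index.refresh_interval" : "-1",
        "index.merge.policy.merge_factor" : 30
    }'

    ... run the bulk load ...

    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
        "index.refresh_interval" : "1s",
        "index.merge.policy.merge_factor" : 10
    }'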
The translog settings become interesting when you observe how many IOPS
(input/output operations per second) indexing needs. The idea is to reduce
the number of IOPS and thereby the stress on the disk subsystem. Note that
the translog flush is separate from the refresh: a refresh only makes
recently indexed documents visible to search, while a flush performs a
Lucene commit and clears the translog. Disk I/O is by far the slowest part
of the system; if ES indexing or translogging flushes to disk too often,
indexing speed will suffer badly. Changing the translog flush settings is
one method, but replacing slow disks with faster disks or SSDs (or adding
loads of RAM) gains far more efficiency.
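As an illustration, the flush thresholds can be raised per index so that
flushes happen less often during a bulk load. This is only a sketch; the
exact setting names and sensible values depend on your ES version, and
"myindex" is again a placeholder:

    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
        "index.translog.flush_threshold_ops" : 50000,
        "index.translog.flush_threshold_size" : "1gb",
        "index.translog.flush_threshold_period" : "60m"
    }'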
The Thrift protocol does not use JSON; it is an alternative to HTTP+JSON. It
uses a compact binary protocol for object transport and reduces the protocol
overhead significantly. Serialization and deserialization are faster than
with HTTP. ES offers an optional Thrift plugin. For more about Thrift, see
http://jnb.ociweb.com/jnb/jnbJun2009.html
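Roughly, the Thrift transport plugin is installed like any other ES plugin,
and the node then accepts Thrift connections on a separate port. This is a
sketch: the plugin version has to match your ES release, and the thrift.port
default of 9500 is taken from the plugin documentation, so check it for your
version.

    bin/plugin -install elasticsearch/elasticsearch-transport-thrift/<version>

    # in elasticsearch.yml (optional, 9500 is the default)
    thrift.port: 9500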
Jörg
On 16.02.13 02:29, Jon Shea wrote:
In the blog post announcing the Index Update Settings API, Shay recommended
setting "index.refresh_interval": -1 and "index.merge.policy.merge_factor": 30
at the start of the backfill, and then restoring them to their defaults after
the backfill. There is also this gist (Elasticsearch - Index best practices
from Shay Banon · GitHub) that purports to be advice from Shay, but some of
the advice is confusing or contradictory. I don’t understand how
index.translog relates to index.refresh_interval, for example. I also don’t
understand why the Thrift API would be much better, since it still requires a
serialized JSON representation of the document. There’s not much else you can
do, as far as I know.
Ideally, we’d love to be able to do a MapReduce that wrote Lucene /
Elasticsearch index files to disk on our Hadoop cluster outside of
Elasticsearch. And we’d like to be able to deploy these indexes by
doing something like scp'ing them into place on our Elasticsearch
cluster. But we haven’t yet invested the resources to figure out
exactly how to make that work.