Questions from a newbie


(Steff) #1

Hi

We are in the startup phase of a project where we are considering using
ES. I have a few questions that I would like someone to answer before we
can proceed.

A little bit about our needs:
We need to store 18 billion+ records consisting of things like numbers,
timestamps and text. We need to be able to do searches including prefix
search on the numbers (e.g. 123* should find 12345678), range search on
timestamps and full-text search on the texts. Our system will be very
write-intensive (about 50 million+ new records will need to be
stored/indexed every day, and 50 million+ old records will need to be
deleted). It will not be very read-intensive (only a few queries a day
are expected).

Questions:

  • First of all, in general we will be happy to receive any input you
    find relevant for us.

  • In general, do you believe ES will fit our needs - especially with
    respect to the number of records mentioned above and the read/write
    balance?

  • Is there some way of controlling which data goes to which shard -
    e.g. that all records where number-A is between 12340000 and
    12349999 always go to one specific shard? I believe I have heard
    from one of the ES guys that it is possible to make all records
    within a certain timestamp range go into the same shard, and that
    this is smart to do if you want to delete "old" records based on
    timestamp (it should be easier to delete an entire shard than to
    delete only selected records inside it), so I guess it is possible.
    Since the 50 million+ new records we will receive every day will
    probably all have timestamps from the same day, I believe we will
    certainly not want to configure it so that certain timestamp ranges
    go into certain shards, because that would create a bottleneck on
    those few shards (all 50 million+ records of a day would have to be
    inserted into very few shards). Since we are so write-intensive it
    is important that we are able to utilize all the power in the
    cluster for writing, and that we are able to scale our writing
    capability by adding more nodes/shards. Any comments on that
    thought? Am I right, or did I miss something?

  • Do you have a rule of thumb about the maximum size of a shard - or,
    put another way, how many shards will we need to store the 18
    billion+ records? I guess that number will basically depend on
    limitations in Lucene, where I believe I have heard that you
    shouldn't go beyond 10-100 million records per shard. Will 100
    shards be enough for us, is less enough, or do we need to consider
    1000?

  • Does ES have any kind of rebalancing mechanism? E.g. if nodes join
    along the way and we dynamically increase the number of shards, will
    it be able to "even out" the data already stored, or will it only
    affect data added after the point in time where we add nodes/shards?
    I guess this will be hard if we have put any kind of restriction on
    which shard a certain range of values needs to go to (see above).

  • Reading about persistence in ES I have a hard time figuring out
    exactly how it works - node-local storage vs. gateway storage. Do
    you have a pointer to a thorough description of how persistence
    works? In general I want to make sure that data is
    persisted-persisted once an inserting/indexing process has done a
    number of inserts (maybe bulk inserts) and finishes. It must not be
    possible for an inserting process to think it has inserted a number
    of records when they are not really persisted-persisted yet. By
    persisted-persisted I mean that no data will disappear even if all
    nodes in the cluster stop (e.g. due to a global power outage) a
    split second after the process finished, or if any single disk
    crashes a split second after the process finished. So
    persisted-persisted means stored on disk (will survive shutdown of
    the machine) - actually stored on at least two disks (redundant). I
    believe I heard one of the ES guys say something about
    records/indexes not being persisted-persisted until
    IndexWriter.commit (or something like it) in the underlying Lucene
    has been called. If that is true, I guess I need a way through ES to
    make sure this has happened. I have also heard that this operation
    is expensive and should therefore not be done too often. I need to
    make sure it has been done when my insert/index process finishes
    (call the operation as the last thing in the process), but if it is
    expensive I guess I need to make sure that my insert/index processes
    are not too small with respect to the number of records they insert.
    Any comments on that? What is a practical lower limit on the number
    of inserts that have to be done between IndexWriter.commits?

  • How do I update a record? Do I need to find the existing record
    (e.g. by id), delete it, and insert/index a new record with the
    combined information from the old record and the new information I
    have to add? Or is there some other way of updating records? And
    what about transaction isolation when doing this - if two processes
    update an existing record "at the same time", can I be sure that one
    of them will fail and the other will succeed?

  • We will have to run a lot of concurrent insert/index processes. Any
    comments about that? Strategies for the best way to do it?

  • Auto-establishment of redundancy? Will the ES cluster automatically
    notice when an "instance" of a shard (primary or replica) goes down,
    and automatically start establishing a new "instance" of that shard
    from the one that did not go down, in order to re-establish
    redundancy for that shard?

  • Any comments on ACID transactions? It would be nice if I could have
    all the ACID properties in a transaction spanning the entire work of
    an insert/index process - all or nothing is inserted/indexed, it is
    consistent with respect to what other processes might do
    concurrently, isolation levels?, and durability (I guess I have
    touched on that already). Will ES be able to participate in
    distributed XA transactions with two-phase commit etc.?

  • The numbers mentioned in the beginning (18 billion all in all, 50
    million in/out per day) are only first-attempt requirements.
    Basically we need to scale to "infinity" (very far) on those
    numbers. Are there any known limitations on how far an ES cluster
    can scale - that is, is there a limit where I cannot just buy more
    hardware to support higher numbers?

Regards, Per Steffensen


(Steff) #2

An additional question:

  • As I understand it, ES will automatically (maybe controlled by some
    configuration on our side) choose the shard where a new record will
    be inserted/indexed, as opposed to Solr/Solandra where the
    distribution among shards is more manual. Is this correctly
    understood? Our insert/index processes will (potentially) run on the
    same physical machines that participate in the ES cluster, and it
    might be a good idea (with respect to network overhead) for each
    process to insert/index into a shard that happens to be running on
    the same physical machine. Any comments on that? Do we have any
    control knobs for this area in ES?

Regards, Per Steffensen



(Steff) #3

  • A follow-up on my question about updating records: when updating, I
    need to be able to find the record to be updated without involving
    all shards, or else I will not be able to scale in
    number-of-possible-updates-per-time-unit - that is, I will not be
    able to just buy more hardware to support more updates per time
    unit, the way I expect to support more inserts per time unit by
    buying more hardware. When I want to update, I know that only 0 or 1
    records will match the search criteria I use to find the record to
    be updated, so the query will return a result set of size 0 or 1. In
    order not to involve all shards for such queries, there needs to be
    some kind of configuration (the same one controlling the destination
    of a new record among the shards) that ES can take into
    consideration when performing the search - only asking the one shard
    where it knows the record will exist, if it exists at all. What
    kinds of solutions do you have in this area? Is this possible? Only
    on the ids of the records? Or?

Regards, Per Steffensen


(James Cook) #4

Reply inline

On Wednesday, September 7, 2011 4:36:13 AM UTC-4, Steff wrote:

Questions:

  • First of all in general we will be happy to receive any input you will
    find relevant for us.

  • In general, do you believe ES will fit our needs - especially with
    respect to the number of records mentioned above and the read/write
    balance.

I think the size of your documents will have as much bearing as the count. I
am aware of a company that has indexed 4TB of raw data on 16 m2.xlarge EC2
nodes. There may be larger datasets out there.

  • Are there some way of controlling which data goes to what shard - e.g.
    that all records where number-A is between 12340000 and 12349999 always
    goes on one specific shard. I believe I have heard somewhere from one of
    the ES guys that it is possible to make all records with a certain
    timestamp range go into the same shard, and that it will be smart to do
    if you want to delete "old" records based on timestamp (it should be
    easier to delete an entire shard that deleting only selected records
    inside of it), so I guess it is possible. Since the 50million+ new
    records we will receive every day will probably all have timestamps from
    the same day, I believe we will certainly not want to configure it so
    that certain timestamp-ranges go into certain shards, because that will
    make a bottleneck on those few shards (all 50million+ records of a day
    will have to be inserted into very few shards). Since we are so
    write-extensive it is important that we are able to utilize all power in
    the cluster for writing, and that we are able to scale with respect to
    writing capabilility by adding more nodes/shards. Any comments on that
    thought? Am I right, or did I miss something?

http://www.elasticsearch.org/guide/reference/mapping/routing-field.html

The number of shards is not dynamic. You can change the number of replicas
at runtime, but not the number of shards.
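To make the routing idea concrete, here is a minimal Python sketch of the principle: a document's shard is derived from a hash of its routing value modulo the number of primary shards. The hash function ES actually uses is internal; `hashlib.md5` below is just a deterministic stand-in.

```python
# Sketch of routing-based shard selection: a document's shard is derived
# from a hash of its routing value modulo the number of primary shards.
# The hash function ES actually uses is internal; hashlib.md5 is just a
# deterministic stand-in here.
import hashlib

NUM_PRIMARY_SHARDS = 5  # fixed when the index is created

def shard_for(routing_value: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    digest = hashlib.md5(routing_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Documents sharing a routing value always land on the same shard:
assert shard_for("customer-42") == shard_for("customer-42")

# Changing the shard count would relocate almost every document, which
# is why the number of primary shards cannot be altered after creation:
moved = sum(1 for i in range(1000)
            if shard_for(f"doc-{i}", 5) != shard_for(f"doc-{i}", 6))
print(f"{moved} of 1000 documents would change shard")
```

This also shows why routing on a timestamp would funnel a whole day's writes into one shard, as feared above: every record with the same routing value hits the same shard.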

  • Do you have a rule-of-thumb about the maximum size of a shard, or put
    in another way how many shards will we need to be able to store the
    18billion+ records? I guess that number will basically be dependent on
    limitations in Lucene, where I believe I have heard that you shouldnt go
    beyond 10-100million records. Will 100 shards be enough for us, is less
    enough or do we need to consider 1000?

100 shards would mean roughly 180M records per shard - how many GB that
is depends on your document size.

There is anecdotal evidence on the mailing list of a user with 6 servers
running 128 shards and indexing TBs of data.

  • Do ES have any kind of rebalancing mechanism? E.g. if a nodes join
    along the way and we dynamically increase the number of shards, will it
    be able to "even out" the data already stored, or will it only have
    influence on data added after that point in time where we add
    nodes/shards. Guess this will be hard if we have put any kind of
    restrictions on which shard a certain range of values need to go to (see
    above).

Dynamically adding/removing nodes from an ES cluster is a key feature. A
single node can contain many shards, and the shard count is fixed. As
nodes come and go, the shards are rebalanced across the cluster.
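The point that rebalancing moves whole shards between nodes, never individual documents, can be sketched with a toy round-robin allocator (this is an illustration of the idea, not ES's actual allocation logic):

```python
# Toy illustration (not ES's real allocation logic): rebalancing moves
# whole shards between nodes, never individual documents, so the
# document-to-shard mapping stays stable while node placement changes.
def assign_shards(num_shards: int, nodes: list[str]) -> dict[int, str]:
    # round-robin shards across the available nodes
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}

before = assign_shards(10, ["node-1", "node-2"])
after = assign_shards(10, ["node-1", "node-2", "node-3"])

# Same set of shards, only their node placement differs:
assert set(before) == set(after)
print(before[2], "->", after[2])  # node-1 -> node-3
```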

  • Reading about persistence in ES I have a hard time figuring out
    exactly how it works - node-local storage vs gateway-storage. Do you
    have a pointer to a thorough description on how persistence work? I
    general I want to make sure that data will be persisted-persisted, when
    an inserting/indexing-process has done a number of inserts (maybe bulk
    inserts) and it finishes. It isnt allowed to be possible that a
    inserting-process thinks that is has inserted a number of records but
    that they are really not persisted-persisted yet. With
    persisted-persisted I mean that no data will disappear even though all
    nodes in the cluster will stop (e.g. due to global power-outage) a
    split-sec after the process finished, or even though any single disk
    will crash a split-sec after the process finished. So
    persisted-persisted means stored on disk (will survive shutdown of
    machine) - actually stored on at least two disks (redundant). I believe
    I heard one of the ES guys saying something about records/indexes not
    being persisted-persisted unless IndexWriter.commit (or something) of
    the Lucene underneath has been called. If that is true I guess I need a
    way through ES to make sure that this has happened. I also heard that
    this operation is expensive, and that it should therefore not be done
    too often. I need to make sure that it has been done when my
    insert/index-process finishes (call the operation as the last thing in
    the process), but if it is expensive I guess I need to make sure that my
    insert/index-processes are not too small with respect to the number of
    records that they insert. Any comments on that? What will a practical
    lower limit on number of inserts/indexes that have to be done between
    IndexWriter.commits?

As long as the hard disk containing the primary shard or one of its
associated replicas survives, the cluster startup process will be able to
recover the indexed data.
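On the question above about how many inserts to do between expensive commits: whatever threshold you settle on, the client-side pattern is simply to buffer records and hand them off in bulk-sized chunks. A generic, ES-agnostic sketch, where `send_bulk` is a placeholder for whatever bulk-index call your client offers:

```python
# Generic client-side batching sketch (not an ES API): buffer records
# and hand them off in bulk-sized chunks, flushing the remainder at the
# end. `send_bulk` is a placeholder for your client's bulk-index call.
def batched_insert(records, send_bulk, batch_size=5000):
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            send_bulk(batch)  # one bulk request per full batch
            batch = []
    if batch:
        send_bulk(batch)  # flush the final partial batch

# With 12 records and a batch size of 5, three bulk calls are made:
sent = []
batched_insert(range(12), sent.append, batch_size=5)
print([len(b) for b in sent])  # [5, 5, 2]
```

The right `batch_size` is workload-dependent; the usual advice is to benchmark a few sizes rather than guess.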

  • How to do updates to a record? Do I need to find the existing record
    (e.g. by id), delete the existing record and insert/index a new record
    with the combined information from the old record and the new
    information I have to add to it? Or are there any other way updating
    records? What about transaction isolation when doing this - if two
    processes are updating an existing record "at the same time" will I be
    sure that one of them will fail and that the other one will succeed?

To update a record: read the document, make your changes, and re-index
it. Optimistic concurrency is supported via versioning.
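The read-modify-reindex cycle with optimistic versioning can be illustrated with an in-memory stand-in for the document store (this models the behavior, not ES's API):

```python
# In-memory model of optimistic concurrency with versioning: an update
# only succeeds if the version it read is still current, so of two
# concurrent updaters exactly one fails and must re-read and retry.
class VersionConflict(Exception):
    pass

store = {"doc-1": {"version": 1, "body": {"count": 0}}}

def update(doc_id, expected_version, new_body):
    current = store[doc_id]
    if current["version"] != expected_version:
        raise VersionConflict(
            f"expected version {expected_version}, found {current['version']}")
    store[doc_id] = {"version": expected_version + 1, "body": new_body}

update("doc-1", 1, {"count": 1})  # first writer succeeds, version is now 2

try:
    update("doc-1", 1, {"count": 2})  # second writer also read version 1
except VersionConflict:
    print("conflict detected")  # prints "conflict detected"
```

So to the isolation question above: with versioning, the loser does not silently overwrite; it gets a conflict error and can retry.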

  • We will have to run a lot of concurrent insert/index-processes. Any
    comments about that? Strategies on how to do that the best way?

That's what ES is built to do, so I'm not sure what you're referring to.
If you are looking for the best indexing rate, throw more indexing
threads at the problem. There is a point where ES cannot handle the
load, so throttling may be important.

  • Auto establishment of redundancy? Will the ES cluster automatically
    notice when an "instance" of a shard (primary or secondary) goes down,
    and automatically start establishing a new "instance" of that shard from
    the one that did not go down, in order to reestablish redundancy of that
    shard?

ES does not spin up instances of itself. On EC2, we have EC2 monitor our CPU
load on our ES instances and create/destroy instances based on this metric.
The recovery, balancing, etc. is all automatic within ES.

  • Any comments on ACID transactions? I would be nice if I can have all
    the ACID-properties in a transaction spanning the entire work of a
    insert/index-process - all or nothing is inserted/indexed, it is
    consistent with respect to what other processes might do concurrently,
    isolation-levels? and durability (guess I have touched that already).
    Will ES be able to participate in distributed XA transactions with
    two-phase-commit etc.

I believe Shay has said there are no plans for distributed transactions.

  • The numbers mentioned in the beginning (18billion all in all,
    50million in/out per day) are only first-attempt-requirements. Basically
    we need to scale to "infinitly" (very far) on those numbers. Are there
    any known limitations on how far a ES cluster can scale - that is, is
    there a limit where I cannot just buy more hardware to support higher
    numbers?

Why don't you let us know? :-) Seriously, I haven't seen anyone say yet
that they can't handle a particular load by adding more nodes.



(James Cook) #5

You don't get to control which node or shard ends up holding the
document, but you can provide a value to base its routing on.

http://www.elasticsearch.org/guide/reference/mapping/routing-field.html

The only thing you know for sure is that documents with the same routing
value go to the same shard.


(James Cook) #6

When updating, the primary shard (and its replicas) is known to ES from
the routing id of the document. It doesn't use multicast or blast a
message to all nodes in the cluster.


(Berkay Mollamustafaoglu-2) #7

In ES you have several options for sharding:

  • you can specify number of shards (fixed number) per index, and let ES
    decide how to distribute the docs to the shards, can be called "automatic
    sharding"
  • you can use automatic sharding but control which docs go to which shard
    thru "routing"
  • you can create indices yourself and use aliases. ES allows you to use
    aliases or specify which indices a query should be executed against. sort of
    self managed sharding

If you will keep adding new data to the index over time, I think option 3,
self-managed sharding, is more suitable than automatic sharding. You can
control how to segment the data in your code - based on number of docs,
size of index, date, etc. You can also control which indices a query
should be executed against. For example, if you split the data based on
date, say an index per week, you can add some intelligence to your
queries to direct them only to the indexes that are relevant. You can
also use a combination of automated and controlled sharding: create an
index per week and have each index consist of 2-3 shards, etc.
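Berkay's index-per-week scheme can be sketched like this - the `records-YYYY-wNN` naming pattern and the helper names are assumptions for illustration, not an ES convention:

```python
# Toy sketch of "self-managed sharding": one index per ISO week, plus a
# helper that picks only the indices a date-range query actually needs.
from datetime import date, timedelta


def index_for(d: date) -> str:
    """Name of the weekly index a record with this date belongs in."""
    year, week, _ = d.isocalendar()
    return f"records-{year}-w{week:02d}"


def indices_for_range(start: date, end: date) -> list[str]:
    """Distinct weekly indices covering [start, end], in order."""
    names: list[str] = []
    d = start
    while d <= end:
        name = index_for(d)
        if not names or names[-1] != name:
            names.append(name)
        d += timedelta(days=1)
    return names


# A two-week range only touches the three indices it overlaps:
print(indices_for_range(date(2011, 9, 1), date(2011, 9, 14)))
# ['records-2011-w35', 'records-2011-w36', 'records-2011-w37']
```

Deleting a day's or week's worth of old records then becomes "drop the whole index" instead of deleting documents one by one, which is exactly the advantage discussed earlier in the thread.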

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype



(Steff) #8

James Cook wrote:

Reply inline

On Wednesday, September 7, 2011 4:36:13 AM UTC-4, Steff wrote:

Questions:

- First of all in general we will be happy to receive any input
you will
find relevant for us.

- In general, do you believe ES will fit our needs - especially with
respect to the number of records mentioned above and the read/write
balance.

I think the size of your documents will have as much bearing as the
count. I am aware of a company that has indexed 4TB of raw data on 16
m2.xlarge EC2 nodes. There may be larger datasets out there.
The documents will span in size from a few hundred bytes to a few
thousand bytes - on average probably about 1000 bytes. So it will
probably be about 18TB of raw data. Will that information enable you (or
others) to give a more detailed answer? I would also like to hear a
little about your thoughts on the read/write balance - many writes and
almost no querying. It might e.g. be that ES was designed with fast
query-response in mind, and not so much with a high insert/update/delete
rate in mind (e.g. from a philosophy that inserting is something you do
for all documents once and for all in the beginning, and after that
everything is about querying).

- Are there some way of controlling which data goes to what shard...

http://www.elasticsearch.org/guide/reference/mapping/routing-field.html
Will take a look at that.

The number of shards is not dynamic. You can change the number of
replicas at runtime, but not the number of shards.
Ok, I missed that. Thanks.

- Do you have a rule-of-thumb about the maximum size of a shard, or
put another way, how many shards will we need to be able to store the
18billion+ records? I guess that number will basically be dependent on
limitations in Lucene, where I believe I have heard that you shouldn't
go beyond 10-100million records. Will 100 shards be enough for us, is
less enough or do we need to consider 1000?

100 shards = 180M records = ? GB
According to the info above = 180GB

There is anecdotal evidence in the mailing list of a user with 6
servers running 128 shards and indexing TB's of data.
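The arithmetic behind those figures, as a quick sanity check (using the ~1000-byte average document size mentioned above; the on-disk index size will differ from the raw data size):

```python
# Back-of-the-envelope sizing behind the "180M records / 180GB" figures:
# 18 billion records at ~1000 bytes each, spread over 100 shards.
total_records = 18_000_000_000
avg_doc_bytes = 1_000
num_shards = 100

records_per_shard = total_records // num_shards
raw_bytes_per_shard = records_per_shard * avg_doc_bytes

print(records_per_shard)            # 180000000 (180M records per shard)
print(raw_bytes_per_shard / 10**9)  # 180.0 (GB of raw data per shard)
```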

- Do ES have any kind of rebalancing mechanism?...

Dynamically adding/removing nodes from an ES cluster is a key feature.
A single node can contain many shards. Shard count is fixed. As new
nodes come and go, the shards are rebalanced across the cluster.
Thanks. That is a sufficient answer now that I know that the number of
shards is fixed.

- Reading about persistence in ES I have a hard time figuring out
exactly how it works - node-local storage vs gateway-storage. Do you
have a pointer to a thorough description on how persistence work? I
general I want to make sure that data will be persisted-persisted,
when
an inserting/indexing-process has done a number of inserts (maybe
bulk
inserts) and it finishes. It isn't allowed to be possible that an
inserting-process thinks that it has inserted a number of records but
that they are really not persisted-persisted yet. With
persisted-persisted I mean that no data will disappear even though
all
nodes in the cluster will stop (e.g. due to global power-outage) a
split-sec after the process finished, or even though any single disk
will crash a split-sec after the process finished. So
persisted-persisted means stored on disk (will survive shutdown of
machine) - actually stored on at least two disks (redundant). I
believe
I heard one of the ES guys saying something about records/indexes not
being persisted-persisted unless IndexWriter.commit (or something) of
the Lucene underneath has been called. If that is true I guess I
need a
way through ES to make sure that this has happened. I also heard that
this operation is expensive, and that it should therefore not be done
too often. I need to make sure that it has been done when my
insert/index-process finishes (call the operation as the last
thing in
the process), but if it is expensive I guess I need to make sure
that my
insert/index-processes are not too small with respect to the
number of
records that they insert. Any comments on that? What will a practical
lower limit on number of inserts/indexes that have to be done between
IndexWriter.commits?

As long as the hard disk containing the primary shard or one of its
associated replicas survives, the cluster startup process will be able
to recover the indexed data.
So I guess my question becomes: When exactly are data stored on those
disks? Can I be sure that they have been stored on those disks the
moment I have finished executing the code on e.g.
http://www.elasticsearch.org/guide/reference/java-api/index_.html, or
will they (potentially) live in "memory only" for some period of time
after the code has finished?

- How to do updates to a record? Do I need to find the existing
record
(e.g. by id), delete the existing record and insert/index a new
record
with the combined information from the old record and the new
information I have to add to it? Or are there any other way updating
records? What about transaction isolation when doing this - if two
processes are updating an existing record "at the same time" will
I be
sure that one of them will fail and that the other one will succeed?

To update a record, you read document, make changes to document, index
document. Optimistic concurrency is supported using versioning.
Ok, as I understand your answer there is such a concept as "update" (in
RDBMS terminology) in ES. I thought that indexing a document would always
be considered an "insert" (in RDBMS terminology). As I understand you,
the "index" operation in ES can be used for both "inserting" and
"updating". But that requires that ES is able to see whether a document
you try to index is a "new" document or an "updated" version of an
existing document. How does ES know if it is one or the other by looking
at the document?
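The read-modify-index cycle with optimistic versioning that James describes can be modeled in a few lines. This is a toy in-memory store, not the ES API; the class and method names are illustrative assumptions:

```python
# Toy model of version-based optimistic concurrency: an index call that
# carries the version it read fails if the stored version has moved on.
class VersionConflict(Exception):
    pass


class Store:
    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        """Return (version, body); version 0 means 'does not exist'."""
        return self.docs.get(doc_id, (0, None))

    def index(self, doc_id, body, expected_version=None):
        """Insert or update; reject stale writers when a version is given."""
        current, _ = self.get(doc_id)
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, got {current}")
        self.docs[doc_id] = (current + 1, body)
        return current + 1


store = Store()
store.index("1", {"n": 1})                      # insert -> version 1
version, body = store.get("1")
store.index("1", {"n": 2}, expected_version=version)  # update -> version 2
try:
    # A second writer still holding version 1 is rejected:
    store.index("1", {"n": 3}, expected_version=version)
except VersionConflict:
    pass
```

Of two concurrent writers that read the same version, exactly one wins; the loser gets a conflict and must re-read before retrying, which matches the "one will fail, one will succeed" semantics asked about above.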

- We will have to run a lot of concurrent insert/index-processes. Any
comments about that? Strategies on how to do that the best way?

That's what ES is built to do, so I don't know what you refer to. If
you are looking for the best rate of indexing, then throw more
indexing threads at the problem. There is a point where ES cannot
handle the load so throttling may be important.
We will ensure throttling, but assuming that we can add all the nodes we
want, are you saying that there is still a known limit on how much can be
loaded into ES per time unit?

- Auto establishment of redundancy? Will the ES cluster automatically
notice when an "instance" of a shard (primary or secondary) goes
down,
and automatically start establishing a new "instance" of that
shard from
the one that did not go down, in order to reestablish redundancy
of that
shard?

ES does not spin up instances of itself. On EC2, we have EC2 monitor
our CPU load on our ES instances and create/destroy instances based on
this metric. The recovery, balancing, etc. is all automatic within ES.
I believe you missed my point, at least in the first part of your
answer, but probably answered it in the end. I was not talking about
starting up new nodes automatically. I was talking about copying a shard
to another existing node if the node where one of the replicas of that
particular shard lived goes down. E.g. let's assume we have three nodes
n1, n2 and n3 with one index consisting of two shards s1 and s2. We run
with 1 replica per shard. s1-primary is running on n1. s1-secondary and
s2-primary are running on n2. s2-secondary is running on n3. Now e.g. n2
goes down, so basically there is only one copy left in the cluster of
both s1 and s2. My question is whether n1 and n3 will automatically notice
that n2 went down and start doing the following in order to reestablish
the required level of redundancy (1 replica of all shards):

  • Creating a copy s1-secondary on n3 from s1-primary on n1
  • Creating a copy s2-primary on n1 from s2-secondary on n3
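That n1/n2/n3 scenario can be modeled as a toy re-allocation routine - purely illustrative of the expected outcome, not of how ES actually allocates shards:

```python
# Toy model: when a node fails, recreate missing shard copies on
# surviving nodes that don't already hold that shard, so every shard
# gets back to two copies (1 replica).
def rebalance(placements, failed_node, nodes):
    """placements maps shard name -> set of nodes holding a copy."""
    new = {shard: {n for n in holders if n != failed_node}
           for shard, holders in placements.items()}
    for shard, holders in new.items():
        while len(holders) < 2:  # replication factor 1 => 2 copies
            candidates = [n for n in nodes
                          if n != failed_node and n not in holders]
            holders.add(candidates[0])
    return new


placements = {"s1": {"n1", "n2"}, "s2": {"n2", "n3"}}
after = rebalance(placements, "n2", ["n1", "n2", "n3"])
# s1 is re-replicated onto n3, s2 onto n1 - exactly the two copy
# operations listed above:
assert after["s1"] == {"n1", "n3"}
assert after["s2"] == {"n1", "n3"}
```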

- Any comments on ACID transactions? It would be nice if I could have
all the ACID-properties in a transaction spanning the entire work of
an insert/index-process - all or nothing is inserted/indexed, it is
consistent with respect to what other processes might do concurrently,
isolation-levels? and durability (guess I have touched that already).
Will ES be able to participate in distributed XA transactions with
two-phase-commit etc.

I believe Shay has said there are no plans for distributed transactions.
What about non-distributed transactions? How do I achieve ACID properties
among many threads operating on the ES cluster as the only resource?
I guess Atomicity can be achieved using bulk indexing, Consistency can be
achieved by the version-based optimistic locking you mentioned, and the
answer that will hopefully come to the question "When exactly are data
stored on those disks?" above will give me the details on Durability.
What about Isolation (levels)?
Regarding Consistency, do I understand you correctly that re-indexing
of an updated version of an existing document will fail if the document
has changed (and its version number has changed) between the time the
current thread read it and the time it tries to re-index it?

- The numbers mentioned in the beginning (18billion all in all,
50million in/out per day) are only first-attempt requirements.
Basically we need to scale to "infinity" (very far) on those numbers.
Are there any known limitations on how far an ES cluster can scale -
that is, is there a limit where I cannot just buy more hardware to
support higher numbers?

Why don't you let us know? :slight_smile: Seriously. I haven't seen anyone say
that they can't handle a particular load yet by adding more nodes.
Will let you know if we ever get so far using ES :slight_smile:

Regards, Per Steffensen

(Steff) #9

James Cook wrote:

When updating, the primary shard (and its replicas) are known to ES
based on the routing id of the document. It doesn't use multicast or
blast a message to all nodes in the cluster.
I was concerned about finding the document that needs to be updated -
not about updating/re-indexing it. For "updating" to scale I need to be
able to find the document that I need to update without asking all
shards whether they have the document. Basically that reduces to being
able to do get-operations that only consult the one shard where the
document lives, if it exists. That at least requires that the search
criteria I use on the get-operation are "strict" enough to only retrieve
0 or 1 documents. But if I am able to make such strict search criteria,
is it possible to do get-operations that do not consult all shards in
the cluster, but only consult the single shard that contains the
document we are searching for, if it exists?
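The difference between a routed get (one shard consulted) and an unrouted lookup (all shards consulted) can be sketched as follows - a toy model with shards as plain dicts and an assumed hash-modulo routing function:

```python
# Toy model: a GET that carries the routing value hits exactly one
# shard; without it, every shard must be asked.
import zlib


def shard_for(routing: str, num_shards: int) -> int:
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def routed_get(shards, doc_id, routing):
    """Consults exactly one shard, computed from the routing value."""
    return shards[shard_for(routing, len(shards))].get(doc_id)


def broadcast_get(shards, doc_id):
    """Fallback without routing: every shard is consulted."""
    for shard in shards:
        if doc_id in shard:
            return shard[doc_id]
    return None


shards = [{} for _ in range(3)]
doc_id, routing = "doc-1", "customer-42"
shards[shard_for(routing, 3)][doc_id] = {"body": "..."}

assert routed_get(shards, doc_id, routing) == {"body": "..."}
assert broadcast_get(shards, doc_id) == {"body": "..."}
```

The routed path does constant work per lookup regardless of shard count, which is what makes update-by-id scale.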


(Steff) #10

Berkay Mollamustafaoglu wrote:

In ES you have several options for sharding:

* you can specify number of shards (fixed number) per index, and
  let ES decide how to distribute the docs to the shards, can be
  called "automatic sharding"
* you can use automatic sharding but control which docs go to
  which shard thru "routing"
* you can create indices yourself and use aliases. ES allows you
  to use aliases or specify which indices a query should be
  executed against. sort of self managed sharding 

If you will keep adding new data into the index in time, I think
option 3, self controlled sharding, is more suitable than the
automatic sharding. You can control from how to segment the data in
your code, based on number of docs, size of index etc. date, etc. you
can also control which indices a query should be executed against. For
example, if you split the data based on date, say an index per week,
etc. you can add some intelligence to your queries to direct the query
only to the indexes that are relevant. You can also use combination of
automated and controlled sharding, create an index per week and have
each index have 2-3 shards, etc.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Very interesting information. I never understood why you would ever have
more than one index, now that documents are untyped, but you have at
least explained one reason to me now. Basically both shards and indices
can be used for partitioning documents into isolated chunks. Using shards
within the same index is the easy way because it is very automated.
Using many different indices requires more work because it is more
manual, but it also provides more "power" to do "special/advanced" stuff
needed for your particular system. And the methods can even be mixed. Nice.


(Steff) #11

Thanks a lot to both of you for taking the time to answer my questions!


(Clinton Gormley) #12

Hi Per

We are in the startup-phase of a project where we consider using ES. I
have a few questions that I would please like someone to answer before
we can proceed.

To add to what James and Berkay have already said, it would be well
worth your time to watch the presentation that kimchy did at Berlin
Buzzwords:

http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html

clint


(Steff) #13

Clinton Gormley wrote:

Hi Per

We are in the startup-phase of a project where we consider using ES. I
have a few questions that I would please like someone to answer before
we can proceed.

To add to what James and Berkay have already said, it would be well
worth your time to watch the presentation that kimchy did at Berlin
Buzzwords:

http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html

Thanks. I did watch that already, but I believe most of my questions are
not or only partly answered in that video. Good presentation BTW.

clint


(Shay Banon) #14

Heya,

Great thread. Per, I lost track of possible open questions, if you still
have some, can you open new threads for them?



(Steff) #15

Shay Banon wrote:

Heya,

Great thread. Per, I lost track of possible open questions, if you
still have some, can you open new threads for them?

Thanks for replying, Shay.

I have now created new separate threads for (most of the) remaining
questions. For those interested in following the threads:

For some of the issues I have lowered my ambitions with respect to the
openness of the questions - some of the questions have become more
specific. I guess I can always RTFC (Read The F...... Code) :slight_smile:

Regards, Per Steffensen

