Questions from a newbie


(Steff) #1

Hi

We are in the startup phase of a project where we are considering using
ES. I have a few questions that I would like someone to answer before we
can proceed.

A little bit about our needs:
We need to store 18 billion+ records consisting of things like numbers,
timestamps and text. We need to be able to do searches including prefix
search on the numbers (e.g. 123* should find 12345678), range search on
timestamps and full-text search on the texts. Our system will be very
write-intensive (about 50 million+ new records will need to be
stored/indexed every day, and 50 million+ old records will need to be
deleted). It will not be very read-intensive (only a few queries a day
are expected).

Questions:

  • First of all, in general we will be happy to receive any input you
    find relevant for us.

  • In general, do you believe ES will fit our needs - especially with
    respect to the number of records mentioned above and the read/write
    balance?

  • Is there some way of controlling which data goes to which shard -
    e.g. that all records where number-A is between 12340000 and
    12349999 always go to one specific shard? I believe I have heard
    from one of the ES guys that it is possible to make all records
    within a certain timestamp range go into the same shard, and that
    this is smart to do if you want to delete "old" records based on
    timestamp (it should be easier to delete an entire shard than to
    delete only selected records inside it), so I guess it is possible.
    Since the 50 million+ new records we will receive every day will
    probably all have timestamps from the same day, I believe we will
    certainly not want to configure it so that certain timestamp ranges
    go into certain shards, because that would create a bottleneck on
    those few shards (all 50 million+ records of a day would have to be
    inserted into very few shards). Since we are so write-intensive it
    is important that we are able to utilize all the power in the
    cluster for writing, and that we are able to scale our writing
    capability by adding more nodes/shards. Any comments on that
    thought? Am I right, or did I miss something?

  • Do you have a rule of thumb about the maximum size of a shard - or,
    put another way, how many shards will we need to store the 18
    billion+ records? I guess that number will basically depend on
    limitations in Lucene, where I believe I have heard that you
    shouldn't go beyond 10-100 million records per shard. Will 100
    shards be enough for us, is less enough, or do we need to consider
    1000?

  • Does ES have any kind of rebalancing mechanism? E.g. if nodes join
    along the way and we dynamically increase the number of shards, will
    it be able to "even out" the data already stored, or will it only
    affect data added after the point in time where we add nodes/shards?
    I guess this will be hard if we have put any kind of restriction on
    which shard a certain range of values needs to go to (see above).

  • Reading about persistence in ES I have a hard time figuring out
    exactly how it works - node-local storage vs. gateway storage. Do
    you have a pointer to a thorough description of how persistence
    works? In general I want to make sure that data is
    persisted-persisted once an inserting/indexing process has done a
    number of inserts (maybe bulk inserts) and finishes. It must not be
    possible for an inserting process to think it has inserted a number
    of records when they are not really persisted-persisted yet. By
    persisted-persisted I mean that no data will disappear even if all
    nodes in the cluster stop (e.g. due to a global power outage) a
    split second after the process finished, or if any single disk
    crashes a split second after the process finished. So
    persisted-persisted means stored on disk (will survive shutdown of
    the machine) - actually stored on at least two disks (redundant). I
    believe I heard one of the ES guys say something about
    records/indexes not being persisted-persisted until
    IndexWriter.commit (or something like it) in the underlying Lucene
    has been called. If that is true, I guess I need a way through ES to
    make sure this has happened. I have also heard that this operation
    is expensive and should therefore not be done too often. I need to
    make sure it has been done when my insert/index process finishes
    (call the operation as the last thing in the process), but if it is
    expensive I guess I need to make sure that my insert/index processes
    are not too small with respect to the number of records they insert.
    Any comments on that? What is a practical lower limit on the number
    of inserts that have to be done between IndexWriter.commits?

  • How do I update a record? Do I need to find the existing record
    (e.g. by id), delete it, and insert/index a new record with the
    combined information from the old record and the new information I
    have to add? Or is there some other way of updating records? And
    what about transaction isolation when doing this - if two processes
    update an existing record "at the same time", can I be sure that one
    of them will fail and the other will succeed?

  • We will have to run a lot of concurrent insert/index processes. Any
    comments about that? Strategies for the best way to do it?

  • Auto-establishment of redundancy? Will the ES cluster automatically
    notice when an "instance" of a shard (primary or replica) goes down,
    and automatically start establishing a new "instance" of that shard
    from the one that did not go down, in order to re-establish
    redundancy for that shard?

  • Any comments on ACID transactions? It would be nice if I could have
    all the ACID properties in a transaction spanning the entire work of
    an insert/index process - all or nothing is inserted/indexed, it is
    consistent with respect to what other processes might do
    concurrently, isolation levels?, and durability (I guess I have
    touched on that already). Will ES be able to participate in
    distributed XA transactions with two-phase commit etc.?

  • The numbers mentioned in the beginning (18 billion all in all, 50
    million in/out per day) are only first-attempt requirements.
    Basically we need to scale to "infinity" (very far) on those
    numbers. Are there any known limitations on how far an ES cluster
    can scale - that is, is there a limit where I cannot just buy more
    hardware to support higher numbers?

Regards, Per Steffensen


(Steff) #2

An additional question:

  • As I understand it, ES will automatically (maybe controlled by some
    configuration on our side) choose the shard where a new record will
    be inserted/indexed, as opposed to Solr/Solandra where the
    distribution among shards is more manual. Is this correctly
    understood? Our insert/index processes will (potentially) run on the
    same physical machines that participate in the ES cluster, and it
    might be a good idea (with respect to network overhead) for each
    process to insert/index into a shard that happens to be running on
    the same physical machine. Any comments on that? Do we have any
    control knobs for this area in ES?

Regards, Per Steffensen



(Steff) #3

  • A follow-up on my question about updating records: when updating, I
    need to be able to find the record to be updated without involving
    all shards, or else I will not be able to scale in
    number-of-possible-updates-per-time-unit - that is, I will not be
    able to just buy more hardware to support more updates per time
    unit, the way I expect to support more inserts per time unit by
    buying more hardware. When I want to update, I know that only 0 or 1
    records will match the search criteria I use to find the record to
    be updated, so the query will return a result set of size 0 or 1. In
    order not to involve all shards for such queries, there needs to be
    some kind of configuration (the same one controlling the destination
    of a new record among the shards) that ES can take into
    consideration when performing the search - only asking the one shard
    where it knows the record will exist, if it exists at all. What
    kinds of solutions do you have in this area? Is this possible? Only
    on the ids of the records? Or?

Regards, Per Steffensen


(James Cook) #4

Reply inline

On Wednesday, September 7, 2011 4:36:13 AM UTC-4, Steff wrote:

Questions:

  • First of all in general we will be happy to receive any input you will
    find relevant for us.

  • In general, do you believe ES will fit our needs - especially with
    respect to the number of records mentioned above and the read/write
    balance.

I think the size of your documents will have as much bearing as the count. I
am aware of a company that has indexed 4TB of raw data on 16 m2.xlarge EC2
nodes. There may be larger datasets out there.

  • Are there some way of controlling which data goes to what shard - e.g.
    that all records where number-A is between 12340000 and 12349999 always
    goes on one specific shard. I believe I have heard somewhere from one of
    the ES guys that it is possible to make all records with a certain
    timestamp range go into the same shard, and that it will be smart to do
    if you want to delete "old" records based on timestamp (it should be
    easier to delete an entire shard that deleting only selected records
    inside of it), so I guess it is possible. Since the 50million+ new
    records we will receive every day will probably all have timestamps from
    the same day, I believe we will certainly not want to configure it so
    that certain timestamp-ranges go into certain shards, because that will
    make a bottleneck on those few shards (all 50million+ records of a day
    will have to be inserted into very few shards). Since we are so
    write-extensive it is important that we are able to utilize all power in
    the cluster for writing, and that we are able to scale with respect to
    writing capabilility by adding more nodes/shards. Any comments on that
    thought? Am I right, or did I miss something?

http://www.elasticsearch.org/guide/reference/mapping/routing-field.html

The number of shards is not dynamic. You can change the number of replicas
at runtime, but not the number of shards.
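To make the routing idea concrete, here is a minimal Python sketch of the principle: a document's shard is derived from a hash of its routing value modulo the number of primary shards. The hash function ES actually uses is internal; `hashlib.md5` below is just a deterministic stand-in.

```python
# Sketch of routing-based shard selection: a document's shard is derived
# from a hash of its routing value modulo the number of primary shards.
# The hash function ES actually uses is internal; hashlib.md5 is just a
# deterministic stand-in here.
import hashlib

NUM_PRIMARY_SHARDS = 5  # fixed when the index is created

def shard_for(routing_value: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    digest = hashlib.md5(routing_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Documents sharing a routing value always land on the same shard:
assert shard_for("customer-42") == shard_for("customer-42")

# Changing the shard count would relocate almost every document, which
# is why the number of primary shards cannot be altered after creation:
moved = sum(1 for i in range(1000)
            if shard_for(f"doc-{i}", 5) != shard_for(f"doc-{i}", 6))
print(f"{moved} of 1000 documents would change shard")
```

This also shows why routing on a timestamp would funnel a whole day's writes into one shard, as feared above: every record with the same routing value hits the same shard.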

  • Do you have a rule-of-thumb about the maximum size of a shard, or put
    in another way how many shards will we need to be able to store the
    18billion+ records? I guess that number will basically be dependent on
    limitations in Lucene, where I believe I have heard that you shouldnt go
    beyond 10-100million records. Will 100 shards be enough for us, is less
    enough or do we need to consider 1000?

100 shards would mean roughly 180M records per shard - how many GB that
is depends on your document size.

There is anecdotal evidence on the mailing list of a user with 6 servers
running 128 shards and indexing TBs of data.

  • Do ES have any kind of rebalancing mechanism? E.g. if a nodes join
    along the way and we dynamically increase the number of shards, will it
    be able to "even out" the data already stored, or will it only have
    influence on data added after that point in time where we add
    nodes/shards. Guess this will be hard if we have put any kind of
    restrictions on which shard a certain range of values need to go to (see
    above).

Dynamically adding/removing nodes from an ES cluster is a key feature. A
single node can contain many shards, and the shard count is fixed. As
nodes come and go, the shards are rebalanced across the cluster.
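The point that rebalancing moves whole shards between nodes, never individual documents, can be sketched with a toy round-robin allocator (this is an illustration of the idea, not ES's actual allocation logic):

```python
# Toy illustration (not ES's real allocation logic): rebalancing moves
# whole shards between nodes, never individual documents, so the
# document-to-shard mapping stays stable while node placement changes.
def assign_shards(num_shards: int, nodes: list[str]) -> dict[int, str]:
    # round-robin shards across the available nodes
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}

before = assign_shards(10, ["node-1", "node-2"])
after = assign_shards(10, ["node-1", "node-2", "node-3"])

# Same set of shards, only their node placement differs:
assert set(before) == set(after)
print(before[2], "->", after[2])  # node-1 -> node-3
```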

  • Reading about persistence in ES I have a hard time figuring out
    exactly how it works - node-local storage vs gateway-storage. Do you
    have a pointer to a thorough description on how persistence work? I
    general I want to make sure that data will be persisted-persisted, when
    an inserting/indexing-process has done a number of inserts (maybe bulk
    inserts) and it finishes. It isnt allowed to be possible that a
    inserting-process thinks that is has inserted a number of records but
    that they are really not persisted-persisted yet. With
    persisted-persisted I mean that no data will disappear even though all
    nodes in the cluster will stop (e.g. due to global power-outage) a
    split-sec after the process finished, or even though any single disk
    will crash a split-sec after the process finished. So
    persisted-persisted means stored on disk (will survive shutdown of
    machine) - actually stored on at least two disks (redundant). I believe
    I heard one of the ES guys saying something about records/indexes not
    being persisted-persisted unless IndexWriter.commit (or something) of
    the Lucene underneath has been called. If that is true I guess I need a
    way through ES to make sure that this has happened. I also heard that
    this operation is expensive, and that it should therefore not be done
    too often. I need to make sure that it has been done when my
    insert/index-process finishes (call the operation as the last thing in
    the process), but if it is expensive I guess I need to make sure that my
    insert/index-processes are not too small with respect to the number of
    records that they insert. Any comments on that? What will a practical
    lower limit on number of inserts/indexes that have to be done between
    IndexWriter.commits?

As long as the hard disk containing the primary shard or one of its
associated replicas survives, the cluster startup process will be able to
recover the indexed data.
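On the question above about how many inserts to do between expensive commits: whatever threshold you settle on, the client-side pattern is simply to buffer records and hand them off in bulk-sized chunks. A generic, ES-agnostic sketch, where `send_bulk` is a placeholder for whatever bulk-index call your client offers:

```python
# Generic client-side batching sketch (not an ES API): buffer records
# and hand them off in bulk-sized chunks, flushing the remainder at the
# end. `send_bulk` is a placeholder for your client's bulk-index call.
def batched_insert(records, send_bulk, batch_size=5000):
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            send_bulk(batch)  # one bulk request per full batch
            batch = []
    if batch:
        send_bulk(batch)  # flush the final partial batch

# With 12 records and a batch size of 5, three bulk calls are made:
sent = []
batched_insert(range(12), sent.append, batch_size=5)
print([len(b) for b in sent])  # [5, 5, 2]
```

The right `batch_size` is workload-dependent; the usual advice is to benchmark a few sizes rather than guess.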

  • How to do updates to a record? Do I need to find the existing record
    (e.g. by id), delete the existing record and insert/index a new record
    with the combined information from the old record and the new
    information I have to add to it? Or are there any other way updating
    records? What about transaction isolation when doing this - if two
    processes are updating an existing record "at the same time" will I be
    sure that one of them will fail and that the other one will succeed?

To update a record: read the document, make your changes, and re-index
it. Optimistic concurrency is supported via versioning.
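The read-modify-reindex cycle with optimistic versioning can be illustrated with an in-memory stand-in for the document store (this models the behavior, not ES's API):

```python
# In-memory model of optimistic concurrency with versioning: an update
# only succeeds if the version it read is still current, so of two
# concurrent updaters exactly one fails and must re-read and retry.
class VersionConflict(Exception):
    pass

store = {"doc-1": {"version": 1, "body": {"count": 0}}}

def update(doc_id, expected_version, new_body):
    current = store[doc_id]
    if current["version"] != expected_version:
        raise VersionConflict(
            f"expected version {expected_version}, found {current['version']}")
    store[doc_id] = {"version": expected_version + 1, "body": new_body}

update("doc-1", 1, {"count": 1})  # first writer succeeds, version is now 2

try:
    update("doc-1", 1, {"count": 2})  # second writer also read version 1
except VersionConflict:
    print("conflict detected")  # prints "conflict detected"
```

So to the isolation question above: with versioning, the loser does not silently overwrite; it gets a conflict error and can retry.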

  • We will have to run a lot of concurrent insert/index-processes. Any
    comments about that? Strategies on how to do that the best way?

That's what ES is built to do, so I'm not sure what you're referring to.
If you are looking for the best indexing rate, throw more indexing
threads at the problem. There is a point where ES cannot handle the
load, so throttling may be important.

  • Auto establishment of redundancy? Will the ES cluster automatically
    notice when an "instance" of a shard (primary or secondary) goes down,
    and automatically start establishing a new "instance" of that shard from
    the one that did not go down, in order to reestablish redundancy of that
    shard?

ES does not spin up instances of itself. On EC2, we have EC2 monitor our CPU
load on our ES instances and create/destroy instances based on this metric.
The recovery, balancing, etc. is all automatic within ES.

  • Any comments on ACID transactions? I would be nice if I can have all
    the ACID-properties in a transaction spanning the entire work of a
    insert/index-process - all or nothing is inserted/indexed, it is
    consistent with respect to what other processes might do concurrently,
    isolation-levels? and durability (guess I have touched that already).
    Will ES be able to participate in distributed XA transactions with
    two-phase-commit etc.

I believe Shay has said there are no plans for distributed transactions.

  • The numbers mentioned in the beginning (18billion all in all,
    50million in/out per day) are only first-attempt-requirements. Basically
    we need to scale to "infinitly" (very far) on those numbers. Are there
    any known limitations on how far a ES cluster can scale - that is, is
    there a limit where I cannot just buy more hardware to support higher
    numbers?

Why don't you let us know? :-) Seriously, I haven't seen anyone say yet
that they can't handle a particular load by adding more nodes.



(James Cook) #5

You don't get to control which node or shard ends up holding the
document, but you can provide a value to base its routing on.

http://www.elasticsearch.org/guide/reference/mapping/routing-field.html

The only thing you know for sure is that documents with the same routing
value go to the same shard.


(James Cook) #6

When updating, the primary shard (and its replicas) is known to ES from
the routing id of the document. It doesn't use multicast or blast a
message to all nodes in the cluster.


(Berkay Mollamustafaoglu-2) #7

In ES you have several options for sharding:

  • you can specify number of shards (fixed number) per index, and let ES
    decide how to distribute the docs to the shards, can be called "automatic
    sharding"
  • you can use automatic sharding but control which docs go to which shard
    thru "routing"
  • you can create indices yourself and use aliases. ES allows you to use
    aliases or specify which indices a query should be executed against. sort of
    self managed sharding

If you will keep adding new data to the index over time, I think option 3,
self-managed sharding, is more suitable than automatic sharding. You can
control how to segment the data in your code - based on number of docs,
size of index, date, etc. You can also control which indices a query
should be executed against. For example, if you split the data based on
date, say an index per week, you can add some intelligence to your
queries to direct them only to the indexes that are relevant. You can
also use a combination of automated and controlled sharding: create an
index per week and have each index consist of 2-3 shards, etc.
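Berkay's index-per-week scheme can be sketched like this - the `records-YYYY-wNN` naming pattern and the helper names are assumptions for illustration, not an ES convention:

```python
# Toy sketch of "self-managed sharding": one index per ISO week, plus a
# helper that picks only the indices a date-range query actually needs.
from datetime import date, timedelta


def index_for(d: date) -> str:
    """Name of the weekly index a record with this date belongs in."""
    year, week, _ = d.isocalendar()
    return f"records-{year}-w{week:02d}"


def indices_for_range(start: date, end: date) -> list[str]:
    """Distinct weekly indices covering [start, end], in order."""
    names: list[str] = []
    d = start
    while d <= end:
        name = index_for(d)
        if not names or names[-1] != name:
            names.append(name)
        d += timedelta(days=1)
    return names


# A two-week range only touches the three indices it overlaps:
print(indices_for_range(date(2011, 9, 1), date(2011, 9, 14)))
# ['records-2011-w35', 'records-2011-w36', 'records-2011-w37']
```

Deleting a day's or week's worth of old records then becomes "drop the whole index" instead of deleting documents one by one, which is exactly the advantage discussed earlier in the thread.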

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype



(Steff) #8

James Cook wrote:

Reply inline

On Wednesday, September 7, 2011 4:36:13 AM UTC-4, Steff wrote:

Questions:

- First of all in general we will be happy to receive any input
you will
find relevant for us.

- In general, do you believe ES will fit our needs - especially with
respect to the number of records mentioned above and the read/write
balance.

I think the size of your documents will have as much bearing as the
count. I am aware of a company that has indexed 4TB of raw data on 16
m2.xlarge EC2 nodes. There may be larger datasets out there.
The documents will span in size from a few hundred bytes to a few
thousand bytes - on average probably about 1000 bytes. So it will
probably be about 18TB of raw data. Will that information enable you (or
others) to give a more detailed answer? I would also like to hear a
little about your thoughts on the read/write balance - many writes and
almost no querying. It might e.g. be that ES was designed with fast
query-response in mind, and not so much with a high insert/update/delete
rate in mind (e.g. from a philosophy that inserting is something you do
for all documents once and for all in the beginning, and after that
everything is about querying).

- Are there some way of controlling which data goes to what shard...

http://www.elasticsearch.org/guide/reference/mapping/routing-field.html
Will take a look at that.

The number of shards is not dynamic. You can change the number of
replicas at runtime, but not the number of shards.
Ok, I missed that. Thanks.

- Do you have a rule-of-thumb about the maximum size of a shard, or
put another way, how many shards will we need to be able to store the
18billion+ records? I guess that number will basically be dependent on
limitations in Lucene, where I believe I have heard that you shouldn't
go beyond 10-100million records. Will 100 shards be enough for us, is
less enough or do we need to consider 1000?

100 shards = 180M records = ? GB
According to the info above = 180GB

There is anecdotal evidence in the mailing list of a user with 6
servers running 128 shards and indexing TB's of data.
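The arithmetic behind those figures, as a quick sanity check (using the ~1000-byte average document size mentioned above; the on-disk index size will differ from the raw data size):

```python
# Back-of-the-envelope sizing behind the "180M records / 180GB" figures:
# 18 billion records at ~1000 bytes each, spread over 100 shards.
total_records = 18_000_000_000
avg_doc_bytes = 1_000
num_shards = 100

records_per_shard = total_records // num_shards
raw_bytes_per_shard = records_per_shard * avg_doc_bytes

print(records_per_shard)            # 180000000 (180M records per shard)
print(raw_bytes_per_shard / 10**9)  # 180.0 (GB of raw data per shard)
```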

- Do ES have any kind of rebalancing mechanism?...

Dynamically adding/removing nodes from an ES cluster is a key feature.
A single node can contain many shards. Shard count is fixed. As new
nodes come and go, the shards are rebalanced across the cluster.
Thanks. That is a sufficient answer now that I know that the number of
shards is fixed.

- Reading about persistence in ES I have a hard time figuring out
exactly how it works - node-local storage vs gateway-storage. Do you
have a pointer to a thorough description on how persistence work? I
general I want to make sure that data will be persisted-persisted,
when
an inserting/indexing-process has done a number of inserts (maybe
bulk
inserts) and it finishes. It isn't allowed to be possible that an
inserting-process thinks that it has inserted a number of records but
that they are really not persisted-persisted yet. With
persisted-persisted I mean that no data will disappear even though
all
nodes in the cluster will stop (e.g. due to global power-outage) a
split-sec after the process finished, or even though any single disk
will crash a split-sec after the process finished. So
persisted-persisted means stored on disk (will survive shutdown of
machine) - actually stored on at least two disks (redundant). I
believe
I heard one of the ES guys saying something about records/indexes not
being persisted-persisted unless IndexWriter.commit (or something) of
the Lucene underneath has been called. If that is true I guess I
need a
way through ES to make sure that this has happened. I also heard that
this operation is expensive, and that it should therefore not be done
too often. I need to make sure that it has been done when my
insert/index-process finishes (call the operation as the last
thing in
the process), but if it is expensive I guess I need to make sure
that my
insert/index-processes are not too small with respect to the
number of
records that they insert. Any comments on that? What will a practical
lower limit on number of inserts/indexes that have to be done between
IndexWriter.commits?

As long as the hard disk containing the primary shard or one of its
associated replicas survives, the cluster startup process will be able
to recover the indexed data.
So I guess my question becomes: When exactly are data stored on those
disks? Can I be sure that they have been stored on those disks the
moment I have finished executing the code on e.g.
http://www.elasticsearch.org/guide/reference/java-api/index_.html, or
will they (potentially) live in "memory only" for some period of time
after the code has finished?

- How to do updates to a record? Do I need to find the existing
record
(e.g. by id), delete the existing record and insert/index a new
record
with the combined information from the old record and the new
information I have to add to it? Or are there any other way updating
records? What about transaction isolation when doing this - if two
processes are updating an existing record "at the same time" will
I be
sure that one of them will fail and that the other one will succeed?

To update a record, you read document, make changes to document, index
document. Optimistic concurrency is supported using versioning.
Ok, as I understand your answer there is such a concept as "update" (in
RDBMS terminology) in ES. I thought that indexing a document would always
be considered an "insert" (in RDBMS terminology). As I understand you,
the "index" operation in ES can be used for both "inserting" and
"updating". But that requires that ES is able to see whether a document
you try to index is a "new" document or an "updated" version of an
existing document. How does ES know if it is one or the other by looking
at the document?
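The read-modify-index cycle with optimistic versioning that James describes can be modeled in a few lines. This is a toy in-memory store, not the ES API; the class and method names are illustrative assumptions:

```python
# Toy model of version-based optimistic concurrency: an index call that
# carries the version it read fails if the stored version has moved on.
class VersionConflict(Exception):
    pass


class Store:
    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        """Return (version, body); version 0 means 'does not exist'."""
        return self.docs.get(doc_id, (0, None))

    def index(self, doc_id, body, expected_version=None):
        """Insert or update; reject stale writers when a version is given."""
        current, _ = self.get(doc_id)
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, got {current}")
        self.docs[doc_id] = (current + 1, body)
        return current + 1


store = Store()
store.index("1", {"n": 1})                      # insert -> version 1
version, body = store.get("1")
store.index("1", {"n": 2}, expected_version=version)  # update -> version 2
try:
    # A second writer still holding version 1 is rejected:
    store.index("1", {"n": 3}, expected_version=version)
except VersionConflict:
    pass
```

Of two concurrent writers that read the same version, exactly one wins; the loser gets a conflict and must re-read before retrying, which matches the "one will fail, one will succeed" semantics asked about above.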

- We will have to run a lot of concurrent insert/index-processes. Any
comments about that? Strategies on how to do that the best way?

That's what ES is built to do, so I don't know what you refer to. If
you are looking for the best rate of indexing, then throw more
indexing threads at the problem. There is a point where ES cannot
handle the load so throttling may be important.
We will ensure throttling, but assuming that we can add all the nodes we
want, are you saying that there is still a known limit on how much can be
loaded into ES per time unit?

- Auto establishment of redundancy? Will the ES cluster automatically
notice when an "instance" of a shard (primary or secondary) goes
down,
and automatically start establishing a new "instance" of that
shard from
the one that did not go down, in order to reestablish redundancy
of that
shard?

ES does not spin up instances of itself. On EC2, we have EC2 monitor
our CPU load on our ES instances and create/destroy instances based on
this metric. The recovery, balancing, etc. is all automatic within ES.
I believe you missed my point, at least in the first part of your
answer, but probably answered it in the end. I was not talking about
starting up new nodes automatically. I was talking about copying a shard
to another existing node if the node where one of the replicas of that
particular shard lived goes down. E.g. let's assume we have three nodes
n1, n2 and n3 with one index consisting of two shards s1 and s2. We run
with 1 replica per shard. s1-primary is running on n1. s1-secondary and
s2-primary are running on n2. s2-secondary is running on n3. Now e.g. n2
goes down, so basically there is only one copy left in the cluster of
both s1 and s2. My question is whether n1 and n3 will automatically notice
that n2 went down and start doing the following in order to reestablish
the required level of redundancy (1 replica of all shards):

  • Creating a copy s1-secondary on n3 from s1-primary on n1
  • Creating a copy s2-primary on n1 from s2-secondary on n3
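That n1/n2/n3 scenario can be modeled as a toy re-allocation routine - purely illustrative of the expected outcome, not of how ES actually allocates shards:

```python
# Toy model: when a node fails, recreate missing shard copies on
# surviving nodes that don't already hold that shard, so every shard
# gets back to two copies (1 replica).
def rebalance(placements, failed_node, nodes):
    """placements maps shard name -> set of nodes holding a copy."""
    new = {shard: {n for n in holders if n != failed_node}
           for shard, holders in placements.items()}
    for shard, holders in new.items():
        while len(holders) < 2:  # replication factor 1 => 2 copies
            candidates = [n for n in nodes
                          if n != failed_node and n not in holders]
            holders.add(candidates[0])
    return new


placements = {"s1": {"n1", "n2"}, "s2": {"n2", "n3"}}
after = rebalance(placements, "n2", ["n1", "n2", "n3"])
# s1 is re-replicated onto n3, s2 onto n1 - exactly the two copy
# operations listed above:
assert after["s1"] == {"n1", "n3"}
assert after["s2"] == {"n1", "n3"}
```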

- Any comments on ACID transactions? It would be nice if I could have
all the ACID-properties in a transaction spanning the entire work of
an insert/index-process - all or nothing is inserted/indexed, it is
consistent with respect to what other processes might do concurrently,
isolation-levels? and durability (guess I have touched that already).
Will ES be able to participate in distributed XA transactions with
two-phase-commit etc.

I believe Shay has said there are no plans for distributed transactions.
What about non-distributed transactions? How do I achieve ACID properties
among many threads operating on the ES cluster as the only resource?
I guess Atomicity can be achieved using bulk indexing, Consistency can be
achieved by the version-based optimistic locking you mentioned, and the
answer that will hopefully come to the question "When exactly are data
stored on those disks?" above will give me the details on Durability.
What about Isolation (levels)?
Regarding Consistency, do I understand you correctly that re-indexing
of an updated version of an existing document will fail if the document
has changed (and its version number has changed) between the time the
current thread read it and the time it tries to re-index it?

- The numbers mentioned in the beginning (18billion all in all,
50million in/out per day) are only first-attempt requirements.
Basically we need to scale to "infinity" (very far) on those numbers.
Are there any known limitations on how far an ES cluster can scale -
that is, is there a limit where I cannot just buy more hardware to
support higher numbers?

Why don't you let us know? :slight_smile: Seriously. I haven't seen anyone say
that they can't handle a particular load yet by adding more nodes.
Will let you know if we ever get so far using ES :slight_smile:

Regards, Per Steffensen

(Steff) #9

James Cook wrote:

When updating, the primary shard (and its replicas) are known to ES
based on the routing id of the document. It doesn't use multicast or
blast a message to all nodes in the cluster.
I was concerned about finding the document that needs to be updated -
not about updating/re-indexing it. For "updating" to scale I need to be
able to find the document that I need to update without asking all
shards whether they have the document. Basically that reduces to being
able to do get-operations that only consult the one shard where the
document lives, if it exists. That at least requires that the search
criteria I use on the get-operation are "strict" enough to only retrieve
0 or 1 documents. But if I am able to make such strict search criteria,
is it possible to do get-operations that do not consult all shards in
the cluster, but only consult the single shard that contains the
document we are searching for, if it exists?
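The difference between a routed get (one shard consulted) and an unrouted lookup (all shards consulted) can be sketched as follows - a toy model with shards as plain dicts and an assumed hash-modulo routing function:

```python
# Toy model: a GET that carries the routing value hits exactly one
# shard; without it, every shard must be asked.
import zlib


def shard_for(routing: str, num_shards: int) -> int:
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def routed_get(shards, doc_id, routing):
    """Consults exactly one shard, computed from the routing value."""
    return shards[shard_for(routing, len(shards))].get(doc_id)


def broadcast_get(shards, doc_id):
    """Fallback without routing: every shard is consulted."""
    for shard in shards:
        if doc_id in shard:
            return shard[doc_id]
    return None


shards = [{} for _ in range(3)]
doc_id, routing = "doc-1", "customer-42"
shards[shard_for(routing, 3)][doc_id] = {"body": "..."}

assert routed_get(shards, doc_id, routing) == {"body": "..."}
assert broadcast_get(shards, doc_id) == {"body": "..."}
```

The routed path does constant work per lookup regardless of shard count, which is what makes update-by-id scale.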


(Steff) #10

Berkay Mollamustafaoglu wrote:

In ES you have several options for sharding:

* you can specify number of shards (fixed number) per index, and
  let ES decide how to distribute the docs to the shards, can be
  called "automatic sharding"
* you can use automatic sharding but control which docs go to
  which shard thru "routing"
* you can create indices yourself and use aliases. ES allows you
  to use aliases or specify which indices a query should be
  executed against. sort of self managed sharding 

If you will keep adding new data into the index in time, I think
option 3, self controlled sharding, is more suitable than the
automatic sharding. You can control from how to segment the data in
your code, based on number of docs, size of index etc. date, etc. you
can also control which indices a query should be executed against. For
example, if you split the data based on date, say an index per week,
etc. you can add some intelligence to your queries to direct the query
only to the indexes that are relevant. You can also use combination of
automated and controlled sharding, create an index per week and have
each index have 2-3 shards, etc.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Very interesting information. I never understood why you would ever have
more than one index, now that documents are untyped, but you have at
least explained one reason to me now. Basically both shards and indices
can be used for partitioning documents into isolated chunks. Using shards
within the same index is the easy way because it is very automated.
Using many different indices requires more work because it is more
manual, but it also provides more "power" to do "special/advanced" stuff
needed for your particular system. And the methods can even be mixed. Nice.


(Steff) #11

Thanks a lot to both of you for taking the time to answer my questions!


(Clinton Gormley) #12

Hi Per

We are in the startup-phase of a project where we consider using ES. I
have a few questions that I would please like someone to answer before
we can proceed.

To add to what James and Berkay have already said, it would be well
worth your time to watch the presentation that kimchy did at Berlin
Buzzwords:

http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html

clint


(Steff) #13

Clinton Gormley wrote:

Hi Per

We are in the startup-phase of a project where we consider using ES. I
have a few questions that I would please like someone to answer before
we can proceed.

To add to what James and Berkay have already said, it would be well
worth your time to watch the presentation that kimchy did at Berlin
Buzzwords:

http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html

Thanks. I did watch that already, but I believe most of my questions are
not or only partly answered in that video. Good presentation BTW.

clint


(Shay Banon) #14

Heya,

Great thread. Per, I lost track of possible open questions, if you still
have some, can you open new threads for them?



(Steff) #15

Shay Banon wrote:

Heya,

Great thread. Per, I lost track of possible open questions, if you
still have some, can you open new threads for them?

Thanks for replying, Shay.

I have now created new separate threads for (most of the) remaining
questions. For those interested in following the threads:

For some of the issues I have lowered my ambitions with respect to the
openness of the questions - some of the questions have become more
specific. I guess I can always RTFC (Read The F...... Code) :slight_smile:

Regards, Per Steffensen

