Incrementally scaling ES from the small data


(pellyadolfo) #1

Hi, I plan to start with a small project and a small data set (a few
thousand records) to learn how ES responds, and then incrementally increase
data and resources on demand, up to big data, taking advantage of ES
scalability.

Is there a document describing such a strategy, i.e.:

  • how to properly configure a small, basic deployment with good performance
    on low resources (shards, nodes, clusters...)?

  • then, how to detect when resources (shards, nodes...) need to be added
    incrementally as the data load increases?

All the docs I find on scaling ES start from deployments with millions or
billions of records.

Alternatively, any advice on properly "configuring ES for small data" as a
starting point?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/79926cfe-4365-4a34-895b-70835ae895dc%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Boaz Leskes) #2

Hi Adolfo,

The best way to scale depends on your data and how it behaves. You can
watch this great talk by Shay about two use cases to get
inspired: http://www.elasticsearch.org/videos/big-data-search-and-analytics/

Cheers,
Boaz


(Mark Walkom) #3

As a really, really rough guide:
Start with a small instance, 4-8G RAM (2-4G heap). Keep loading documents
until things start to slow down (i.e. query/update responsiveness drops),
then add a new node.
Rinse and repeat.

If you have one node there is no point using replicas as they have nowhere
to go. You can easily add replicas later though so it's no big deal.
Shards is a little harder, start with the standard/default of 8 shards and
go from there. Using aliases can allow you to reindex your data later if
you feel you may want to change this.

You can monitor your cluster with a range of monitoring plugins, such as
elasticHQ, kopf, elasticsearch-monitoring and bigdesk. Just search for them
on GitHub.

As Boaz mentioned, it really does depend on what you are doing. Chances are
you will go through all this and get to a point where you want to rebuild
your cluster with all your gained knowledge!

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com


(pellyadolfo) #4

Thanks both for your comments.

Shards is a little harder, start with the standard/default of 8 shards and go from there.

  • This is the point that confuses me the most. For a very small initial
    deployment, with a few thousand docs, why not just define 1 shard with
    no replica? What criteria did you use to set 8 shards as a default? (BTW,
    the defaults in ES 0.90.5 are 5 successful shards and 5 unassigned
    shards, aren't they?)

  • Suppose that I start with the smallest possible setup: 1 cluster, 1 node,
    1 shard, no replica. Will I be able to incrementally scale any of these
    settings up? And will I also be able to scale any of them down
    afterwards (or will I need to repopulate ES in some cases)? The idea is
    to test different configs.

  • In my current particular case, can I scale down my current 5 shards/1
    replica (the 0.90.5 default, AFAIK) to 1 shard/no replica, and start from
    there?

The reason I am concerned about this is that I see a lot of sockets (maybe
200 on my system, which runs 2 ES instances for different apps on the same
machine) and want to understand where they come from and how to allocate the
optimum number. I watched Shay's presentation yesterday but could not grasp
this info.

Thanks


(Ivan Brusic) #5

Elasticsearch uses consistent hashing, so you cannot change the number of
shards for an index.

If you can reindex data, then you can create a new index with a different
number of shards and simply reindex. If your data is temporal in nature,
you can create a new index per day/week/month and these new indices can
have a different shard value. You can search against multiple indices even
if they have different shard values.
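
As a rough illustration of the per-period index idea, here is a minimal Java sketch that derives monthly index names and lists the ones covering a date range. The "<prefix>-yyyy.MM" naming scheme, the "events" prefix and the class and method names are assumptions made for this example; the thread does not prescribe any of them.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

final class TimeBasedIndexNames {
    // Hypothetical naming scheme "<prefix>-yyyy.MM"; not mandated by Elasticsearch.
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyy.MM");

    // Name of the monthly index that a document dated `date` would be written to.
    static String indexFor(String prefix, LocalDate date) {
        return prefix + "-" + date.format(MONTH);
    }

    // Names of all monthly indices that overlap the range [from, to].
    static List<String> indicesBetween(String prefix, LocalDate from, LocalDate to) {
        List<String> names = new ArrayList<>();
        LocalDate cursor = from.withDayOfMonth(1);
        while (!cursor.isAfter(to)) {
            names.add(indexFor(prefix, cursor));
            cursor = cursor.plusMonths(1);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(indexFor("events", LocalDate.of(2014, 1, 7)));
        System.out.println(String.join(",",
                indicesBetween("events", LocalDate.of(2013, 11, 20), LocalDate.of(2014, 1, 7))));
    }
}
```

Each new index could be created with its own shard count, and a search would then target the comma-separated list of names returned by indicesBetween.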

IMHO, shard values in the high single digits (5-10) is a great starting
point. Even with a single node cluster, the default number of shards (5)
should not cause any performance issues.

Cheers,

Ivan


(pellyadolfo) #6

Thanks Ivan,

Elasticsearch uses consistent hashing, so you cannot change the number of
shards for an index.

So I understand that, once the index is created, it is only possible to
scale nodes, clusters and replicas up and down, but not shards.
Interesting.

IMHO, shard values in the high single digits (5-10) is a great starting
point. Even with a single node cluster, the default number of shards (5)
should not cause any performance issues.

I am worried about the roughly 200 established sockets on my machine
(running 2 ES instances), since I suspect they are causing some random loss
of highlighting information. I was wondering whether setting just 1
shard/0 replicas on each ES instance would get rid of these unwanted
sockets. Why is it advised to start with 5-10 shards rather than with 1
shard/0 replicas on each of the 2 ES instances? Any reason?

Thanks


(Ivan Brusic) #7

An increase in shards will not cause an increase in sockets used. Each node
shard action is responsible for gathering the responses from each shard at
the file level before sending the response back to the client.

Since each shard is actually its own Lucene index, an increase in shards
will increase metrics at the IO level, especially the number of open file
descriptors.

It is advised to start off with 5 because that allows you to scale an
index horizontally without needing to reindex. You can grow your
cluster from 1 to 5 nodes and each node will hold a piece of the index
instead of the entire index. Beyond that number, you can distribute the
index with more replicas. More shards increase availability IMHO. Ultimately
you do not want large shards for performance reasons.
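
To illustrate the point about starting with 5 shards, here is a minimal Java sketch. It assumes a simplified round-robin placement of primary shards with no replicas, which is not Elasticsearch's actual allocation logic, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardSpread {
    // Simplified round-robin placement of primary shards (an assumption for
    // illustration only, not Elasticsearch's real allocator).
    static List<List<Integer>> spread(int shards, int nodes) {
        List<List<Integer>> byNode = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            byNode.add(new ArrayList<>());
        }
        for (int s = 0; s < shards; s++) {
            byNode.get(s % nodes).add(s);
        }
        return byNode;
    }

    // Number of nodes that end up holding at least one shard.
    static long nodesHoldingData(int shards, int nodes) {
        return spread(shards, nodes).stream().filter(held -> !held.isEmpty()).count();
    }

    public static void main(String[] args) {
        for (int nodes = 1; nodes <= 6; nodes++) {
            System.out.println(nodes + " node(s), 5 shards -> " + spread(5, nodes));
        }
        System.out.println("3 node(s), 1 shard -> " + spread(1, 3));
    }
}
```

Under this assumption, a 5-shard index keeps gaining from new nodes up to the fifth one, while a 1-shard index leaves every additional node idle unless replicas are added.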

--
Ivan


(pellyadolfo) #8

Thanks Ivan, makes sense. Still could not test how sockets relate to shards
and why I automatically get 10 established sockets when opening a client:

node = builder.client(clientOnly).data(!clientOnly).local(local).node();

client = node.client();

on default ES configuration, and many many more sockets after (up to 200),
and how this number changes when increasing/decreasing number of shards,

but happily I managed to fix the initial issue of highlighting info being
randomly lost with a config change, as described here:

https://groups.google.com/d/msg/elasticsearch/3t6UL_vzM7o/TLnV2m2B1NAJ

so sockets do not look like an issue anymore.

Regards.


(Brian Yoder) #9

Adolfo,

Still could not test how sockets relate to shards and why I automatically
get 10 established sockets when opening a client:

node = builder.client(clientOnly).data(!clientOnly).local(local).node();

client = node.client();

on default ES configuration, and many many more sockets after (up to 200),
and how this number changes when increasing/decreasing number of shards,

Of course, your application should create only one client and then let all
threads within the application share that one client. Each client,
especially the NodeClient, typically creates a thread pool behind it. It's
a very heavy-weight object, so do not create more than one of them. But
it's perfectly thread-safe and can (should) be used by as many threads in
your application as desired.
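
One common way to share a single instance across threads in Java is the initialization-on-demand holder idiom, sketched below. SearchClient here is a hypothetical stand-in that only counts how many instances are created; it is not the Elasticsearch Client, and in a real application the holder would wrap whatever client the node or transport builder returns.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedClientDemo {
    // Hypothetical stand-in for a heavyweight, thread-safe client.
    static final class SearchClient {
        static final AtomicInteger CREATED = new AtomicInteger();
        SearchClient() { CREATED.incrementAndGet(); }
        String search(String query) { return "results for " + query; }
    }

    // The JVM initializes Holder (and INSTANCE) once, on first use, thread-safely.
    private static final class Holder {
        static final SearchClient INSTANCE = new SearchClient();
    }

    static SearchClient client() {
        return Holder.INSTANCE;
    }

    // Runs `workers` tasks that all go through the same shared client and
    // returns how many client instances were created in total.
    static int runWorkers(int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            pool.execute(() -> client().search("query-" + id));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return SearchClient.CREATED.get();
    }

    public static void main(String[] args) {
        System.out.println("clients created: " + runWorkers(8));
    }
}
```

However many worker threads call client(), the constructor runs only once, so any thread pool or sockets behind the client would be created a single time.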

Brian


(Ivan Brusic) #10

BTW, I was very wrong when I mentioned that Elasticsearch uses consistent
hashing. It uses modulo-based hashing, which is why the number of shards
cannot change since the modulo is fixed. Working on too many things at once
while replying. :)
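
A minimal Java sketch of why a fixed modulo pins the shard count: a document's shard is its hash modulo the number of shards, so changing that number sends most existing documents to a different shard. String.hashCode() is used here only as a stand-in for Elasticsearch's own hash function, and the class, method names and "doc-N" ids are invented for the example.

```java
final class ModuloRouting {
    // Stand-in routing: String.hashCode() replaces Elasticsearch's real hash.
    static int shardFor(String docId, int numberOfShards) {
        return Math.floorMod(docId.hashCode(), numberOfShards);
    }

    // Counts ids whose target shard differs between the two shard counts.
    static int moved(int docs, int oldShards, int newShards) {
        int moved = 0;
        for (int i = 0; i < docs; i++) {
            String id = "doc-" + i;
            if (shardFor(id, oldShards) != shardFor(id, newShards)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        System.out.println("doc-42 with 5 shards -> shard " + shardFor("doc-42", 5));
        System.out.println(moved(10_000, 5, 6)
                + " of 10000 ids would map to a different shard with 6 shards");
    }
}
```

Documents that were indexed under the old shard count would be looked up on the wrong shard after such a change, which is why the data has to be reindexed into a new index instead.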

