ElasticSearch on Amazon EC2 tips

Hi,

I'm researching the possibility of moving from Solr to ElasticSearch,
because to be honest it looks really cool :) and I can't wait to start
playing with it.
Before I start, I would like to get some tips/suggestions from more
experienced users. The things I'm interested in are:

  1. Which instance size gives the best price/performance? (For example,
    an m1.xlarge instance is twice as expensive as an m1.large, but will it
    proportionally increase the performance of the ES cluster, in terms of
    response time and the number of concurrent requests it can process?)
  2. How many shards can I run on each node? (The default is 5; how many
    more can I use without affecting performance?)
  3. Is it better to have a separate cluster per index, or does it not
    matter (from a performance point of view)?
  4. EBS vs. ephemeral vs. SSD drives: how big is the performance difference?
  5. Are ephemeral drives safe enough with a replication factor of, let's
    say, 3?
  6. How consistent is the performance of ES on EC2? (Will response time
    occasionally spike above 2-3 sec because of commits to the index?)

Statistics from my current Solr instances:

Number of instances: 3 (m1.xlarge)
Number of documents: ~15m
Requests per second: ~10
Results page size: 16 documents
Average total count per query: ~100k documents

Queries differ quite a lot from one another, so they are not easy to cache.
We are using faceting, filtering by field, custom sorting, ...

Any input will be greatly appreciated.
Thx,

--

Hi,

One of the biggest issues with EC2 in general is availability. Instances
can go down. EBS volumes are not invulnerable either. When individual
instances or volumes go down you are typically fine, because replication
saves you.

But when a whole zone has a problem, which seems to happen at least
once a year, at least in the US East region in Northern Virginia, then
replication within the same zone doesn't help. Then you start thinking
about having nodes in multiple zones, and with that comes extra cost.
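
To make the multi-zone point concrete, here is a minimal sketch that flags shards whose copies all live in one availability zone. The function name, the input shape (pairs of shard id and zone) and the example layout are assumptions made for illustration; they are not taken from any ElasticSearch API.

```python
from collections import defaultdict


def shards_at_risk(copies):
    """Return the ids of shards whose copies all sit in a single zone.

    `copies` is an iterable of (shard_id, zone) pairs, one pair per shard
    copy (primary or replica). This input format is hypothetical.
    """
    zones_by_shard = defaultdict(set)
    for shard_id, zone in copies:
        zones_by_shard[shard_id].add(zone)
    # A shard survives a zone outage only if it has copies in 2+ zones.
    return sorted(s for s, zones in zones_by_shard.items() if len(zones) < 2)


# Hypothetical layout: shard 0 spans two zones, shard 1 does not.
layout = [
    (0, "us-east-1a"), (0, "us-east-1b"),
    (1, "us-east-1a"), (1, "us-east-1a"),
]
print(shards_at_risk(layout))
```

ElasticSearch's shard allocation awareness settings are the usual way to ask the cluster for this kind of spread; check the documentation for the version you run.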

  1. depends, you'll want to test
  2. depends, you'll want to test
  3. you can have multiple indices per cluster, as long as you are not
    overwhelming it, so again it depends on the details
  4. didn't test it, but I imagine the difference is big. That said, Amazon
    offers EBS volumes with guaranteed IOPS and, I think, just announced a
    new option today
  5. yes, typically, but see the paragraph above
  6. there is no consistency, which is why it's hard to test performance on
    EC2. Look up info on the "noisy neighbour" problem.
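
Because performance on shared hardware varies, an average response time can hide the occasional multi-second spike. A rough sketch of summarising measured latencies by percentile follows; the helper names and the sample numbers are made up for illustration, and how the latencies are collected is left out.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Report the median, the tail percentiles and the worst case."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
    }


# Made-up response times in milliseconds, with one spike.
latencies = [40, 42, 45, 41, 39, 44, 43, 2500, 46, 40]
print(summarize(latencies))
```

Running the same measurement at different times of day, and on more than one instance, gives a better picture than a single run.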

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

In reply to Karol Gwaj's post of Wednesday, November 7, 2012, 1:42:20 PM UTC-5.

--

Karol Gwaj wrote:

  1. what instance size will be the best considering economy/power
    (so for example m1.xlarge instance is twice as expensive as
    m1.large, but will it increase proportionally performance of ES
    cluster (in terms of response time, and amount of concurrent
    requests it can process))

Totally depends on your data. Will probably have to experiment here.
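
One way to frame such an experiment is cost per request: a bigger instance pays off only if its measured throughput grows at least as fast as its price. The sketch below is an assumption-laden illustration; the prices are relative units and the throughput figures are invented, so substitute your own measurements. It looks at throughput only, not response time.

```python
def cost_per_million_requests(price_per_hour, requests_per_second):
    """Cost of serving one million requests at a sustained request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    requests_per_hour = requests_per_second * 3600
    return price_per_hour / requests_per_hour * 1_000_000


# Hypothetical inputs: the xlarge costs twice as much (as in the question)
# but, in this invented measurement, sustains only 1.5x the request rate.
large = cost_per_million_requests(price_per_hour=1.0, requests_per_second=100)
xlarge = cost_per_million_requests(price_per_hour=2.0, requests_per_second=150)
print(f"large: {large:.2f}, xlarge: {xlarge:.2f} (relative units)")
```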

  2. how many shards i can run on every node (by default it is 5,
    how many more i can use without affecting performance)

You can likely go way higher than single digits. You don't want a
design where the number of shards can grow indefinitely, but don't be
afraid of using them. Keep in mind that for a single index you don't
need more than one shard per node. Watch some overview talks like this
one to get a better idea of what these concepts mean.

http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
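
As a rough illustration of the shard arithmetic, the sketch below counts how many shard copies land on each node for a given index layout and builds the settings body used when creating an index. The helper names are hypothetical; `number_of_shards` and `number_of_replicas` are the index settings ElasticSearch uses for this.

```python
import json
import math


def shard_copies_per_node(primaries, replicas, nodes):
    """Return (total shard copies, copies on the busiest node)."""
    total = primaries * (1 + replicas)
    return total, math.ceil(total / nodes)


def index_settings(primaries, replicas):
    """Build a create-index request body with explicit shard counts."""
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas,
        }
    }


# Default-style layout (5 primaries, 1 replica) on a 3-node cluster.
print(shard_copies_per_node(primaries=5, replicas=1, nodes=3))
# One primary per node, as suggested above for a single index.
print(json.dumps(index_settings(primaries=3, replicas=1), indent=2))
```

Note that the number of primary shards is fixed when the index is created, while the replica count can be changed later, so it is worth settling the primary count before loading data.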

  3. is it better to have separate cluster per index or it doesn't
    matter (from performance point of view)

Typically not; one cluster can serve several indices.

  4. EBS vs ephemeral vs SSD drives (how big is the performance difference?)

Generally speaking, for a lot of data that can't fit in memory, SSDs
will improve your disk seeks. But you really have to profile to
determine if the cost is worth it.

  5. are ephemeral drives safe enough with replication factor lets say 3

"Safe enough" can only be defined by you. :) It isn't terribly
likely that three EC2 nodes will disappear or get corrupted at the same
time, but I've seen it before. Normal storage practices apply here.
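
A back-of-the-envelope way to put a number on "safe enough" is sketched below. It assumes that each copy is lost independently with the same probability, and the probability used is invented. Both assumptions are optimistic: a zone-wide outage, as mentioned earlier in the thread, takes out nodes together.

```python
def prob_all_copies_lost(p_node_loss, copies):
    """Probability that every copy is lost, assuming independent failures."""
    if not 0 <= p_node_loss <= 1:
        raise ValueError("p_node_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_node_loss ** copies


# Hypothetical: a 1% chance of losing any one node in some time window.
for copies in (1, 2, 3):
    print(copies, prob_all_copies_lost(0.01, copies))
```

Whatever the number comes out to, it does not replace backups of data you cannot rebuild.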

  6. how consistent is the performance of ES on EC2 (will response
    time spike from time to time above 2-3 sec because of some commits
    to the index?)

Depends on usage. Heavy indexing can affect search performance, but
more replicas help.

queries are quite different one from another, so they are not easy to cache
we are using faceting, filtering by field, custom sorting, ...

Faceting and sorting will typically use more memory than simple
queries. Start simple and gradually add functionality & data.
You'll get a feel for where limits are and whether you need to move
to bigger hardware.
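
The "start simple and add gradually" approach could look like the sketch below, which builds a search request body one feature at a time. The field names (`title`, `category`, `price`) are hypothetical, the page size of 16 comes from the original post, and the body uses the newer `bool` filter and `aggs` syntax rather than the facets API that was current when this thread was written.

```python
import json


def build_search(text, size=16, category=None, facet_field=None, sort_field=None):
    """Build a search body; each optional argument adds one feature."""
    body = {
        "size": size,
        "query": {"bool": {"must": [{"match": {"title": text}}]}},
    }
    if category is not None:
        # Filtering by field: does not affect scoring.
        body["query"]["bool"]["filter"] = [{"term": {"category": category}}]
    if facet_field is not None:
        # Faceting: a terms aggregation counts documents per value.
        body["aggs"] = {"by_" + facet_field: {"terms": {"field": facet_field}}}
    if sort_field is not None:
        # Custom sorting instead of relevance ordering.
        body["sort"] = [{sort_field: {"order": "desc"}}]
    return body


# Step 1: plain query. Step 2: add filter, facet and sort, then re-measure.
print(json.dumps(build_search("laptop"), indent=2))
print(json.dumps(
    build_search("laptop", category="computers",
                 facet_field="category", sort_field="price"),
    indent=2,
))
```

Measuring memory use and response time after each step shows which feature is the expensive one for your data.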

-Drew

--

Thx for help guys

--