Elastic search


(Mohit Anchlia) #1

I am new to Elasticsearch and trying to understand its concepts. I am
trying to find information on:

  1. How it distributes and replicates data for HA.
  2. Where does it store the data?
  3. Optimization techniques

(Paul Loy) #2
  1. ES shards and replicates indexes. It is what I would call 'statically
    sharded' - that is you specify up front the number of shards and replicas
    you want and that's how many there will be. Shards and replicas are then
    allocated to nodes in your cluster.

  2. Up to you:
    http://www.elasticsearch.org/guide/reference/index-modules/store.html

  3. Depends upon your use case. Everyone's data and everyone's indexes will
    be different.

On Sun, Aug 7, 2011 at 8:04 PM, Mo mohitanchlia@gmail.com wrote:

I am new to Elasticsearch and trying to understand its concepts. I am
trying to find information on:

  1. How it distributes and replicates data for HA.
  2. Where does it store the data?
  3. Optimization techniques

--

Paul Loy
paul@keteracel.com
http://uk.linkedin.com/in/paulloy


(Mohit Anchlia) #3

On Sun, Aug 7, 2011 at 3:56 PM, Paul Loy keteracel@gmail.com wrote:

  1. ES shards and replicates indexes. It is what I would call 'statically
    sharded' - that is you specify up front the number of shards and replicas
    you want and that's how many there will be. Shards and replicas are then
    allocated to nodes in your cluster.

Is there a link where I can read how to configure that? Also, does it
make it HA, e.g. if one node goes down, does searching keep working
without impact?

  2. Up to you:
    http://www.elasticsearch.org/guide/reference/index-modules/store.html

How do I decide which one to use? I also see it integrates with CouchDB.
With TBs of data, is it OK to keep it on the file system?

  3. Depends upon your use case. Everyone's data and everyone's indexes will
    be different.

Are there any general guidelines that apply to everyone, or that at
least give a little more insight into designing this efficiently?



(Paul Loy) #4

On Mon, Aug 8, 2011 at 6:34 AM, Mohit Anchlia mohitanchlia@gmail.com wrote:

On Sun, Aug 7, 2011 at 3:56 PM, Paul Loy keteracel@gmail.com wrote:

  1. ES shards and replicates indexes. It is what I would call 'statically
    sharded' - that is you specify up front the number of shards and replicas
    you want and that's how many there will be. Shards and replicas are then
    allocated to nodes in your cluster.

Is there a link where I can read how to configure that? Also, does it
make it HA, e.g. if one node goes down, does searching keep working
without impact?

Basic configuration will be the index settings where you can set the number
of shards and the number of replicas of an index.

http://www.elasticsearch.org/guide/reference/api/admin-indices-create-index.html

What's awesome with ES is that you can specify this on a per index basis. So
more critical indices can have a higher number of replicas.
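
As a rough sketch of what "per index" means here, the snippet below builds (but does not send) the create-index request for two indices with different replica counts. The `number_of_shards` and `number_of_replicas` settings are the ones described on the create-index page above; the host, the index names, and the helper `create_index_request` are made up for illustration.

```python
import json
from urllib.request import Request


def create_index_request(base_url, index, shards, replicas):
    """Build (but do not send) an HTTP PUT that creates an index.

    base_url and index are placeholders; only the two settings keys
    come from the Elasticsearch create-index documentation.
    """
    body = {
        "settings": {
            "number_of_shards": shards,      # fixed when the index is created
            "number_of_replicas": replicas,  # extra copies of every shard
        }
    }
    return Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical indices: the critical one gets more replicas than the other.
critical = create_index_request("http://localhost:9200", "orders", shards=5, replicas=2)
bulk_logs = create_index_request("http://localhost:9200", "logs", shards=5, replicas=0)

print(critical.get_method(), critical.full_url, critical.data.decode("utf-8"))
```

Passing either request to `urllib.request.urlopen` would actually send it to a running node; that step is left out so the sketch runs without a cluster.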

Regarding HA, the way I understand it (and Shay can probably correct me if
I'm wrong), there is a 'master' node for a shard. If that node dies, another
node with a replica is voted the 'master'. So searches should not be
impacted if a node goes down. Obviously, if you had just enough nodes for one
per shard and a node goes down, then one node will now have to serve searches
for 2 shards and so may be slower. So while you can still run searches, you'll
need to think about redundancy in your cluster.
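
To make the failover idea concrete, here is a toy model of it. This is not how Elasticsearch actually allocates shards or elects anything: the round-robin placement, the node names, and all function names are assumptions for illustration only. (The Elasticsearch docs call the 'master' copy of a shard the "primary".)

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    shard_id: int
    primary: str                                   # node holding the primary copy
    replicas: list = field(default_factory=list)   # nodes holding replica copies


def allocate(nodes, num_shards, num_replicas):
    """Toy round-robin placement: copies of one shard land on distinct nodes."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("not enough nodes to keep copies on distinct nodes")
    shards = []
    for s in range(num_shards):
        holders = [nodes[(s + k) % len(nodes)] for k in range(num_replicas + 1)]
        shards.append(Shard(s, holders[0], holders[1:]))
    return shards


def fail_node(shards, dead):
    """Drop copies on the dead node; promote a replica where the primary was lost."""
    for sh in shards:
        sh.replicas = [n for n in sh.replicas if n != dead]
        if sh.primary == dead:
            if not sh.replicas:
                raise RuntimeError(f"shard {sh.shard_id} lost all of its copies")
            sh.primary = sh.replicas.pop(0)
    return shards


def primaries_per_node(shards):
    """Count how many primaries each surviving node now serves."""
    counts = {}
    for sh in shards:
        counts[sh.primary] = counts.get(sh.primary, 0) + 1
    return counts


shards = allocate(["node-a", "node-b", "node-c"], num_shards=3, num_replicas=1)
fail_node(shards, "node-b")
print(primaries_per_node(shards))
```

With three nodes, three shards and one replica each, losing `node-b` leaves every shard available, but one surviving node ends up holding two primaries, which is the "one node will now have to do 2 shards" situation described above.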

  2. Up to you:
    http://www.elasticsearch.org/guide/reference/index-modules/store.html

How do I decide which one to use? I also see it integrates with CouchDB.
With TBs of data, is it OK to keep it on the file system?

This will be better answered by one of the guys on this list who also
push TBs of data. I'm only at the GBs size, so I use S3 for a gateway just
to be sure. I guess the quick answer is you can scale out to meet your
needs! If FS is a bottleneck you can add more nodes!?

  3. Depends upon your use case. Everyone's data and everyone's indexes will
    be different.

Are there any general guidelines that apply to everyone, or that at
least give a little more insight into designing this efficiently?

Lots, and it really is dependent on your data and how you want to search it.
One tip I've used is to use filters as much as possible, which seems to
have given us a very stable, low-latency ES cluster.
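
A minimal sketch of what "use filters" can look like in a search body. The field names and the helper `filtered_search` are hypothetical. Recent Elasticsearch releases express this as a `bool` query with a `filter` clause, as shown here; releases from around the time of this thread used a `filtered` query for the same purpose, so adjust to your version.

```python
import json


def filtered_search(text, field, filters):
    """Build a search body: scored full-text match plus unscored exact filters.

    Clauses under "filter" only include or exclude documents; they do not
    contribute to relevance scoring.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "filter": [{"term": {name: value}} for name, value in filters.items()],
            }
        }
    }


# Hypothetical fields: free text in "description", exact value in "status".
body = filtered_search("replication", "description", {"status": "active"})
print(json.dumps(body, indent=2))
```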



(Mohit Anchlia) #5

Are there any recommendations as to when to use a DB compared to the file system?

Our use case is simple:

  1. We have tons of column names and values in NoSQL column families
    that we need search capabilities on, since Cassandra
    isn't really very good when you need lots of indexes. These are mostly
    distinct values.
  2. We have XML docs with attributes that we need to search on.
    These have low cardinality.



(Shay Banon) #6

What kind of recommendations are you after? Not sure I understand the
question properly to answer it...

On Wed, Aug 10, 2011 at 6:51 PM, Mohit Anchlia mohitanchlia@gmail.com wrote:

Are there any recommendations as to when to use a DB compared to the file system?

Our use case is simple:

  1. We have tons of column names and values in NoSQL column families
    that we need search capabilities on, since Cassandra
    isn't really very good when you need lots of indexes. These are mostly
    distinct values.
  2. We have XML docs with attributes that we need to search on.
    These have low cardinality.



(Mohit Anchlia) #7

On Sat, Aug 13, 2011 at 9:45 AM, Shay Banon kimchy@gmail.com wrote:

What kind of recommendations are you after? Not sure I understand the
question properly to answer it...

How do I decide whether to use the file system or CouchDB? What would be the
reason people would choose one over the other? Is it just because you can see
data in some form directly in the DB?



(Shay Banon) #8

Still not understanding... Use a file system or use CouchDB? How does that
relate to Elasticsearch? If it doesn't, I can still try and help :), but I need
more info: do you want to store blobs on the file system?

On Sat, Aug 13, 2011 at 8:25 PM, Mohit Anchlia mohitanchlia@gmail.com wrote:

On Sat, Aug 13, 2011 at 9:45 AM, Shay Banon kimchy@gmail.com wrote:

What kind of recommendations are you after? Not sure I understand the
question properly to answer it...

How do I decide whether to use the file system or CouchDB? What would be the
reason people would choose one over the other? Is it just because you can see
data in some form directly in the DB?



(Mohit Anchlia) #9

On Sat, Aug 13, 2011 at 12:43 PM, Shay Banon kimchy@gmail.com wrote:

Still not understanding... Use a file system or use CouchDB? How does that
relate to Elasticsearch? If it doesn't, I can still try and help :), but I need
more info: do you want to store blobs on the file system?

From what I understand, indexes are stored somewhere on disk. And
from the link http://www.elasticsearch.org/guide/reference/index-modules/store.html
it looks like you have various options. So I am trying to understand
whether they should be stored on the file system or in some DB like CouchDB.

Doesn't Elasticsearch store indexed data somewhere?



(James Cook) #10

Hi Mo,

There seems to be a disconnect between your questions and a fundamental
understanding of how ES (and Lucene) works. I think you need to read the
website a bit more, and especially take a look at the video:
http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html

Index storage is under the control of Lucene, and the store page
http://www.elasticsearch.org/guide/reference/index-modules/store.html
you link to describes your options, with simplefs and niofs being file-based,
memory being memory-based, and mmapfs being a hybrid of the two. I'm not
sure where you got the idea that indexes can also be stored in a DB like
CouchDB.
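
To illustrate, the store type is just an index setting (`index.store.type`) picked from a fixed list. The sketch below uses only the type names mentioned above; which of them are available depends on your ES version, and the helper `store_settings` is hypothetical.

```python
# Store type names as listed on the store page discussed above; availability
# varies by Elasticsearch version, so treat this table as an assumption.
STORE_TYPES = {
    "simplefs": "file-based",
    "niofs": "file-based",
    "mmapfs": "memory-mapped files",
    "memory": "memory-based",
}


def store_settings(store_type):
    """Build index settings that select a store type, rejecting unknown names."""
    if store_type not in STORE_TYPES:
        raise ValueError(f"unknown store type: {store_type!r}")
    return {"settings": {"index": {"store": {"type": store_type}}}}


print(store_settings("niofs"))

# A database is not one of the choices:
try:
    store_settings("couchdb")
except ValueError as err:
    print(err)
```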

There is the concept of a River which is a bridge between CouchDB (and
others) and ES. A River will receive push changes or periodically will pull
changes from a source (like CouchDB, not sure if CouchDB River pushes or
pulls) and index the data it receives. This is a technique that can be used
to put things for searching into ES without the developer having to
specifically index documents into ES. It has nothing to do with how data is
stored in ES.

-- jim


(Mohit Anchlia) #11

On Sun, Aug 14, 2011 at 6:00 AM, James Cook jcook@tracermedia.com wrote:

Hi Mo,

There seems to be a disconnect with your questions and some fundamental
understanding of how ES (and Lucene) works. I think you need to read the
website a bit more, especially take a look at the video:
http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html

Index storage is under the control of Lucene, and the store page you link to
describes your options with simplefs, niofs being file-based, memory being
memory-based, and mmapfs being a hybrid of the two. I'm not sure where you
got the idea that indexes can also be stored in a DB like CouchDB.

There is the concept of a River which is a bridge between CouchDB (and
others) and ES. A River will receive push changes or periodically will pull
changes from a source (like CouchDB, not sure if CouchDB River pushes or
pulls) and index the data it receives. This is a technique that can be used
to put things for searching into ES without the developer having to
specifically index documents into ES. It has nothing to do with how data is
stored in ES.

Thanks for clarifying. I will go through that presentation.


