Elasticsearch sharding


(mbx) #1

Hi,
for my application i've tested Lucene, but quickly had problems with

10 million documents in one index on one server.
Query latency growed up to several seconds and for some keywords
result sets had been very large - up to out of memory.
Now i'm looking for a better solution.
During my research it seems that sharding is state of the art (Solr,
Elasticsearch) to work with large indices ( >20 Mio docs).

My Questions:
Why do i need sharding?
If i want to search over several shards (e.g. jan2010 upto dec2010)
i've to merge the results. Isn't it the same work than searching in a
large 2010-index?

How does sharding work for search? I think understood how the
documents are hashed and distributed to different shards but where are
they merged?

Sorry for the beginner questions.
Tank you!
-mbx


(Karussell) #2

Results are merged when querying. there are different query types.
take a look into:

http://www.marcsturlese.com/2010/02/12/elasticsearch/

Hitting several indices instead of one allows you to use several CPUs
or to put the shards on different servers.

More on that subject should be explained by the author of ES :slight_smile: (i'm
not an expert ...)

On 14 Feb., 15:04, mbx myze...@googlemail.com wrote:

Hi,
for my application i've tested Lucene, but quickly had problems with>10 million documents in one index on one server.

Query latency growed up to several seconds and for some keywords
result sets had been very large - up to out of memory.
Now i'm looking for a better solution.
During my research it seems that sharding is state of the art (Solr,
Elasticsearch) to work with large indices ( >20 Mio docs).

My Questions:
Why do i need sharding?
If i want to search over several shards (e.g. jan2010 upto dec2010)
i've to merge the results. Isn't it the same work than searching in a
large 2010-index?

How does sharding work for search? I think understood how the
documents are hashed and distributed to different shards but where are
they merged?

Sorry for the beginner questions.
Tank you!
-mbx


(Shay Banon) #3

Thats the gist of it. Spreading a the search over multiple shards (that exist on several machines) will mean searching over a smaller index (per shard), and then joining the results back.
On Monday, February 14, 2011 at 8:51 PM, Karussell wrote:

Results are merged when querying. there are different query types.
take a look into:

http://www.marcsturlese.com/2010/02/12/elasticsearch/

Hitting several indices instead of one allows you to use several CPUs
or to put the shards on different servers.

More on that subject should be explained by the author of ES :slight_smile: (i'm
not an expert ...)

On 14 Feb., 15:04, mbx myze...@googlemail.com wrote:

Hi,
for my application i've tested Lucene, but quickly had problems with>10 million documents in one index on one server.

Query latency growed up to several seconds and for some keywords
result sets had been very large - up to out of memory.
Now i'm looking for a better solution.
During my research it seems that sharding is state of the art (Solr,
Elasticsearch) to work with large indices ( >20 Mio docs).

My Questions:
Why do i need sharding?
If i want to search over several shards (e.g. jan2010 upto dec2010)
i've to merge the results. Isn't it the same work than searching in a
large 2010-index?

How does sharding work for search? I think understood how the
documents are hashed and distributed to different shards but where are
they merged?

Sorry for the beginner questions.
Tank you!
-mbx


(system) #4