Need some help / ideas about architecture

Hi everyone,
I'm working on a new webapp dedicated to searching documents on
several criteria.

We intend to use Elasticsearch as our search engine and are very keen on
it (@Kimchy: fantastic job! ;) ).

  • the documents we want to index are quite complex: several data levels,
    so we use nested fields in our mapping (a sketch follows this list);

  • the space used in the source (3 CouchDB databases) is 900 GB per year;

  • these 3 CouchDB databases are indexed in ES on 1 cluster: the space
    used by all the ES indexes is about 2.1 TB per year;

  • we have an index per month per source database (12 indexes per year
    per source database);

  • each index has 2 types, 5 shards, and 1 replica;

  • the total number of indexed documents is 15 million;

  • the indexed data are stored on a disk array (RAID-5);

  • several fields (about 150) are open for search;

  • we intend to use facets in queries (this will be a new functionality
    and could increase the number of queries and of logged-in users);

  • each query is limited to a search period of at most 1 year;

  • 1,000 different users log in every day and run an average of 2 queries
    each.
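
To make this concrete, here is a simplified sketch of how we create one
monthly index (a minimal Python sketch using the requests library; the
index name, type name, and field names are illustrative placeholders, not
our real mapping, and it assumes a node on localhost:9200):

    import json
    import requests

    ES = "http://localhost:9200"   # assumption: address of one ES node

    # One index per month per source, e.g. "sourcea-2011-12" (placeholder).
    body = {
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
        },
        "mappings": {
            "document": {                  # one of the two types in each index
                "properties": {
                    "created_at": {"type": "date"},
                    "sections": {          # several data levels -> nested field
                        "type": "nested",
                        "properties": {
                            "title": {"type": "string"},
                            "body": {"type": "string"},
                        },
                    },
                },
            },
        },
    }

    resp = requests.put("%s/%s" % (ES, "sourcea-2011-12"),
                        data=json.dumps(body),
                        headers={"Content-Type": "application/json"})
    print(resp.status_code, resp.text)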


We are now facing the question of the initial architecture (number of
servers, number of CPUs, power, RAM, etc.).
We need a scalable solution because we will have to index 4 years of data
without degrading performance.
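
For what it is worth, the back-of-envelope arithmetic behind these numbers
looks like this (a sketch in Python; it assumes the 2.1 TB/year figure is
the on-disk total including the replica copies, which we have not verified):

    # Back-of-envelope sizing for 4 years of data, using the numbers above.
    years = 4
    sources = 3
    indices_per_source_per_year = 12   # one index per month
    shards_per_index = 5
    copies = 2                         # 1 primary + 1 replica per shard

    total_indices = years * sources * indices_per_source_per_year       # 144
    total_shard_copies = total_indices * shards_per_index * copies      # 1,440

    tb_per_year = 2.1                  # measured ES disk usage per year
    total_tb = years * tb_per_year     # ~8.4 TB, if replicas are already included
    gb_per_shard_copy = total_tb * 1024 / total_shard_copies            # ~6 GB

    print(total_indices, total_shard_copies, total_tb,
          round(gb_per_shard_copy, 1))

Whatever the hardware, that is roughly 1,440 shard copies (i.e. Lucene
indices) to keep open across the cluster, which is part of why we find the
node count and RAM so hard to guess in the abstract.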

Does anyone have an idea of the best approach?
Any help would be very useful and appreciated.
Many thanks

Hi,

You could sample a portion of the data and base an estimate on that, no?
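
For instance, something along these lines (a rough, hypothetical sketch in
Python using the requests library and the index stats API; the index name
is a placeholder, and the exact layout of the stats response differs
between ES versions):

    import requests

    ES = "http://localhost:9200"
    sample = "sourcea-2011-12"   # placeholder: one fully indexed month of one source

    # Read the on-disk size of the sample index from the index stats API.
    stats = requests.get("%s/%s/_stats" % (ES, sample)).json()
    # NOTE: the exact JSON layout below varies between ES versions.
    size_bytes = stats["indices"][sample]["primaries"]["store"]["size_in_bytes"]
    size_gb = size_bytes / 1024.0 ** 3

    # Extrapolate: 3 sources x 12 months x 4 years, doubled for 1 replica.
    total_gb = size_gb * 3 * 12 * 4 * 2
    print("sample month: %.1f GB; estimated total on disk: %.1f GB"
          % (size_gb, total_gb))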

--
Regards,
Lukas


Hello Lukáš,
so far we have only managed to index 1 month of 1 source database, on a
cluster with 2 nodes (5 GB RAM). Each node was running on a separate
server, but those servers were also running the CouchDB jobs, so it is not
a representative configuration and the results may be misleading.

As for the production architecture, we have to define it now in order to
buy the machines, so extrapolating from a single month feels quite risky.
I was hoping someone had experience to share ...

best regards,
Cyril


Sadly, I do not think anybody can give you a guaranteed answer, because it
depends on many factors; it is not only about the number of documents and
the number of queries. It also matters how exactly you structure the
documents before indexing, which analysis you run on them, and which types
of queries you use (simple, facets, prefix, etc.). I would also bet that
if you have only just started with ES, sooner or later you will find that
you want to change your document mapping and analysis or rewrite some
queries, and this will have an impact as well. You should also consider
how you want to handle upgrades if complete reindexing becomes necessary.
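
To illustrate why the query shape matters: a terms facet over a one-year
window touches all 12 monthly indices of a source (60 shards) at once.
A hedged sketch of such a query in Python, assuming the facets API of that
era and placeholder index and field names:

    import json
    import requests

    ES = "http://localhost:9200"
    # A one-year window means querying all 12 monthly indices of a source.
    indices = ",".join("sourcea-2011-%02d" % m for m in range(1, 13))

    query = {
        "query": {
            "filtered": {                      # filtered query, as in 0.x-era ES
                "query": {"match_all": {}},
                "filter": {
                    "range": {
                        "created_at": {"from": "2011-01-01", "to": "2011-12-31"}
                    }
                },
            }
        },
        "facets": {                            # facets API (predates aggregations)
            "by_category": {"terms": {"field": "category"}}
        },
        "size": 10,
    }

    resp = requests.post("%s/%s/_search" % (ES, indices),
                         data=json.dumps(query),
                         headers={"Content-Type": "application/json"})
    print(json.dumps(resp.json(), indent=2))

Each such request fans out to all 60 shards, and terms facets load field
data into memory, which is exactly the kind of thing that drives your RAM
requirements.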
Can't you, for example, rent AWS capacity for some time and do your
capacity planning there? Maybe that would help.

But maybe someone can share some experience; I personally do not have
experience with data that big in ES.

--
Regards,
Lukas
