Need some help / ideas about architecture

Hi everyone,
I'm working on a new webapp dedicated to searching documents on
several criteria.

We intend to use Elasticsearch as our search engine and are very keen on
it (@Kimchy: fantastic job! ;) ).

  • the documents we want to index are quite complex: several data levels,
    so we use nested fields in our mapping (a sketch follows this list);

  • the space used in the source (3 CouchDB databases) is 900 GB per year;

  • these 3 CouchDB databases are indexed in ES on 1 cluster: the space
    used by all the ES indexes is about 2.1 TB per year;

  • we have an index per month per source database (12 indexes per year
    per source database);

  • each index has 2 types, 5 shards, and 1 replica;

  • the total number of indexed documents is 15 million;

  • the indexed data are stored on a disk array (RAID-5);

  • several fields (about 150) are open for search;

  • we intend to use facets in queries (this will be a new functionality
    and could increase the number of queries and of logged-in users);

  • each query is limited to a search period of at most 1 year;

  • 1,000 different users log in every day and run an average of 2 queries
    each.
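
To make this concrete, here is a simplified sketch of how we create one
monthly index (a minimal Python sketch using the requests library; the
index name, type name, and field names are illustrative placeholders, not
our real mapping, and it assumes a node on localhost:9200):

    import json
    import requests

    ES = "http://localhost:9200"   # assumption: address of one ES node

    # One index per month per source, e.g. "sourcea-2011-12" (placeholder).
    body = {
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
        },
        "mappings": {
            "document": {                  # one of the two types in each index
                "properties": {
                    "created_at": {"type": "date"},
                    "sections": {          # several data levels -> nested field
                        "type": "nested",
                        "properties": {
                            "title": {"type": "string"},
                            "body": {"type": "string"},
                        },
                    },
                },
            },
        },
    }

    resp = requests.put("%s/%s" % (ES, "sourcea-2011-12"),
                        data=json.dumps(body),
                        headers={"Content-Type": "application/json"})
    print(resp.status_code, resp.text)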


We are now facing the question of the initial architecture (number of
servers, number of CPUs, power, RAM, etc.).
We need a scalable solution because we will have to index 4 years of data
without degrading performance.
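
For what it is worth, the back-of-envelope arithmetic behind these numbers
looks like this (a sketch in Python; it assumes the 2.1 TB/year figure is
the on-disk total including the replica copies, which we have not verified):

    # Back-of-envelope sizing for 4 years of data, using the numbers above.
    years = 4
    sources = 3
    indices_per_source_per_year = 12   # one index per month
    shards_per_index = 5
    copies = 2                         # 1 primary + 1 replica per shard

    total_indices = years * sources * indices_per_source_per_year       # 144
    total_shard_copies = total_indices * shards_per_index * copies      # 1,440

    tb_per_year = 2.1                  # measured ES disk usage per year
    total_tb = years * tb_per_year     # ~8.4 TB, if replicas are already included
    gb_per_shard_copy = total_tb * 1024 / total_shard_copies            # ~6 GB

    print(total_indices, total_shard_copies, total_tb,
          round(gb_per_shard_copy, 1))

Whatever the hardware, that is roughly 1,440 shard copies (i.e. Lucene
indices) to keep open across the cluster, which is part of why we find the
node count and RAM so hard to guess in the abstract.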

Does anyone have an idea of the best approach?
Any help would be very useful and appreciated.
Many thanks

Hi,

You could sample a portion of the data and base an estimate on that, no?
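
For instance, something along these lines (a rough, hypothetical sketch in
Python using the requests library and the index stats API; the index name
is a placeholder, and the exact layout of the stats response differs
between ES versions):

    import requests

    ES = "http://localhost:9200"
    sample = "sourcea-2011-12"   # placeholder: one fully indexed month of one source

    # Read the on-disk size of the sample index from the index stats API.
    stats = requests.get("%s/%s/_stats" % (ES, sample)).json()
    # NOTE: the exact JSON layout below varies between ES versions.
    size_bytes = stats["indices"][sample]["primaries"]["store"]["size_in_bytes"]
    size_gb = size_bytes / 1024.0 ** 3

    # Extrapolate: 3 sources x 12 months x 4 years, doubled for 1 replica.
    total_gb = size_gb * 3 * 12 * 4 * 2
    print("sample month: %.1f GB; estimated total on disk: %.1f GB"
          % (size_gb, total_gb))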

--
Regards,
Lukas


Hello Lukáš,
so far we have only managed to index 1 month of 1 source database, on a
cluster with 2 nodes (5 GB RAM). Each node was running on a separate
server, but those servers were also running the CouchDB jobs, so it is not
a representative configuration and the results may be misleading.

As for the production architecture, we have to define it now in order to
buy the machines, so extrapolating from a single month feels quite risky.
I was hoping someone had experience to share ...

best regards,
Cyril


Sadly, I do not think anybody can give you a guaranteed answer, because it
depends on many factors; it is not only about the number of documents and
the number of queries. It also matters how exactly you structure the
documents before indexing, which analysis you run on them, and which types
of queries you use (simple, facets, prefix, etc.). I would also bet that
if you have only just started with ES, sooner or later you will find that
you want to change your document mapping and analysis or rewrite some
queries, and this will have an impact as well. You should also consider
how you want to handle upgrades if complete reindexing becomes necessary.
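
To illustrate why the query shape matters: a terms facet over a one-year
window touches all 12 monthly indices of a source (60 shards) at once.
A hedged sketch of such a query in Python, assuming the facets API of that
era and placeholder index and field names:

    import json
    import requests

    ES = "http://localhost:9200"
    # A one-year window means querying all 12 monthly indices of a source.
    indices = ",".join("sourcea-2011-%02d" % m for m in range(1, 13))

    query = {
        "query": {
            "filtered": {                      # filtered query, as in 0.x-era ES
                "query": {"match_all": {}},
                "filter": {
                    "range": {
                        "created_at": {"from": "2011-01-01", "to": "2011-12-31"}
                    }
                },
            }
        },
        "facets": {                            # facets API (predates aggregations)
            "by_category": {"terms": {"field": "category"}}
        },
        "size": 10,
    }

    resp = requests.post("%s/%s/_search" % (ES, indices),
                         data=json.dumps(query),
                         headers={"Content-Type": "application/json"})
    print(json.dumps(resp.json(), indent=2))

Each such request fans out to all 60 shards, and terms facets load field
data into memory, which is exactly the kind of thing that drives your RAM
requirements.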
Can't you, for example, rent AWS capacity for some time and do your
capacity planning there? Maybe that would help.

But maybe someone can share some experience; I personally do not have
experience with data that big in ES.

--
Regards,
Lukas
