General architecture advice or suggestion


(Nikola Ivačič) #1

Hi all,

I need some general advice on how to structure the cluster, nodes, sharding
and indexes.
I have a small database of about 8 million articles that take up about 320 GB
in JSON format, including all denormalized sub-fields (objects).
Articles are added at a rate of about 10k per day. The majority of queries
(about 90%) cover only the last 3 months (about 900k articles).
I use a modified (hacked) Apache Solr for "histogram/facet" style analysis on
article fields (a patched Apache Solr stats module).
For the latest 3 months of articles the query response should be fast
(sub-second), but responses for older articles or larger intervals can be
slower.
Articles are mostly queried and analysed by all kinds of tags and a
date-inserted field.

I've read a book on Elasticsearch and it seems very promising (though I
still haven't got my head around all of Elasticsearch's features).
I would like to get as many suggestions as possible on how to build a
cluster that would replace the current Apache Solr + MongoDB installation,
mostly to reduce sysadmin and development/maintenance complexity.
I would like to move the most frequently used data as close as possible to
the frontend nodes (limited disk space and RAM), while keeping the option of
occasionally searching the whole dataset on distant nodes (lots of disk space
but still limited RAM).

To summarize:
How would one build a cluster with a lightweight index holding only the
latest articles and a heavyweight index holding all articles?
Is it better to just forget this concept and use date-based sharding
by the book?
Is it possible to move replicas of selected shards closer to the frontend
nodes?
A substantial part of each article's JSON object consists of child or parent
objects (mostly repeated small sets (<100k in total) of related tags, authors,
publisher etc.). Is it OK to use the built-in parent-child functionality for
these article fields, given that most of the analysis is done by aggregating
over them?

Thanks for any suggestions in advance!

Nikola

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #2

Hey,

In order to keep the system simple for a first try, I would start with one
system and time-based indices (one index per month), so you can query three
indices (current month, last month, the month before last) and only query all
the data if needed, and then see if this meets your SLAs.
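
A minimal sketch of what the one-index-per-month idea could look like on the client side. The `articles-YYYY-MM` naming scheme and the helper names are assumptions made up for illustration, not something from this thread; the only Elasticsearch behaviour relied on is that a search can target several comma-separated index names in the URL path.

```python
from datetime import date
from typing import List


def month_index(day: date, prefix: str = "articles") -> str:
    """Name of the monthly index a document dated `day` would go into.

    The "articles-YYYY-MM" pattern is a hypothetical naming convention.
    """
    return f"{prefix}-{day.year:04d}-{day.month:02d}"


def recent_indices(today: date, months: int = 3, prefix: str = "articles") -> List[str]:
    """Index names for the current month and the months before it, newest first."""
    names = []
    year, month = today.year, today.month
    for _ in range(months):
        names.append(f"{prefix}-{year:04d}-{month:02d}")
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return names


def search_path(indices: List[str]) -> str:
    """URL path for a search restricted to the given indices."""
    return "/" + ",".join(indices) + "/_search"


if __name__ == "__main__":
    hot = recent_indices(date(2013, 9, 17))
    print(search_path(hot))
```

The common "last 3 months" queries would then hit only three small indices, and the rare full-history query could use a wildcard such as `articles-*` instead.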

Moving data to your frontend nodes mostly means that you will have an
Elasticsearch instance on your frontend nodes (costing you RAM and CPU
there) while maybe only saving the network latency (which occurs anyway in a
cluster), so I do not think that this idea is really feasible.
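
For a sense of scale, here is a rough back-of-the-envelope estimate using only the figures from the original post (8 million articles, about 320 GB of JSON, about 10k new articles per day). It assumes a uniform average article size and 30-day months, and it measures raw JSON, not the size of the resulting index on disk, which will differ.

```python
# Figures quoted in the original post.
TOTAL_ARTICLES = 8_000_000
TOTAL_BYTES = 320 * 10**9      # "about 320G", read as decimal gigabytes
ARTICLES_PER_DAY = 10_000

# Assumptions for this estimate only.
DAYS_PER_MONTH = 30
AVG_ARTICLE_BYTES = TOTAL_BYTES // TOTAL_ARTICLES  # uniform average size


def estimate(days: int):
    """Return (article count, raw JSON size in GB) for `days` of new articles."""
    articles = ARTICLES_PER_DAY * days
    gigabytes = articles * AVG_ARTICLE_BYTES / 10**9
    return articles, gigabytes


if __name__ == "__main__":
    print("average article size (bytes):", AVG_ARTICLE_BYTES)
    print("one month:", estimate(DAYS_PER_MONTH))
    print("three months:", estimate(3 * DAYS_PER_MONTH))
```

Under these assumptions the frequently queried three-month window is roughly a tenth of the whole dataset, which is the part that has to answer within the sub-second target.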

You might want to grab a beer or a coffee (depending on your time of the
day) and check out a few of the videos at
http://www.elasticsearch.org/videos/
(I like the getting down and dirty for getting started, and in your case
the 'big data, search and analytics' one first).
Also, there is a nice presentation on parent-child and nested documents by
Martijn van Groningen from this year's Berlin Buzzwords, which might help
you decide whether you need that functionality.

Hope this helps, if not, just ask more!

--Alex

On Tue, Sep 17, 2013 at 9:20 PM, Nikola Ivačič nikola.ivacic@gmail.com wrote:


(system) #3