Hi Ram,
We have built something similar for a compliance analytics
application. Consider the following:
- The feeding pipeline should perform as much tagging, extraction,
enrichment, and classification as possible; the results are then
indexed. This usually takes care of the computationally intensive
tasks (e.g., complex entity extraction, relationship extraction) and
prepares for later analytics by providing proper entities to work on. As
messages usually don't change (i.e., once indexed, you will keep them
unchanged for the rest of their lifetime), spending a bit more compute
time in feeding is fine.
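For illustration, a minimal Python sketch of such index-time enrichment. The regex-based address extraction is a trivial stand-in for real (expensive) entity/relationship extraction, and the field names are invented for the example, not a fixed schema:

```python
import re

def enrich(message: dict) -> dict:
    """Derive index-time metadata from a raw mail message.

    The regex below is a toy stand-in for real entity extraction;
    all field names here are illustrative.
    """
    body = message.get("body", "")
    return {
        "message_id": message["message_id"],
        "subject": message.get("subject", ""),
        # Extracted entities become first-class, queryable fields.
        "mentioned_addresses": sorted(set(
            re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", body))),
        "has_attachment": bool(message.get("attachments")),
    }

doc = enrich({
    "message_id": "m-001",
    "subject": "Q4 numbers",
    "body": "Please loop in alice@example.com and bob@example.com.",
    "attachments": [],
})
```

The point is only that this work happens once, at feeding time, and its results are indexed as ordinary fields.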
- You don't have to store the original message contents in
Elasticsearch. Try Apache Cassandra and only index a message id in
Elasticsearch, which can be used to retrieve the original message from
Cassandra, or simply from a file storage (in the case of
compliance/e-discovery, it tends to be an immutable file storage). In
our application, the relevant meta-data amounts to only about 60% of
the source volume, so storing the original messages somewhere else
would require only about 38% (0.6/1.6) of the Elasticsearch storage
needed for both together.
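A rough sketch of that split, with a plain dict standing in for the blob store (Cassandra or immutable file storage) and the actual Elasticsearch/Cassandra client calls left out so it stays self-contained:

```python
# Stand-in for the blob store (Cassandra / immutable file storage),
# keyed by message id. Real client calls are deliberately omitted.
blob_store = {}

def ingest(message_id: str, raw_message: bytes, metadata: dict) -> dict:
    """Store the full message in the blob store; return the (much
    smaller) document that would actually be indexed in Elasticsearch."""
    blob_store[message_id] = raw_message
    es_doc = dict(metadata)
    es_doc["message_id"] = message_id  # pointer back to the blob store
    return es_doc  # in real life: es.index(...) with this body

def fetch_original(hit: dict) -> bytes:
    """Resolve a search hit back to the original message."""
    return blob_store[hit["message_id"]]
```

Elasticsearch then only carries the queryable metadata plus the id; the immutable original lives elsewhere.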
- Your queries may become complex, but you can scale with more replicas
and nodes, or simply more RAM as necessary. Unless you're talking about
SMS messages, three nodes seems tight.
- If you need to do some query-time analytics, fetch the candidate
records and use aggregations where possible. Aggregations may not do the
entire job, but they can help find the candidates. You may want to run
a first query to obtain just the aggregations without result hits, and
then run one or more queries to get the actual candidate sets. Querying
should be considered "cheap", so having multiple queries is fine.
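Sketched as Elasticsearch request bodies (plain dicts here; field names like "thread_id" and "sent_date" are invented for the example):

```python
# Phase 1: aggregations only. "size": 0 suppresses result hits, so the
# response stays small and cheap.
agg_query = {
    "size": 0,
    "query": {"range": {"sent_date": {"gte": "2014-01-01"}}},
    "aggs": {
        "by_thread": {"terms": {"field": "thread_id", "size": 100}},
    },
}

def candidate_query(thread_id: str, page_size: int = 1000) -> dict:
    """Phase 2: fetch the actual candidate set for one bucket found
    in the phase-1 aggregation response."""
    return {
        "size": page_size,
        "query": {"term": {"thread_id": thread_id}},
    }
```

You would iterate over the phase-1 buckets and issue one phase-2 query per interesting bucket.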
- Now do the extra analytics on the query result set obtained. For this
purpose, you should look into Apache Spark to handle fast in-memory
processing of this data set, especially if you really have a number of
small, parallel jobs with a significant divergence of run-times. As the
scaling properties of Elasticsearch retrieval and the post-query
processing will most likely be quite different, I would not recommend
using any form of plug-in for Elasticsearch (or Solr).
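Spark would of course distribute this across a cluster; purely as an in-process illustration of the pattern (fan many small jobs out in parallel, merge their results), with a toy correlate() standing in for your custom algorithm:

```python
from concurrent.futures import ThreadPoolExecutor

def correlate(batch):
    """Placeholder for the custom correlation algorithm; here it just
    groups message ids by a 'thread_id' field."""
    groups = {}
    for msg in batch:
        groups.setdefault(msg["thread_id"], []).append(msg["id"])
    return groups

def analyze(candidate_sets):
    """Fan the candidate sets out as small parallel jobs and merge the
    per-batch results -- the same shape a Spark job would have, minus
    the cluster and its data locality."""
    merged = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        for groups in pool.map(correlate, candidate_sets):
            for key, ids in groups.items():
                merged.setdefault(key, []).extend(ids)
    return merged
```

The key property is that the retrieval side and this processing side can be scaled independently.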
- If I take the dimensioning from my application and calculate it for
600 M e-mail messages (average size of 10 kB excluding attachments,
plus derived meta-data of approx. another 6 kB of text), I get around
10 TB of raw data. Three nodes seem to be a bit short for this
application. I don't know about the RAM and CPU sizing in your case,
but you should consider going to a significantly larger number of nodes.
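The back-of-envelope arithmetic, spelled out:

```python
messages = 600_000_000
raw_kb, derived_kb = 10, 6  # per message, excluding attachments

# kB -> TB using decimal units; 600 M x 16 kB = 9.6 TB, i.e. roughly
# the 10 TB figure above -- and that is before replicas, which
# multiply the footprint again.
total_tb = messages * (raw_kb + derived_kb) / 1e9
```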
Some thoughts... your mileage may vary.
Best regards,
--Jürgen
On 12.12.2014 06:04, Ramchandra Phadake wrote:
Hi,
We are storing lots of mail messages in ES with multiple fields: 600
million+ messages across 3 ES nodes.
There is a custom algorithm which works on batch of messages to
correlate based on fields & other message semantics.
Final result involves groups of messages returned similar to say field
collapsing type results.
Currently we fetch 100K+ messages from ES and apply this logic to return
final results to the user. The algorithm can't be modeled using
aggregations.
Obviously this is not a scalable approach if, say, we want to process
100 M messages as part of this processing and return results in a few
minutes. The messages are large and partitioned across a few ES nodes.
We want to maintain data locality while processing so as not to
download lots of data from ES over the network.
Any way to execute some code over shards from within ES, fine if done
as part of postFilter as well. What are options available before
thinking about Hadoop/Spark using es-hadoop library?
Solr seems to have such a plugin hook (experimental) for custom
processing:
https://cwiki.apache.org/confluence/display/solr/AnalyticsQuery+API
Thanks,
Ram
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/f98a4bcb-2d9b-4aca-b49d-9afce519a69a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Kind regards,
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant
Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wagner@devoteam.com, URL: www.devoteam.de
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071