Using the RTC definitions, ES, Hadoop, the JVM and the popular OSes
themselves are "soft"/near real-time systems - so if you are coming
from a hard/firm RT system, you can safely assume that everything (and
again not just ES) is "soft". As a tangent, very few systems are hard
RT (ES is neither a nuclear plant nor a pacemaker).
As ES-Hadoop is just a connector for ES, the real-time aspect of ES
directly influences es-hadoop. I don't have any numbers at hand;
however, there are some aspects that you need to be aware of.
Much of the real-time behaviour when it comes to search is handled
through the refresh API [1]. So when data is ingested into the system,
depending on your index settings (how many replicas, what's the
replication process - sync vs async, all vs n/2+1), the amount of data
ingested and your hardware, your data might become searchable faster or
slower. There are so many non-standardized variables here that the only
way to find out is to do your own benchmark, which is what we
recommend: take a typical box, set it up, hammer it with data and you
get a baseline. Based on that, you can figure out how big your cluster
needs to be and, as a side effect, how much budget you have left to
improve performance (by throwing more hardware at it).
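
To make the benchmark idea concrete, here is a minimal sketch of how the
"indexed but not yet searchable" lag could be timed. It is not part of ES
or es-hadoop: measure_visibility_lag, index_doc and is_searchable are
hypothetical names, and the two callables are assumed to wrap whatever
client calls you use to index one document and to search for it.

```python
import time
from typing import Callable


def measure_visibility_lag(
    index_doc: Callable[[], None],      # hypothetical: indexes one document
    is_searchable: Callable[[], bool],  # hypothetical: True once a search finds it
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return the seconds between index_doc() returning and the first
    successful is_searchable() poll. Raises TimeoutError if the document
    does not show up within timeout_s."""
    index_doc()
    start = clock()
    while True:
        if is_searchable():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError("document not searchable within timeout")
        sleep(poll_interval_s)
```

The measured value is only as fine-grained as poll_interval_s, and it
includes the round-trip time of the search call itself, so treat it as an
upper bound and repeat it many times under realistic load.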
Note, however, that get operations are performed in real-time and are
not affected by refresh [2] - in other words, a get (lookup by ID) sees
the latest data immediately, whereas search can be delayed (as
mentioned above).
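
The difference between get and search can be pictured with a toy model.
This is only an illustration of the behaviour described above, not how ES
is implemented; ToyIndex and all of its methods are made up for the
example.

```python
class ToyIndex:
    """Toy model: writes are visible to get() at once, but only become
    visible to search() after refresh() is called."""

    def __init__(self):
        self._pending = {}     # written, not yet refreshed
        self._searchable = {}  # visible to search

    def index(self, doc_id, doc):
        self._pending[doc_id] = doc

    def get(self, doc_id):
        # A lookup by ID checks the most recent writes first.
        if doc_id in self._pending:
            return self._pending[doc_id]
        return self._searchable.get(doc_id)

    def search(self, predicate):
        # Search only sees documents that have been refreshed.
        return [d for d in self._searchable.values() if predicate(d)]

    def refresh(self):
        self._searchable.update(self._pending)
        self._pending.clear()


idx = ToyIndex()
idx.index("1", {"user": "pierre"})
found_by_get = idx.get("1")                                   # available right away
hits_before = idx.search(lambda d: d["user"] == "pierre")     # empty until refresh
idx.refresh()
hits_after = idx.search(lambda d: d["user"] == "pierre")      # now contains the doc
```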
Hope this helps,
[1] Elasticsearch Platform — Find real-time answers at scale | Elastic
[2] Elasticsearch Platform — Find real-time answers at scale | Elastic
On Fri, Aug 1, 2014 at 12:36 PM, Pierre WP pierrew@widerplanet.com wrote:
I have a question about what "near real-time" means exactly, in a quantified
way, when described this way on the ES-hadoop home page:
We are happy to report that es-hadoop is being used in multiple
data-intensive environments; in a recent example, a large financial
institute that stores all of their raw access logs in Hadoop – billions of
documents – has been using es-hadoop to index the data into Elasticsearch
and then visualize it using Kibana. This approach allowed the customer to
have near real-time visibility into their data through Kibana
(Elasticsearch Platform — Find real-time answers at scale | Elastic)
I've been burned in the past by people throwing around the term "real-time"
in sloppy ways when what they really meant was update lag of many minutes.
(Coming from the hardware world we have a different way of using the term
"real-time" =D)
I'm not saying that's the case here, I'm just asking for numerical
clarification. Naturally I assume it depends on the volume of data flow, the
server equipment, and the configuration settings. I've done about half an
hour of general searching without any definitive answers. Hopefully someone
either knows or can point me to a good resource.
-- Pierre
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/6f40a541-19b6-4a86-b02c-b07b1e3b17b3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.