ES-Hadoop data query liveness "near real-time" quantified?


(Pierre WP) #1

I have a question about what "near real-time" means exactly, in a
quantified way, when described this way on the ES-hadoop home page:

We are happy to report that es-hadoop is being used in multiple
data-intensive environments; in a recent example, a large financial
institute that stores all of their raw access logs in Hadoop – billions of
documents – has been using es-hadoop to index the data into Elasticsearch
and then visualize it using Kibana. This approach allowed the customer to
have near real-time visibility into their data through Kibana

(http://www.elasticsearch.org/blog/es-hadoop-2-0-g/)

I've been burned in the past by people throwing around the term "real-time"
in sloppy ways when what they really meant was update lag of many minutes.
(Coming from the hardware world we have a different way of using the term
"real-time" =D)

I'm not saying that's the case here, I'm just asking for numerical
clarification. Naturally I assume it depends on the volume of data flow,
the server equipment, and the configuration settings. I've done about half
an hour of general searching without any definitive answers. Hopefully
someone either knows or can point me to a good resource.

-- Pierre

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6f40a541-19b6-4a86-b02c-b07b1e3b17b3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Costin Leau) #2

Using the real-time computing (RTC) definitions, ES, Hadoop, the JVM and
the popular OSes themselves are "soft"/near real-time systems - so if you
are coming from a hard/firm RT system, you can safely assume that
everything (and again not just ES) is "soft". As a tangent, very few
systems are hard RT (ES is neither a nuclear factory nor a pacemaker).

As ES-Hadoop is just a connector for ES, the real-time aspect of ES
directly influences es-hadoop. I don't have any numbers at hand;
however, there are some aspects that you need to be aware of.

Much of the real-time behaviour when it comes to search is handled
through the refresh API [1]. So when data is ingested into the system,
depending on your index settings (how many replicas, what the
replication process is - sync vs async, all vs n/2+1), the amount of data
ingested and your hardware, your data might become searchable faster or
slower. There are so many non-standardized variables here that the only
way to find out is to run your own benchmark, which is what we
recommend: take a typical box, set it up, hammer it with data and you
get a baseline. Based on that you can figure out how big your cluster
needs to be and, as a side effect, how much budget you have left to
improve performance (by throwing more hardware at it).
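
As a rough illustration of the benchmark suggested above, here is a Python sketch that times how long a freshly indexed document takes to show up in search. It is not part of ES or es-hadoop: `measure_visibility_lag`, `index_fn` and `search_fn` are hypothetical names, and the two callables are assumptions that you would wire to your own cluster (for example, one that sends an index request and one that runs a search for the document).

```python
import time
from typing import Callable


def measure_visibility_lag(
    index_fn: Callable[[str], None],
    search_fn: Callable[[str], bool],
    doc_id: str,
    poll_interval: float = 0.01,
    timeout: float = 30.0,
) -> float:
    """Return seconds between indexing doc_id and first seeing it in search.

    index_fn(doc_id) must index one document; search_fn(doc_id) must return
    True once a search finds it. Both are hypothetical hooks supplied by the
    caller. Raises TimeoutError if the document never becomes visible.
    """
    start = time.monotonic()
    index_fn(doc_id)
    while True:
        if search_fn(doc_id):
            return time.monotonic() - start
        if time.monotonic() - start > timeout:
            raise TimeoutError(f"{doc_id} not searchable after {timeout}s")
        time.sleep(poll_interval)
```

The measured value includes the time spent in the index call itself and is only as fine-grained as `poll_interval`, so it is more useful to repeat the measurement many times under realistic load and look at the spread than to trust a single number.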

Note however that get operations are performed in real-time and are
not affected by refresh [2] - in other words, data lookup by id is
instantaneous, whereas search can be delayed (as mentioned above).
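
To make the get vs search distinction concrete, here is a toy Python model. It is only a sketch of the visible behaviour, not how Elasticsearch is implemented, and `ToyIndex` and its methods are made-up names. In ES itself the refresh happens periodically (the `index.refresh_interval` setting, 1s by default) or on demand through the refresh API.

```python
class ToyIndex:
    """Toy model of refresh semantics; not how Elasticsearch is implemented."""

    def __init__(self) -> None:
        self._pending = {}      # indexed but not yet refreshed
        self._searchable = {}   # visible to search

    def index(self, doc_id, doc):
        self._pending[doc_id] = doc

    def get(self, doc_id):
        # Lookup by id consults unrefreshed writes first, so it is never stale.
        if doc_id in self._pending:
            return self._pending[doc_id]
        return self._searchable.get(doc_id)

    def search(self, predicate):
        # Search only sees what the last refresh made visible.
        return [d for d in self._searchable.values() if predicate(d)]

    def refresh(self):
        # Publish all pending writes to search in one step.
        self._searchable.update(self._pending)
        self._pending.clear()


idx = ToyIndex()
idx.index("1", {"status": 500})
is_error = lambda d: d["status"] >= 500
before = idx.search(is_error)   # empty: not refreshed yet
fetched = idx.get("1")          # found immediately
idx.refresh()
after = idx.search(is_error)    # now contains the document
```

In this model the gap between `index` and the next `refresh` is exactly the window in which a document can be fetched by id but not yet found by a search.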

Hope this helps,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-refresh.html
[2] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-get.html

On Fri, Aug 1, 2014 at 12:36 PM, Pierre WP pierrew@widerplanet.com wrote:




(Pierre WP) #3

Hi Costin, thanks for the fast and detailed response. I don't have much
direct experience with ES though, so I have basically no intuition about
what the possible range of performance would be. Would you happen to have
some useful case studies that could give me some kind of idea? What is the
BEST case scenario? Could notifications be set up that could detect and
trigger anomaly warnings within 1 second of occurrence?

Looking forward to your Aug 20 webinar! Any chance you have an outline of
the presentation I could look at ahead of time? =D



(Costin Leau) #4

See http://www.elasticsearch.org/case-studies/
There are plenty of use cases from organizations in all kinds of industries. For an IT audience, GitHub, Wikipedia or
Stack Overflow give some insight into
how ES is used at scale, on large volumes of data, and what 'real-time' means.

As for the webinar, I'm still working on some of its content; I don't want to give any spoilers, but know that I'll
do my best not to disappoint :)

On 8/10/14 8:36 PM, Pierre WP wrote:



--
Costin


