Aggregate vs facets vs nested documents?

We just started using Logstash to parse logs and add the data to
Elasticsearch. Using Kibana, we can get some good visualizations of our
process runs.

I have the following records in ES (these are basically log entries from
Logstash):

Timestamp                        | TxnId                                | Caller | EventName | Duration | Other Data..
---------------------------------|--------------------------------------|--------|-----------|----------|-------------
2013-12-20T22:35:17.109365+00:00 | 9f3c264b-8ee3-4b2a-ac16-162290c7cf45 | Worker | Start     |          | ..
2013-12-20T22:38:17.109365+00:00 | 9f3c264b-8ee3-4b2a-ac16-162290c7abcd | Worker | Start     |          | ..
2013-12-20T22:40:17.109365+00:00 | 9f3c264b-8ee3-4b2a-ac16-162290c7cf45 | Worker | End       | 20       | ..
2013-12-20T22:42:17.109365+00:01 | 9f3c264b-8ee3-4b2a-ac16-162290c7abcd | Worker | End       | 40       | ..
2013-12-20T22:45:17.109365+00:00 | 9f3c264b-8ee3-4b2a-ac16-162290c7efgh | Worker | Start     |          | ..

Basically, this is a log of worker processes and their start and finish
times. For a given time period, we would like to know which worker processes
have started and ended, and which have not ended. For example, the result
for the data above would be:

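Timestamp                        | TxnId                                | Caller | EventName | Other | Duration | Other Data..
---------------------------------|--------------------------------------|--------|-----------|-------|----------|-------------
2013-12-20T22:35:17.109365+00:00 | 9f3c264b-8ee3-4b2a-ac16-162290c7cf45 | Worker | Start     | Ended | 20       | ..
2013-12-20T22:38:17.109365+00:00 | 9f3c264b-8ee3-4b2a-ac16-162290c7abcd | Worker | Start     | Ended | 40       | ..
2013-12-20T22:45:17.109365+00:00 | 9f3c264b-8ee3-4b2a-ac16-162290c7efgh | Worker | Start     |       |          | ..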

This can be achieved in SQL by grouping on TxnId. Ideally, we could filter
this further with a HAVING clause to find workers that have not ended (and
how long they have been working, to detect timeouts). I think this is
basically a correlation on TxnId and EventName.
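
To make the SQL version concrete, here is a minimal sketch using Python's
built-in sqlite3 module. The table name `events`, the column names, and the
shortened TxnIds and timestamps are assumptions made for illustration; only
the GROUP BY / HAVING idea comes from the description above.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "ts TEXT, txn_id TEXT, caller TEXT, event_name TEXT, duration INTEGER)"
)

# Sample rows mirroring the log above (TxnIds shortened to their last
# four characters, timestamps truncated to seconds).
rows = [
    ("2013-12-20T22:35:17", "cf45", "Worker", "Start", None),
    ("2013-12-20T22:38:17", "abcd", "Worker", "Start", None),
    ("2013-12-20T22:40:17", "cf45", "Worker", "End", 20),
    ("2013-12-20T22:42:17", "abcd", "Worker", "End", 40),
    ("2013-12-20T22:45:17", "efgh", "Worker", "Start", None),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)

# One row per transaction: start time, whether an End event exists,
# and the duration reported by the End event (NULL if there is none).
SUMMARY_SQL = """
SELECT txn_id,
       MIN(CASE WHEN event_name = 'Start' THEN ts END) AS started_at,
       MAX(CASE WHEN event_name = 'End' THEN 1 ELSE 0 END) AS ended,
       MAX(duration) AS duration
FROM events
GROUP BY txn_id
ORDER BY started_at
"""
summary = conn.execute(SUMMARY_SQL).fetchall()

# HAVING keeps only the transactions that have no End event.
UNFINISHED_SQL = """
SELECT txn_id
FROM events
GROUP BY txn_id
HAVING SUM(CASE WHEN event_name = 'End' THEN 1 ELSE 0 END) = 0
"""
unfinished = conn.execute(UNFINISHED_SQL).fetchall()
```

With the sample rows, `summary` has one row per TxnId and `unfinished`
contains only the transaction ending in efgh.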

I have looked at facets (and aggregations), nested documents, and
parent/child objects. I did not find much documentation or many examples
for the new way of faceting (aggregations). It seems to me that it would be
ideal if I could build another index that uses TxnId for correlation, and
then get the results I want from it. I guess this is pretty close to what
facets can do, but I am still not sure whether, if I facet these documents
by TxnId, I would be able to run a query across all faceted entries asking
for the start time and end time. Can I build an alternate index of nested
documents based on TxnId? If so, I can pretty much query what I want. The
other alternative is to update the document with the appropriate data while
reading the log entries. It would be great if someone could point me in the
right direction here.

Thanks,

Vijay

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6c80a812-7c79-4794-98ca-c13e367302b6%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,

Your data model assumes a 1-N relationship between transactions and
worker events. There are several ways to solve this kind of problem
with Elasticsearch:

  • denormalization,
  • nested documents,
  • parent/child relations.

Maybe the easiest way to do it would be to store data directly in the
expected format. On a start event, you would insert a new transaction into
your index using the transaction id as a document id, and later when the
transaction ends, you could update the transaction to record the fact that
the transaction finished and its duration. With this option, data is
indexed in a way that is easily searchable and you'll be able to leverage
all the power of aggregations and facets to compute things like the number of
non-terminated transactions, the average duration, etc.
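
As a rough illustration of this first option, the sketch below models the
index as a plain Python dict keyed by transaction id, so it runs without a
cluster. The function `apply_event` and the field names `Ended`,
`StartedAt` and `EndedAt` are hypothetical; in a real setup the two
branches would correspond to an index request that uses TxnId as the
document id and a later update of that same document.

```python
from statistics import mean


def apply_event(store, event):
    """Upsert one log event into the transaction document keyed by TxnId."""
    txn_id = event["TxnId"]
    # Create the transaction document on first sight of this id.
    doc = store.setdefault(txn_id, {"TxnId": txn_id, "Ended": False})
    if event["EventName"] == "Start":
        doc["StartedAt"] = event["Timestamp"]
        doc["Caller"] = event["Caller"]
    elif event["EventName"] == "End":
        # Record that the transaction finished, and its duration.
        doc["Ended"] = True
        doc["EndedAt"] = event["Timestamp"]
        doc["Duration"] = event["Duration"]
    return doc


# Sample events mirroring the log in the question (ids shortened).
events = [
    {"Timestamp": "2013-12-20T22:35:17", "TxnId": "cf45", "Caller": "Worker", "EventName": "Start"},
    {"Timestamp": "2013-12-20T22:38:17", "TxnId": "abcd", "Caller": "Worker", "EventName": "Start"},
    {"Timestamp": "2013-12-20T22:40:17", "TxnId": "cf45", "Caller": "Worker", "EventName": "End", "Duration": 20},
    {"Timestamp": "2013-12-20T22:42:17", "TxnId": "abcd", "Caller": "Worker", "EventName": "End", "Duration": 40},
    {"Timestamp": "2013-12-20T22:45:17", "TxnId": "efgh", "Caller": "Worker", "EventName": "Start"},
]

store = {}  # stands in for the index: document id -> transaction document
for event in events:
    apply_event(store, event)

# With one document per transaction, the statistics become simple scans.
unfinished = sorted(t for t, doc in store.items() if not doc["Ended"])
average_duration = mean(doc["Duration"] for doc in store.values() if doc["Ended"])
```

Because the document is created on whichever event arrives first, an End
event that is processed before its Start event still ends up in the same
transaction document.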

However, if you have a very high ingestion rate, this option might become a
bit too slow, in which case you might want instead to record transactions
as parent documents and events as child documents using parent/child
relations. This will make indexing faster but require more memory and
you'll have less power at query time. For example, finding transactions
that started but didn't finish would require finding the start events,
resolving their transaction ids, then finding the end events, resolving
their transaction ids, and finally computing the difference of the two sets
of transactions. This kind of query would typically execute much slower than
if you already had all information in a single transaction document (as
with the previous option). Moreover, computing things like the average
duration of transactions using facets or aggregations wouldn't be possible
anymore with parent/child relations.
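
The set difference described above can be sketched in a few lines of plain
Python. This is only a model of the logic, not a parent/child query: the
function name `unfinished_transactions` and the dict-shaped events are
assumptions for illustration.

```python
def unfinished_transactions(events):
    """Return the ids of transactions with a Start event but no End event."""
    # Resolve the transaction ids of the start events...
    started = {e["TxnId"] for e in events if e["EventName"] == "Start"}
    # ...then those of the end events...
    ended = {e["TxnId"] for e in events if e["EventName"] == "End"}
    # ...and take the difference of the two sets.
    return started - ended


# Sample events mirroring the log in the question (ids shortened).
sample_events = [
    {"TxnId": "cf45", "EventName": "Start"},
    {"TxnId": "abcd", "EventName": "Start"},
    {"TxnId": "cf45", "EventName": "End"},
    {"TxnId": "abcd", "EventName": "End"},
    {"TxnId": "efgh", "EventName": "Start"},
]
still_running = unfinished_transactions(sample_events)
```

Both id sets have to be materialized before they can be compared, which is
the extra work that a single transaction document avoids.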

See
http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
for more information.

--
Adrien Grand


Adrien,
Thanks for the response. Since the rate of log generation is not too high at the moment, I will try option 1 for now.

Really appreciate all the help.

Regards,
Vijay

PS - Sent from my phone

On Dec 24, 2013, at 5:39 AM, Adrien Grand adrien.grand@elasticsearch.com wrote:

