Fielddata circuit breaker problems

Hi! So I have millions and millions of documents in my Elasticsearch, each
one of which has a field called "time". I need the results of my queries to
come back in chronological order. So I put a "sort":{"time":{"order":"asc"}}
in all my queries. This was going great on smaller data sets but then
Elasticsearch started sending me 500s and circuit breaker exceptions
started showing up in the logs with "data for field time would be too
large". So I checked out
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html
and that looks a lot like what I've been seeing: seems like it's trying to
pull all the millions of time values into memory even if they're not
relevant to my query. What are my options for fixing this? I can't
compromise on chronological order; it's at the heart of my application. "More
memory" would be a short-term fix but the idea is to scale this thing to
trillions and trillions of points and that's a race I don't want to run.
Can I make these exceptions go away without totally tanking performance?
Thanks!
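
For concreteness, here's the shape of a typical query (the "events" index
name is just a stand-in for mine):

# every search carries the same sort clause
curl -XPOST 'localhost:9200/events/_search' -d '
{
  "query": { "match_all": {} },
  "sort": { "time": { "order": "asc" } }
}'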

Have you tried indexing your data using "doc_values" as your fielddata
format?

Wow, that really looks like it'll solve all my problems! I'm not entirely
clear from the docs on where exactly I configure that formatting, though.
Can you point me in the right direction? Thanks!

When creating the index, specify a mapping and set the fielddata format
there. Take the example from the documentation and embed it in a normal
mapping definition. However, I realised the docs only list
string/numeric/geo_point as supported, so you may need to store your time
field as a numeric timestamp.
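
For example, something like this at index-creation time (an untested
sketch; "events" and "event" are placeholder names, and the time field is
stored as an epoch-millis long here):

# set the fielddata format in the mapping when the index is created
curl -XPUT 'localhost:9200/events' -d '
{
  "mappings": {
    "event": {
      "properties": {
        "time": {
          "type": "long",
          "fielddata": { "format": "doc_values" }
        }
      }
    }
  }
}'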

Hope it helps.

If you really want to sort on a timestamp, use a discretization strategy
for better performance.

If you use a millisecond-resolution timestamp, ES has to load every value
of the field into memory, and because the values are nearly all unique,
that load is massive.

But if you store year, month, day, hour, minute, etc. in separate fields,
down to the resolution you want (e.g. seconds), each field has far fewer
unique values.

Then you can sort on multiple fields with very small memory consumption,
something like

"sort": [
{ "year": { "order": "asc" }},
{ "month": { "order": "asc" }},
{ "day": { "order": "asc" }},
{ "hour": { "order": "asc" }},
{ "min": { "order": "asc" }},
{ "sec": { "order": "asc" }}
]
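
For example, a document for 2014-10-02T03:29:30 would carry the
discretized fields alongside the original value (index/type names are only
illustrative):

# one document, with the timestamp broken out into low-cardinality fields
curl -XPUT 'localhost:9200/events/event/1' -d '
{
  "time": "2014-10-02T03:29:30.000Z",
  "year": 2014,
  "month": 10,
  "day": 2,
  "hour": 3,
  "min": 29,
  "sec": 30
}'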

Jörg

Thanks for the input, Jörg, but millisecond granularity is
application-critical here. I'm trying to work with the doc_values fielddata
format for time. Currently I'm sending over ISO strings, and Elasticsearch
is parsing them into dates internally, which I understand will stop me from
using doc_values. I can turn this off, though: are there any dire
performance/workability implications from forcing it to store my dates as
strings and not parse them into dates?
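
Concretely, I'm thinking of a mapping along these lines (untested, just
the shape I have in mind; "events"/"event" are placeholder names):

# map "time" as an unanalyzed string so no date parsing happens
curl -XPUT 'localhost:9200/events' -d '
{
  "mappings": {
    "event": {
      "properties": {
        "time": {
          "type": "string",
          "index": "not_analyzed",
          "fielddata": { "format": "doc_values" }
        }
      }
    }
  }
}'

My ISO strings are fixed-width and all in UTC, so lexicographic order on
them should still be chronological order.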

Actually, the date section of the mapping documentation seems to suggest
that doc_values is valid for dates, even though dates aren't mentioned in
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html.
Maybe I can get away with not disabling conversion...
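
So maybe something as simple as this would work (an untested sketch,
placeholder names again):

# keep the date type, just back it with doc_values
curl -XPUT 'localhost:9200/events' -d '
{
  "mappings": {
    "event": {
      "properties": {
        "time": {
          "type": "date",
          "fielddata": { "format": "doc_values" }
        }
      }
    }
  }
}'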
