Querying on large docs

m4kkur0 · August 27, 2023, 6:13pm

Hello all,
I would like to know what good options there are for querying an index in which each doc is structured like this:
field1, keyword
field2, long
field3, object, enabled: false (mostly below 1 mb but sometimes goes up to 50 mb)

What I want to achieve is:

1. Query by field1 and field2 to retrieve field3. This retrieves only 1 doc.
2. Query by field1 and return only the field2 value. This returns multiple docs.

When I tried the second option, it took too much time even when I excluded field3 from _source in the query.
So, would using docvalue_fields be a solution for the second option? Would it still take too much heap memory because of field3?
What do you suggest for option 1?

carly.richmond · August 29, 2023, 10:54am

Hi @m4kkur0,

Welcome to the community! How much time is too much time for your first query?

Can you share the queries you are running? It might also be useful to run the queries through the Search Profiler or the Profiler API to see how long each query stage is taking.
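
For reference, the Profile API is switched on by adding "profile": true to the search request body. A minimal Python sketch of that, assuming a bool query on field1 and field2 like the one described above; the helper name with_profiling and the placeholder values are made up for illustration:

```python
import json


def with_profiling(search_body: dict) -> dict:
    """Return a copy of a search request body with profiling enabled."""
    profiled = dict(search_body)  # shallow copy, the caller's dict is untouched
    profiled["profile"] = True
    return profiled


# Placeholder values: "some-keyword" and 42 are not taken from the thread.
base_body = {
    "size": 1,
    "query": {
        "bool": {
            "must": [
                {"term": {"field1": "some-keyword"}},
                {"term": {"field2": 42}},
            ]
        }
    },
}

profiled_body = with_profiling(base_body)
print(json.dumps(profiled_body, indent=2))
```

The printed JSON can be sent as the body of a _search request; the response then carries a profile section with per-shard timings.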

m4kkur0 · August 29, 2023, 11:33am

Hello and thank you for your reply!
My first query is:

size: 1
query:
bool: must:
[
term: field1: any keyword value,
match: field2: any long value
]

Second query:

size: 10000
query:
term: field2: any keyword value
_source: [field1]

There is also a third query, which I forgot to mention.
In this one, we also retrieve field4, which is a keyword field.

size: 10000
query: bool: must:
[
range: field1: gte a long value and lte a long value,
term: field2: a keyword value
]
_source: [field4, field2]

The first query actually runs pretty fast (200 to 300 ms), so how much time it takes is not a problem. What concerns me is the memory usage and the errors that may occur because of it.

The second and third queries take more than 30 seconds, and sometimes a node disconnected exception occurs. So I tried setting _source to false and using docvalue_fields for these two queries, and then they take about 200 ms. But what concerns me, again, is memory usage.

Does using docvalue_fields solve the node disconnected errors in this case? Is it much lighter than using _source?

Sorry, I am sending this from my mobile, so the queries are shortened and there may be errors in my text.

carly.richmond · August 29, 2023, 12:43pm

Thanks for confirming, @m4kkur0. If memory usage is your main concern, I would recommend using docvalue_fields. There is a bit more detail in this section of the documentation, which discusses how the document is loaded when using the _source attribute.

When _source is used, the entire _source of the document is loaded, even if only part of it is requested:

A document’s _source is stored as a single field in Lucene. This structure means that the whole _source object must be loaded and parsed even if you’re only requesting part of it.
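
As a rough illustration of the docvalue_fields approach, here is a Python sketch that builds the request body for the "query by field1, return only field2" case and reads the values back from the fields section of each hit. The helper names, the placeholder keyword value, and the sample response are hypothetical; the sample response is hand-written to show the shape and is not real output.

```python
import json


def field2_by_field1(field1_value: str, size: int = 10000) -> dict:
    """Build a search body that returns field2 from doc values only."""
    return {
        "size": size,
        "_source": False,  # skip loading and parsing the large _source
        "query": {"term": {"field1": field1_value}},
        "docvalue_fields": ["field2"],
    }


def extract_field2(response: dict) -> list:
    """Collect field2 values from the "fields" section of each hit."""
    values = []
    for hit in response["hits"]["hits"]:
        # docvalue_fields values come back as arrays under "fields"
        values.extend(hit.get("fields", {}).get("field2", []))
    return values


request_body = field2_by_field1("some-keyword")
print(json.dumps(request_body, indent=2))

# Hand-written stand-in for a search response, for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "a", "fields": {"field2": [10]}},
            {"_id": "b", "fields": {"field2": [20]}},
        ]
    }
}
print(extract_field2(sample_response))
```

Because field2 is a long and field1 is a keyword, both have doc values by default, so neither needs _source to be read. field3 (object, enabled: false) exists only in _source and cannot be fetched this way.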

m4kkur0 · August 29, 2023, 12:53pm

Thanks!

system · September 26, 2023, 12:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
