How expensive is Source Filtering?

Hi,
Pretty big documents are stored in the index, but only a small share of their fields is indexed; let's say only 5% of the fields are indexed and the rest are just stored in the _source field.
As mentioned, the documents are big, and when no source filtering is set it takes a long time to get results back (purely because of IO, not search). To get records in a reasonable time we decided to use a source filter, but I am not sure how expensive it is for Elasticsearch (Lucene) to apply the projection. Please let me know whether this would become a source of performance issues, or whether we should break the documents down into more than one index.
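For context, this is roughly how we apply the source filter; the index name and field paths below are just placeholders for our real ones:

```
GET my-index/_search
{
  "_source": ["customer.id", "order.total"],
  "query": { "match_all": {} }
}
```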

Thanks in advance

“Source” is stored as a blob of JSON in the underlying Lucene storage. It can be filtered, but not without incurring the cost of reading the full JSON from Lucene.
At index time you can choose to extract selected fields for storage in Lucene, where they can be retrieved at query time individually without needing to parse the full JSON. See https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-store.html
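A sketch of what that looks like (the index and field names here are made up):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "title":  { "type": "text",    "store": true },
      "status": { "type": "keyword", "store": true }
    }
  }
}

GET my-index/_search
{
  "stored_fields": ["title", "status"],
  "query": { "match_all": {} }
}
```

With `store: true`, the values are written as individual stored fields in Lucene, so retrieving them via `stored_fields` avoids loading and parsing the whole `_source` blob.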


@Mark_Harwood thanks for the reply.
I just wanted to get some insight into document sizes. Our documents are around 400KB each (~1M of them). I know it depends on the underlying hardware, IO, etc., but are these kinds of numbers normal, or are our documents too big?

One more thing: I took a look at the Explain and Profile APIs, which I think are geared more toward investigating indexing/search issues (if I understood them correctly). Is there any other way to assess IO times and counts?
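For reference, this is how I have been enabling profiling on a search (index and query are placeholders); as far as I can tell it covers query-phase execution rather than the fetch IO I am worried about:

```
GET my-index/_search
{
  "profile": true,
  "query": { "match": { "title": "test" } }
}
```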

Thanks in advance

Those are relatively large documents, at least based on my experience.


Rally is our benchmarking tool.
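A minimal invocation, assuming Rally 2.x is installed and using one of the standard tracks (for realistic numbers you would build a custom track modelled on your own documents):

```
esrally race --track=geonames --distribution-version=7.17.0
```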

