Hi,
Pretty big documents are stored in the index, but not many of their fields are indexed; let's say only 5% of the fields are indexed, and the rest are just stored in the _source field.
As mentioned, the documents are quite big, and when no source filtering is set it takes a long time to get results (purely because of IO, not search). To get records back in a reasonable time we decided to use a source filter, but I am not sure how expensive it is for Elasticsearch (Lucene) to apply that projection. Please let me know whether it could become a performance issue, or whether we should break the document down across more than one index.
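For reference, this is roughly the kind of request we run (the index and field names here are just placeholders for illustration):

```json
GET my-index/_search
{
  "_source": ["customer.id", "order.total"],
  "query": {
    "match": { "order.status": "shipped" }
  }
}
```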
“Source” is stored as a blob of JSON in the underlying Lucene storage. It can be filtered but not without incurring the cost of reading the full JSON from Lucene.
At index time you can choose to extract selected fields for storage in Lucene, where they can be retrieved at query time individually without needing to parse the full JSON. See https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-store.html
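For example, something along these lines (hypothetical index and field names): mark the field as stored in the mapping, then request it via `stored_fields` at search time so the `_source` blob never has to be read and parsed.

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "store": true
      }
    }
  }
}

GET my-index/_search
{
  "_source": false,
  "stored_fields": ["title"],
  "query": { "match_all": {} }
}
```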
@Mark_Harwood thanks for the reply.
I just wanted to get some insight into document sizes. We have documents of around 400KB (~1M 3 of them). I know it depends on the underlying hardware, IO, etc., but are these numbers normal, or are our documents too big?
One more thing: I took a look at the Explain and Profile APIs, which I think are more for investigating indexing/search issues (if I understood them correctly). Is there any other way to assess IO times and counts?
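For context, this is how I have been using the Profile API so far (a minimal example; the index and field names are placeholders):

```json
GET my-index/_search
{
  "profile": true,
  "query": {
    "match": { "title": "foo" }
  }
}
```

As far as I can tell this breaks down time spent in the query and collector phases, but it doesn't seem to surface raw disk IO numbers, which is what I'm really after.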