AFAIK, in the current version query results are always returned from the stored fields (the fdt file). To get values from doc values (the dvd file), you have to use the aggregation framework.
AFAIK, in the upcoming version 2.0 all fields will be stored as doc values by default (except analyzed text fields).
Question: will query results then be read from the dvd file (doc values) rather than from the slower fdt file?
Query results will still be loaded from stored fields, meaning the fdt file. This is indeed slower if your index fits entirely in RAM, but stored fields also know about the JSON structure and provide much better latency if you happen to have lots of fields or an index larger than your main memory.
I am trying to implement a join (of sorts) in ES. What I need is to retrieve a single field (the _ids of linked documents) and the _id of the containing document. I am using a _count + "facets" aggregation.
Is this the best/only way to retrieve a single field value?
If you want to retrieve all matching _id values, then this option has the downside that it does not support pagination, so your only option is to retrieve every id in a single request, which will not scale if you have a large index.
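To make the "every id in a single request" point concrete, here is a rough sketch of what such a request body could look like, built as a plain Python dict. The field name `links` and the aggregation name `linked_ids` are made up for illustration. As far as I recall, a terms aggregation `size` of 0 meant "return all terms" in the 1.x line, but treat that as an assumption and check it against your version.

```python
import json


def build_all_ids_request(link_field: str) -> dict:
    """Build a search body that returns no hits, only one terms aggregation.

    `link_field` is a hypothetical field holding the linked document ids.
    """
    return {
        "size": 0,  # do not return any hits, only the aggregation
        "query": {"match_all": {}},
        "aggs": {
            "linked_ids": {  # hypothetical aggregation name
                # Assumption: size 0 asks for every term (1.x behaviour).
                # Either way there is no paging: all buckets come back at once.
                "terms": {"field": link_field, "size": 0}
            }
        },
    }


body = build_all_ids_request("links")
print(json.dumps(body, indent=2))
```

Since the whole bucket list comes back in one response, memory use grows with the number of distinct values, which is why this does not scale on a large index.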
In the future, we might be able to optimize the single-field use case by going to doc values instead of stored fields, but this is not implemented today. Additionally, it might prove challenging for some fields. For instance, for dates we store a formatted date in stored fields, while doc values only store the timestamp in milliseconds.
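A small Python illustration of the date problem, outside of Elasticsearch entirely: once only the millisecond timestamp is kept, the original string can only be rebuilt if the format is known from somewhere else. The format strings and helper names below are assumptions for the sake of the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"  # assumed format of the original document


def to_epoch_millis(formatted: str, fmt: str = ISO_FMT) -> int:
    """Parse a formatted UTC date into epoch milliseconds (what doc values hold)."""
    dt = datetime.strptime(formatted, fmt).replace(tzinfo=timezone.utc)
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int, fmt: str = ISO_FMT) -> str:
    """Re-format epoch milliseconds; the result depends on the chosen format."""
    return (EPOCH + timedelta(milliseconds=millis)).strftime(fmt)


millis = to_epoch_millis("2015-06-01T12:00:00Z")
print(millis)                                 # 1433160000000
print(from_epoch_millis(millis))              # 2015-06-01T12:00:00Z
print(from_epoch_millis(millis, "%d/%m/%Y"))  # 01/06/2015
```

The same number yields different strings depending on the format applied, so serving dates from doc values would need the format to be re-applied from the mapping.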
I assume that you already know of parent/child relations, which allow you to perform (limited) joins?
Thanks so much again.
As you said, parent/child does not support many-to-many links, so it does not help in my case.
I intend to store the source _id, link type and target _id in a single multi-valued field. This will give the full link while accessing a single location in the doc values column.
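For what it is worth, one way such a combined value could be encoded is a delimited string per link. This is only a sketch: the `|` separator and the function names are my own choices, and it assumes ids and link types never contain the separator.

```python
from typing import List, Tuple

SEP = "|"  # assumed separator; must never occur inside an id or a link type


def encode_link(source_id: str, link_type: str, target_id: str) -> str:
    """Pack one link into a single term for a multi-valued field."""
    parts = (source_id, link_type, target_id)
    if any(SEP in part for part in parts):
        raise ValueError("separator found inside a link component")
    return SEP.join(parts)


def decode_link(term: str) -> Tuple[str, str, str]:
    """Unpack a term produced by encode_link."""
    source_id, link_type, target_id = term.split(SEP)
    return source_id, link_type, target_id


# Values for one document's multi-valued link field.
links: List[str] = [
    encode_link("doc1", "cites", "doc7"),
    encode_link("doc1", "cites", "doc9"),
]
print(links)                  # ['doc1|cites|doc7', 'doc1|cites|doc9']
print(decode_link(links[0]))  # ('doc1', 'cites', 'doc7')
```

Each term is self-describing, so a single terms aggregation over the field returns complete links without a second lookup for the source id.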
Is it possible to write a custom aggregation that will scale better than facets for my use case?
Many-to-many relations are very hard to deal with in a distributed system. At least the one-to-many case can colocate related data in the same partition, but this is generally not possible with many-to-many relations. To be honest, I think the only way to tackle such a problem would be to use some heuristics and, e.g., only follow relations of the top matches of the first query (similarly to how the top_children query did).
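As a client-side sketch of that heuristic (not how top_children is implemented), the idea is to take the k best hits of the first query, collect only their link targets, and fetch those in a second request. The data below is in-memory and invented; `follow_top_links` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def follow_top_links(
    hits: List[Tuple[str, float]],
    links_by_id: Dict[str, List[str]],
    k: int,
) -> List[str]:
    """Return link targets of the k highest-scoring hits, without duplicates."""
    top = sorted(hits, key=lambda hit: hit[1], reverse=True)[:k]
    seen = set()
    targets: List[str] = []
    for doc_id, _score in top:
        for target in links_by_id.get(doc_id, []):
            if target not in seen:
                seen.add(target)
                targets.append(target)
    return targets


# (doc id, score) pairs as they might come back from the first query.
hits = [("a", 2.5), ("b", 1.0), ("c", 3.1)]
links_by_id = {"a": ["x", "y"], "b": ["z"], "c": ["y", "w"]}

targets = follow_top_links(hits, links_by_id, k=2)
print(targets)  # ['y', 'w', 'x']
```

The cost of the second request is bounded by k rather than by the total number of matches, at the price of missing links from lower-ranked documents.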