It is not deprecated yet, we are only discussing it for now.
Conceptually, yes, your document would be indexed as if it had "BuyersNames" : ["John", "George"]. However, note that the _source will not reflect this.
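To illustrate the point about _source, here is a minimal Python sketch. It only models the idea and does not talk to Elasticsearch. The nested field name `Buyers` and the helper `indexed_terms` are assumptions made for the example; the thread only names `BuyersNames` and `Name`. The mapping dict shows roughly what a `copy_to` setup could look like, assuming the names are copied from a nested keyword field into a root-level keyword field.

```python
# Hypothetical mapping: "Buyers" is an assumed name for the nested field.
mapping = {
    "properties": {
        "Buyers": {
            "type": "nested",
            "properties": {
                # copy_to sends each Name value to the root-level field as well
                "Name": {"type": "keyword", "copy_to": "BuyersNames"},
            },
        },
        "BuyersNames": {"type": "keyword"},
    }
}


def indexed_terms(source):
    """Values that would conceptually be indexed under BuyersNames."""
    return [buyer["Name"] for buyer in source.get("Buyers", [])]


source = {"Buyers": [{"Name": "John"}, {"Name": "George"}]}

print(indexed_terms(source))      # ['John', 'George']
print("BuyersNames" in source)    # False: the _source itself is not modified
```

The copied values exist only in the index, which is why a search on BuyersNames matches even though the field never appears in the returned _source.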
By the way, is it really helpful in terms of search performance? That is, would a term query (BuyersNames includes John) be much faster than a nested query (BuyersNames.Name = John)?
Yes, this will be much faster. I can't quantify it, but the difference should be noticeable.
I have tried this mapping today, and it worked just as expected.
Thank you for the remark about the _source, it was very confusing at first!
One last discussion: how smart is this copying?
1. If I have 2 buyers with the name "John", will the BuyersNames array include "John" twice? I believe it will, and that's fine. This way I can count the number of buyers by the length of BuyersNames instead of following the nested documents.
2. If a buyer is deleted, will his/her name be removed from BuyersNames? I believe it won't, and that's fine. After all, it is called "copy", not "binding", and besides, deleting or updating a nested document is not common.
3. Does include_in_* behave the same, concerning questions 1+2?
If I have 2 buyers with the name "John", will the BuyersNames array include "John" twice? I believe it will, and that's fine. This way I can count the number of buyers by the length of BuyersNames instead of following the nested documents.
Actually keyword fields are aware of duplicates for scoring, but not for aggregations. So if you count the number of buyers, all johns will count as 1.
If a buyer is deleted, will his/her name be removed from BuyersNames? I believe it won't, and that's fine. After all, it is called "copy", not "binding", and besides, deleting or updating a nested document is not common.
It will be removed. Elasticsearch does not perform in-place updates. If you update a document, we actually compute the new updated document and reindex it entirely.
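As a rough sketch of why the name disappears: if every update replaces the whole document and rebuilds the derived field from scratch, nothing stale can survive. The toy `index` dict and the helpers `copied_names` and `reindex` below are hypothetical, as is the nested field name `Buyers`; this models the behaviour described above and is not Elasticsearch code.

```python
def copied_names(source):
    """Recompute the copied field from the current nested buyers."""
    return [buyer["Name"] for buyer in source["Buyers"]]


def reindex(index, doc_id, new_source):
    """Replace the whole document; derived fields are rebuilt, not patched."""
    index[doc_id] = {
        "_source": new_source,
        "BuyersNames": copied_names(new_source),
    }


index = {}
reindex(index, "1", {"Buyers": [{"Name": "John"}, {"Name": "George"}]})
print(index["1"]["BuyersNames"])  # ['John', 'George']

# "Deleting" George means indexing a new version of the document without him.
reindex(index, "1", {"Buyers": [{"Name": "John"}]})
print(index["1"]["BuyersNames"])  # ['John']
```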
Does include_in_* behave the same - concerning questions 1+2?
Actually keyword fields are aware of duplicates for scoring, but not for aggregations. So if you count the number of buyers, all johns will count as 1.
Didn't know that, it's very interesting!
Let's say I've got 2 documents, each with 3 Johns. When a terms aggregation is run over all the documents that contain "john", will the result be "john": 6 or "john": 2?
Hey again @jpountz. Sorry for bumping the thread again, but I want to ask about your last insight concerning "exactly once" in the Keyword data type. Since the Array and Object data types behave like their inner data types (Keyword in my use case), they can't help here. Does that mean that only a Nested data type containing a Keyword field will do the job when I need to count duplicate values?
By the way, I think this should really be documented under the Keyword data type.
I think you are right that only nested could do the job in that case. For the record, this is not specific to keywords, but to aggregations: aggregations count documents, not values. So if a given document has the same value twice, it still only counts as 1, since there is only 1 document. I think this is expected, since the field in the response is called doc_count?
What you said totally makes sense. I hadn't noticed until now that some aggregations return "doc_count" while others return "value". Thanks again!
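The difference between counting documents and counting values can be sketched in plain Python. This is only a model of the semantics described above, not Elasticsearch code; `terms_doc_count` and `value_occurrences` are hypothetical helper names. It uses the earlier example of 2 documents with 3 Johns each.

```python
from collections import Counter


def terms_doc_count(docs, field):
    """Count, per value, how many documents contain it (doc_count semantics)."""
    counts = Counter()
    for doc in docs:
        # set() collapses repeated values inside a single document
        for value in set(doc.get(field, [])):
            counts[value] += 1
    return dict(counts)


def value_occurrences(docs, field):
    """Count every occurrence of every value, duplicates included."""
    return dict(Counter(v for doc in docs for v in doc.get(field, [])))


docs = [
    {"BuyersNames": ["john", "john", "john"]},
    {"BuyersNames": ["john", "john", "john"]},
]

print(terms_doc_count(docs, "BuyersNames"))    # {'john': 2}
print(value_occurrences(docs, "BuyersNames"))  # {'john': 6}
```

With nested objects, each buyer is indexed as its own hidden document, which is presumably why a count over the nested documents would see every John separately.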