I am implementing a search engine using Elasticsearch (ES) for my master's thesis.
The dataset contains email documents. Each document holds metadata (features) about an email conversation, plus the body of the conversation itself, attached either as a parent-child relation or as a nested list on the document. The users of my search engine need to access these email conversations.
Many fields of the documents in my dataset have empty or NaN values (simply because the data is old); the more recent the data, the fewer empty values it has.
As I am not storing these empty fields, I am concerned that the scores of these older documents will be lower than those of the most recent ones, since the fields being queried by users are the same for all documents. All of this is under the assumption that, for a given user query, an old email conversation and a recent one are equally valuable.
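To illustrate what "not storing empty fields" means here, this is a minimal sketch of the kind of cleanup applied before indexing. The field names and the helper `drop_empty_fields` are hypothetical and only for illustration; they are not my real schema.

```python
import math


def drop_empty_fields(doc):
    """Return a copy of doc without None, NaN, or empty string/list/dict values."""

    def is_empty(value):
        if value is None:
            return True
        # NaN typically appears when the data was loaded from a table with gaps.
        if isinstance(value, float) and math.isnan(value):
            return True
        if isinstance(value, (str, list, dict)) and len(value) == 0:
            return True
        return False

    return {key: value for key, value in doc.items() if not is_empty(value)}


# Hypothetical documents: the old one is sparse, the recent one is complete.
old_email = {
    "sender": "a@example.com",
    "department": float("nan"),
    "topic": "",
    "priority": None,
}
recent_email = {
    "sender": "b@example.com",
    "department": "sales",
    "topic": "invoice",
    "priority": "high",
}

old_clean = drop_empty_fields(old_email)
recent_clean = drop_empty_fields(recent_email)
```

After this step the old document is indexed with only the `sender` field, while the recent one keeps all four, which is the difference in density I am asking about.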
My query strategy is to query the metadata of the emails, whose fields are all keywords, instead of querying the nested email conversations, as that may slow down retrieval.
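As a rough sketch of that strategy, the request body could be built as a `bool` query with one `term` clause per keyword field. The field names and the helper `build_metadata_query` are hypothetical, and using `should` clauses is an assumption made for illustration, not necessarily the final design.

```python
def build_metadata_query(criteria):
    """Build an ES request body that matches keyword metadata fields exactly.

    criteria maps a (hypothetical) keyword field name to the value to match.
    Each pair becomes one scored `term` clause under `bool.should`.
    """
    clauses = [{"term": {field: value}} for field, value in criteria.items()]
    return {
        "query": {
            "bool": {
                "should": clauses,
                # Require at least one clause to match; an assumption for this sketch.
                "minimum_should_match": 1,
            }
        }
    }


body = build_metadata_query({"sender": "a@example.com", "department": "sales"})
```

With this shape, a document that lacks the `department` field can still match on `sender`, but it matches fewer clauses than a complete document, which is where my concern about scores comes from.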
Any clues, research, or tips on how data density affects information retrieval in ES?
How would I benefit from using a separate index for email conversations?
Thanks in advance. I hope the explanation was clear.