Use _source fields from other index for highlighting

Hi,

I have documents that are grouped in families. I need to be able to search at the document level and at the family level. That is why I think I need two indexes: the first one will map each source document to an Elasticsearch document, and the second will map each family of source documents to an Elasticsearch document.
My question is: do I have to keep the _source field in both indexes? I have a lot of data and I don't want to duplicate it across the two indexes. I have to keep the _source field because I want to use partial updates on documents and highlighting, but I was wondering if it is possible to keep it in only one index (the family-level index or the document-level index). For example, if I keep _source in the document-level index and want to update a document, I can do a partial update and rebuild the family from the _source of its children (I'll probably have to write some code, but I think it won't be too complicated...)

The question is: can I implement the same behavior for highlighting? In other words, in order to highlight a family, am I able to implement a kind of "mapper" that will get the _source fields of the documents in the document-level index to highlight the results of my family index? I hope my explanation is clear :-).

Thanks

It is a little confusing.

I'm not sure what you mean by family but can you not just have one index for all your documents and have a field called "family" when you index the document? If you need to search in a family, then in your query, use a term filter and set the family.

I'm really not sure what you're trying to achieve with the highlighting, but I believe you can solve your problem with one index and without this duplication. If I'm mistaken, it'd be helpful if you could give an example.

Hi,

Thank you for your answer. The problem is that we won't be able to do cross-document search if we only have one index in which every document carries a family field identifying the family it belongs to. Let's consider the following example. We have two documents that belong to the same family. One of these documents contains "hard drive" in its title field. The other one contains "RAM" in its title field. We want to find a family that contains both "hard drive" and "RAM" in its children. That won't be possible if we use a single index (or maybe it will if we use a parent/child query, but then we'll have some performance penalty). That is why I want to flatten the data into a specific index, "family", where I will index "RAM" and "hard drive" in a multivalued title field and then be able to query that index when I want to search families. Is that clear enough :-S?

Thanks,

I'm not sure what you mean by "won't be able to do cross-document search".

You can use a bool query with should clauses for each thing you're searching. e.g.

POST /_search
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {"match": {"title": "Hard drive"}},
            {"match": {"title": "RAM"}}
          ],
          "minimum_should_match": 1
        }
      }
    }
  }
}

That will give you documents with either "hard drive", "RAM", or both. The ones with both will be at the top as they will have a higher score. If you know these documents should be tagged with "RAM" and/or "hard drive", why not add a multivalued field called tags to them too?

I don't see the need for a second index where the same data is indexed differently.

OK, I see what you mean. I will try to refine my example :-)

To simplify, let's consider the following documents:

{
  "id": "DocA1",
  "title": "hard drive",
  "family": "A"
}
{
  "id": "DocA2",
  "title": "RAM",
  "family": "A"
}
{
  "id": "DocB1",
  "title": "SSD",
  "family": "B"
}
{
  "id": "DocB2",
  "title": "RAM",
  "family": "B"
}

I want families that contain "hard drive" and "RAM". The only correct one in my example is family "A".
If I run your query, I'll get documents from family A AND family B. If I change minimum_should_match to 2, I won't get any of them, because no single document contains both terms...
That is why I think I need a second index to flatten the data...
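
To make the problem concrete, here is a minimal sketch in plain Python that mimics per-document matching on the four documents above. The `matches` helper is hypothetical: it is a case-insensitive substring test, not Elasticsearch's analyzed match query, and no scoring is modeled.

```python
# Toy model of per-document matching; not Elasticsearch's real analysis or scoring.
docs = [
    {"id": "DocA1", "title": "hard drive", "family": "A"},
    {"id": "DocA2", "title": "RAM", "family": "A"},
    {"id": "DocB1", "title": "SSD", "family": "B"},
    {"id": "DocB2", "title": "RAM", "family": "B"},
]
terms = ["hard drive", "RAM"]


def matches(doc, term):
    """Hypothetical stand-in for a match query: case-insensitive substring test."""
    return term.lower() in doc["title"].lower()


def families_matching(docs, terms, minimum_should_match):
    """Families of the documents that match at least `minimum_should_match` terms."""
    hits = [
        d for d in docs
        if sum(matches(d, t) for t in terms) >= minimum_should_match
    ]
    return sorted({d["family"] for d in hits})


either = families_matching(docs, terms, 1)
both = families_matching(docs, terms, 2)
print(either)  # ['A', 'B']
print(both)    # []
```

With a threshold of 1, both families come back; with a threshold of 2, nothing does, because each term lives in a different document.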

Thanks,

Are you looking for a family or are you specifying a family?

If you specify a family:

{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {"match": {"title": "Hard drive"}},
            {"match": {"title": "RAM"}}
          ],
          "minimum_should_match": 1
        }
      },
      "filter": {"term": {"family": "A"}}
    }
  }
}

If you are looking for families that documents fall into, then maybe still do the same query and do a terms aggregation for family and see which families have the highest count?

It's worth first simply modeling what it is that you are asking of that data and checking that it has no contradictions. "You want families of documents that contain "hard drive" and "RAM"" means you're doing a search for something of type family which is related to something of type document that has "hard drive" or "RAM" in its title. If this were in SQL and you had a family table and a document table, what would the query look like so that it gives you back only "A"?

Sorry, but I am not sure I understood everything you asked me to do. I'll try to answer anyway: please tell me if I missed something :-).

For your first question, the answer is definitely yes: I am looking for families.
In the example above, the result that I expect is just "A".
Family A contains one document whose title is "hard drive" and one document whose title is "RAM". For me, this means that family A contains "hard drive" and "RAM" in its title fields.
I am not sure that it is possible to do what I want in SQL (maybe that is also why I need a tool like ES :-) ).
Anyway, I know that if I create a second index (a "family" index), I will be able to do what I want. Let's consider again the example I used.
I have a first index (the one I called the "document index") with the documents:

{
  "id": "DocA1",
  "title": "hard drive",
  "family": "A"
}
{
  "id": "DocA2",
  "title": "RAM",
  "family": "A"
}
{
  "id": "DocB1",
  "title": "SSD",
  "family": "B"
}
{
  "id": "DocB2",
  "title": "RAM",
  "family": "B"
}

I can query this index if I want to search at the document level. Now I want to build a "family index" from these documents. I will have to aggregate these documents to create families. The result of this aggregation is the following two family documents, which I will be able to index in my "family index":

{
  "title": ["hard drive", "RAM"],
  "id": "A"
}
{
  "title": ["SSD", "RAM"],
  "id": "B"
}
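
The aggregation step itself could look roughly like the following sketch in plain Python. It only groups the titles of the example documents by their family field; the function name `build_family_docs` is made up for illustration, and indexing the result into the family index is left out.

```python
from collections import defaultdict

docs = [
    {"id": "DocA1", "title": "hard drive", "family": "A"},
    {"id": "DocA2", "title": "RAM", "family": "A"},
    {"id": "DocB1", "title": "SSD", "family": "B"},
    {"id": "DocB2", "title": "RAM", "family": "B"},
]


def build_family_docs(docs):
    """Group document titles by family, keeping the order in which titles appear."""
    titles_by_family = defaultdict(list)
    for doc in docs:
        titles_by_family[doc["family"]].append(doc["title"])
    # One flattened document per family, with a multivalued title field.
    return [
        {"id": family, "title": titles}
        for family, titles in sorted(titles_by_family.items())
    ]


family_docs = build_family_docs(docs)
for family_doc in family_docs:
    print(family_doc)
```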

In that index, it is easy to run this query:

{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {"match": {"title": "Hard drive"}},
            {"match": {"title": "RAM"}}
          ]
        }
      }
    }
  }
}

That returns "A", the result that I was looking for.
I hope it is clearer now :-S

Thank you.

So your question is whether you should keep your _source fields or not, right?

If you don't keep your _source, you have to make sure to explicitly store the fields you want. Also, should you need the original data for reindexing, you'd have to get it from somewhere else (the original data source). Nothing prevents you from doing this.

As for highlighting, assuming you do the final query, what would highlighting even look like? Your second index isn't duplicating the full titles, is it? It just has the short terms you're searching for so highlighting really isn't that useful in that case since what you'll be getting are family items.

Unless what you want to do is two queries. Do one query to get the family, then take that family and the titles and feed them into a second query against the first (document) index, using a filter on the family and a match on the titles there to get the highlighting from the document index.
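
A rough sketch of the two request bodies, built as plain Python dictionaries so they can be printed and inspected. The function names are made up for illustration, the field names come from the example earlier in the thread, and the term filter assumes that family is indexed as an exact (not analyzed) value. The `filtered` query follows the syntax used in this thread; newer Elasticsearch versions replaced it with a `bool` query that has a `filter` clause.

```python
import json


def family_query(terms):
    """Step 1: body for the family index; every term must match the multivalued title."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": term}} for term in terms]
            }
        }
    }


def highlight_query(family_id, terms):
    """Step 2: body for the document index, limited to one family, with title highlighting."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "bool": {
                        "should": [{"match": {"title": term}} for term in terms],
                        "minimum_should_match": 1,
                    }
                },
                # Assumes `family` holds an exact, non-analyzed value.
                "filter": {"term": {"family": family_id}},
            }
        },
        "highlight": {"fields": {"title": {}}},
    }


terms = ["hard drive", "RAM"]
step_one = family_query(terms)
step_two = highlight_query("A", terms)
print(json.dumps(step_one, indent=2))
print(json.dumps(step_two, indent=2))
```

The second body would be sent once per family returned by the first query, so the cost grows with the number of families shown on a result page.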

Sorry for the late reply.
Thank you for your advice.
If I run two queries (the first one to get the list of results from the "family" index, and the second one to highlight the terms that matched the first query with the help of the "document" index), I am afraid that the results won't be totally correct. In fact, I need the require_field_match functionality of the highlighter. If I use a second, separate query for highlighting, I won't be able to use require_field_match (or if I use it, it won't behave correctly, as the second query will be different from the first one). But maybe I misunderstood what you asked me to do.

Thank you